“Hey Siri…how can you hear me?”

In the context of this class, I may have spoken about the studio apartment I share with my three roommates – Siri, Alexa, and Google. Fascinated by the capabilities and affordances of each (they all fall under the category of “voice-activated digital assistants,” but each does something slightly different), I came to own all three. These assistants can also operate as home automation hubs, with the capacity to link everything from your lights to your doorbell and alarm system, bringing to mind the imminent dystopian future depicted in Geico’s “Future Son” commercial.

Of all these devices, I use the iPhone and HomePod the most for everyday AI interactions; both products run the same chatbot, or software agent: Siri. The concepts we have learned so far are a toolbox for de-blackboxing this technology and its unobservable layers. Let’s start with what is visible (other than the product itself): the UI or application, the top layer of the internet stack, is the only part humans can see. Behind it lie several layers of speech recognition, natural language processing, and data processing that boomerang an answer back to your request or question, all of it set in motion by the wake word “Hey Siri!” So how does the analog-to-digital conversion (and then digital-to-analog, in the case of alarms, lights, etc.) actually work? The microphone first digitizes your voice into a stream of samples; according to Apple’s Machine Learning Journal, a small, always-on Deep Neural Network (DNN) then converts that acoustic signal, frame by frame, into a probability distribution over speech sounds and accumulates those frame scores into a confidence value that the phrase “Hey Siri” was spoken.
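To make that pipeline a little more concrete, here is a minimal Python sketch of the idea (emphatically not Apple’s implementation): the microphone’s digital samples are sliced into short frames, each frame is reduced to a handful of acoustic features, a tiny stand-in “network” scores each frame, and the scores are smoothed over time into one confidence value compared against a threshold. Every number, weight, and function name below is a hypothetical stand-in for illustration.

```python
# A minimal, illustrative sketch (not Apple's implementation) of the wake-word
# pipeline: the ADC produces digital samples, short frames become acoustic
# features, a small network scores each frame, and the scores are smoothed
# over time into a single "Hey Siri" confidence value.
import numpy as np

SAMPLE_RATE = 16000   # samples per second after analog-to-digital conversion
FRAME_SIZE = 400      # ~25 ms analysis window
HOP_SIZE = 160        # ~10 ms step between frames
NUM_FEATURES = 13     # hypothetical feature dimension per frame
THRESHOLD = 0.8       # hypothetical trigger threshold

rng = np.random.default_rng(0)

def frame_features(samples: np.ndarray) -> np.ndarray:
    """Slice the digital waveform into frames and compute toy spectral features."""
    frames = []
    for start in range(0, len(samples) - FRAME_SIZE, HOP_SIZE):
        window = samples[start:start + FRAME_SIZE] * np.hanning(FRAME_SIZE)
        spectrum = np.abs(np.fft.rfft(window))
        # Stand-in for real filterbank features: log energy in coarse bands.
        bands = np.array_split(spectrum, NUM_FEATURES)
        frames.append(np.log1p([band.sum() for band in bands]))
    return np.array(frames)

# Stand-in for the trained acoustic DNN: random weights, sigmoid output.
W = rng.normal(size=(NUM_FEATURES, 1))

def frame_scores(features: np.ndarray) -> np.ndarray:
    """Per-frame score that the frame sounds like part of the trigger phrase."""
    return 1.0 / (1.0 + np.exp(-(features @ W).ravel()))

def wake_word_confidence(samples: np.ndarray) -> float:
    """Smooth per-frame scores over time into one confidence value."""
    scores = frame_scores(frame_features(samples))
    return float(np.convolve(scores, np.ones(20) / 20, mode="valid").max())

# One second of simulated microphone input.
audio = rng.normal(scale=0.1, size=SAMPLE_RATE)
if wake_word_confidence(audio) > THRESHOLD:
    print("Trigger: wake the main speech recognizer")
else:
    print("Keep listening")
```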

Siri’s voice capabilities on the iPhone, especially after last week’s unit, were mostly “de-blackboxed” for me. However, I was curious how Siri on my HomePod overcomes the myriad challenges it faces from itself (loud music playback) and from the surrounding environment: noise, television chatter, conversations, and so on. How can Siri hear me when I am yelling at it from the bathroom to turn off my alarm (he lives in the living room) while it’s playing my podcast? Apple describes this as a “far-field setting,” which it handles by combining multichannel signal processing techniques that suppress or filter out noise, including the echo of whatever the HomePod itself is playing. Apple’s Machine Learning Journal includes a helpful diagram of the process.
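Apple’s real far-field pipeline is far more sophisticated than anything I could reproduce here, but a toy delay-and-sum beamformer, one classic multichannel technique, illustrates the core idea: several microphones can be aligned toward the talker so that the voice adds up coherently while noise from other directions partially cancels. The microphone spacing, angles, and simulated signals below are invented for illustration.

```python
# A toy delay-and-sum beamformer (one classic multichannel technique; the
# actual HomePod pipeline layers on echo cancellation and other processing
# that is not reproduced here).
import numpy as np

SAMPLE_RATE = 16000
SPEED_OF_SOUND = 343.0                         # meters per second
MIC_POSITIONS = np.array([-0.05, 0.0, 0.05])   # hypothetical 3-mic line array (meters)

def delay_and_sum(channels: np.ndarray, angle_deg: float) -> np.ndarray:
    """Align each microphone channel toward angle_deg and average them.

    Sound arriving from that direction adds up coherently, while noise and
    chatter from other directions partially cancels out.
    """
    angle = np.deg2rad(angle_deg)
    # Arrival-time difference of a plane wave at each microphone.
    delays_sec = MIC_POSITIONS * np.sin(angle) / SPEED_OF_SOUND
    delays_samples = np.round(delays_sec * SAMPLE_RATE).astype(int)

    aligned = np.zeros_like(channels, dtype=float)
    for ch, delay in enumerate(delays_samples):
        aligned[ch] = np.roll(channels[ch], -delay)  # undo the arrival delay
    return aligned.mean(axis=0)

# Simulated 3-channel recording: the same voice signal with per-mic arrival
# delays plus independent room noise on each channel.
rng = np.random.default_rng(1)
voice = np.sin(2 * np.pi * 220 * np.arange(SAMPLE_RATE) / SAMPLE_RATE)
true_delays = np.round(
    MIC_POSITIONS * np.sin(np.deg2rad(30.0)) / SPEED_OF_SOUND * SAMPLE_RATE
).astype(int)
channels = np.stack([
    np.roll(voice, d) + rng.normal(scale=0.5, size=voice.size)
    for d in true_delays
])

enhanced = delay_and_sum(channels, angle_deg=30.0)
print("single mic residual: ", np.std(channels[1] - voice))  # center mic, noise only
print("beamformed residual: ", np.std(enhanced - voice))     # noise partly averaged out
```

Averaging the aligned channels keeps the voice intact while the independent noise on each microphone partly cancels out, which, in a very crude form, is the same principle behind the HomePod’s six-microphone array.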

The fact that my HomePod can, for the most part, accurately decode my requests under such different conditions is thanks to the processes above. It was helpful to learn and understand the behind-the-scenes magic instead of just assuming it works! As the Machine Learning Journal article puts it, “The next time you say ‘Hey Siri’ you may think of all that goes on to make responding to that phrase happen, but we hope that it ‘just works!’”

References

Hoy, Matthew B. (2018). “Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants.” Medical Reference Services Quarterly, 37(1), 81–88.

Siri Team. (2017). “Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant.” Apple Machine Learning Journal, 1(6).