Category Archives: Week 8

“Hey Siri…how can you hear me?”

In the context of this class, I may have spoken about the studio apartment I share with my three roommates – Siri, Alexa and Google. Fascinated by the capabilities and affordances of each (essentially they all fall under the category of “voice-activated digital assistants”, but each does something slightly different), I came to own all three. These assistants can also operate as home automation hubs, with the capacity to link everything from your lights to your doorbell and alarm system, bringing to mind the imminent dystopian future depicted in Geico’s “Future Son” commercial.

Out of all the devices, I use the iPhone and HomePod the most for everyday AI interactions; both run the chatbot, or software agent, Siri. The concepts we have learned so far are a toolbox that lets us de-blackbox the technology and its unobservable layers. Let’s start with what is visible (other than the product itself): the UI or application, the top layer of the stack, is the only part that humans can see. Behind it lie several layers of speech recognition, NLP and data processing that boomerang back an answer to your request or question, starting with the wake word “Hey Siri!”. So how does the analog-to-digital, then digital-to-analog (in terms of alarms, lights, etc.) conversion work? According to Apple’s Machine Learning Journal, the microphone turns your voice into a stream of digital waveform samples, and the “Hey Siri” detector then uses a Deep Neural Network (DNN) to turn the acoustic pattern of those samples into a probability distribution over speech sounds.

The voice capabilities of Siri on the iPhone were, especially after last week’s unit, mostly “de-blackboxed” for me. However, I was curious as to how Siri on my HomePod overcomes the myriad challenges it faces from itself (loud music) and the surrounding environment – noise, television chatter, conversations, etc. How can Siri hear me when I am yelling at it from the bathroom to turn off my alarm (it lives in the living room) while it’s playing my podcast? Apple describes this as a “far-field setting”, which is handled by integrating various multichannel signal processing technologies that suppress or filter out noise. Here is a helpful diagram of the process:

The fact that my HomePod is, for the most part, able to decode my requests accurately in different conditions is thanks to the above process. It was helpful to learn and understand the behind-the-scenes magic instead of just assuming it works! As the Machine Learning Journal article said, “next time you say ‘Hey Siri’ you may think of all that goes on to make responding to that phrase happen, but we hope that it ‘just works!’”


Hoy, Matthew B. (2018). “Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants”. Medical Reference Services Quarterly. 37 (1): 81–88.

Siri Team. “Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant”. Apple Machine Learning Journal, vol. 1, no. 6, 2017.

Tech Behind Siri

Tianyi Zhao

Siri, launched by Apple Inc. in 2011, is by now a familiar voice assistant. It simplifies navigating the iPhone and completing requests by listening to and recognizing our voice. For example, Siri can tell the weather forecast, call the user’s contacts, or even tell a joke. The technologies behind Siri are mainly speech recognition and natural language processing, two significant applications of machine learning.


Speech Recognition and Speaker Recognition

Speech recognition converts the acoustic signal of human speech into its corresponding textual form. It primarily examines “what the user says”. In addition to speech recognition, Siri also uses speaker recognition, which focuses on “who is speaking”, to achieve personalization. For instance, the user can simply say “Hey Siri” to invoke Siri, but the phrase is not meant to trigger it when someone other than the enrolled user says the same words. Speaker recognition involves two processes: enrollment and recognition. User enrollment occurs when the user follows the set-up guidance on a new iPhone. By asking the user to say several sample phrases, the device creates a statistical model of the user’s voice. The five sample phrases requested from the user are shown below in order:

  1. “Hey Siri”
  2. “Hey Siri”
  3. “Hey Siri”
  4. “Hey Siri, how is the weather today?”
  5. “Hey Siri, it’s me.”

Figure 1.  Block diagram of Personalized Hey Siri


The figure shows how Personalized Hey Siri proceeds. Within Feature Extraction, the acoustic input is converted into a fixed-length speaker vector that captures the phonetic information, background information about the environment, and the user’s identity. The speaker’s characteristics are then emphasized and other factors – such as phonetic and environmental ones – are deemphasized, so that recognition stays accurate across circumstances. The five sample phrases thus generate five speaker vectors, which are stored in the user profile on each Siri-enabled device.
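To make the idea of a stored profile of speaker vectors more concrete, here is a minimal sketch of how a new utterance might be compared against the five enrolled vectors. Everything in it is an assumption for illustration: the three-dimensional toy vectors, the function names, the use of average cosine similarity, and the 0.8 threshold are not taken from Apple’s article.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def speaker_score(profile, utterance_vector):
    """Average similarity between a new utterance and every enrolled vector."""
    sims = [cosine_similarity(v, utterance_vector) for v in profile]
    return sum(sims) / len(sims)

def is_enrolled_speaker(profile, utterance_vector, threshold=0.8):
    """Accept the utterance only if it is close enough to the stored profile.

    The threshold is a made-up illustrative value.
    """
    return speaker_score(profile, utterance_vector) >= threshold

# Hypothetical profile: one toy vector per enrollment phrase.
PROFILE = [
    [0.90, 0.10, 0.20],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.20],
    [0.90, 0.20, 0.15],
    [0.80, 0.10, 0.25],
]

print(is_enrolled_speaker(PROFILE, [0.88, 0.12, 0.18]))  # a voice close to the profile
print(is_enrolled_speaker(PROFILE, [0.10, 0.90, 0.30]))  # a very different voice
```

Real speaker vectors have far more dimensions, but the principle of comparing a new vector against the stored ones is the same.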

Natural Language Processing

Figure 2. Deep Neural Network in Siri


After Siri understands what the user is saying, the converted text is sent to Apple servers, where further natural language processing algorithms examine the intent of the user’s words. Figure 2 shows how the Deep Neural Network (DNN) works in Siri. The DNN “consists mostly of matrix multiplications and logistic nonlinearities. Each ‘hidden’ layer is an intermediate representation discovered by the DNN during its training to convert the filter bank inputs to sound classes. The final nonlinearity is essentially a Softmax function.” (Siri Team, 2017)
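The quoted description (matrix multiplications, logistic nonlinearities, and a final softmax) can be sketched in a few lines. This is only a toy forward pass under stated assumptions: the layer sizes and weights below are invented, and the function names are hypothetical, not Apple’s.

```python
import math

def sigmoid(x):
    """Logistic nonlinearity applied to a single value."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """Matrix-vector product plus bias; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, hidden_layers, output_layer):
    """Run features through sigmoid hidden layers, then a softmax output layer."""
    activation = features
    for weights, biases in hidden_layers:
        activation = [sigmoid(z) for z in dense(activation, weights, biases)]
    out_weights, out_biases = output_layer
    return softmax(dense(activation, out_weights, out_biases))

# Invented toy network: 2 input features -> 3 hidden units -> 3 sound classes.
HIDDEN = [([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, -0.1])]
OUTPUT = ([[1.0, -1.0, 0.5], [0.2, 0.3, -0.5], [-0.7, 0.6, 0.1]], [0.0, 0.0, 0.0])

probabilities = forward([0.6, -0.4], HIDDEN, OUTPUT)
print(probabilities)
```

In the real system the input would be filter bank features and the output would cover about 20 sound classes, but the shape of the computation is the same.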


Works Cited

Alpaydin, Ethem. Machine Learning: the New AI. The MIT Press, 2017.

Siri Team. “Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant.” October 2017.

Siri Team. “Personalized Hey Siri.” April, 2018.

Aman Goel. “How Does Siri Work? The Science Behind Siri.” Magoosh. Feb. 2, 2018.



A Brief History of the Future: Virtual Assistant Technologies

Back in the late 90s, my uncle proudly pulled out his flip phone at a family reunion to show me– and whoever else would listen– the “future of tech.” He proceeded to shout very limited voice commands (“CALL…..BOB!”), which the phone could register but often got wrong (“Calling…Mom”).

Fast forward a few years, and I remember getting the RAD 4.0 robot for Christmas. The TV commercials made that toy seem like a perfect robot companion and servant, like Rosey from the Jetsons or Goddard from Jimmy Neutron. RAD could respond to voice commands, move autonomously (or with a remote control), and had robot arms with clamps to pick up your toys, laundry, soda cans, etc. It even came equipped with NERF-style bullet launchers on its chest for security measures! However, after testing it out around the house, I remember being a little underwhelmed by its efficiency. I wore myself out yelling repeated commands until it would respond with an action that was usually not exactly what I had commanded. Below you can see RAD’s design and its simplistic “speech tree chart”, which outlines all the verbal cues it could (supposedly) respond to.











Even as a 10-year-old kid, I understood that Natural Language Processing technology wasn’t yet advanced enough to accurately understand more than a handful of commands. But I was patient, and a few years later I came across the chatbot SmarterChild, which was developed by Colloquis (acquired by Microsoft in 2006) and released on early instant messaging platforms like AIM and MSN Messenger (Gabriel, 2018). While entirely text-based (not voice-activated), SmarterChild was able to play games, check the weather, look up facts, and converse with users to an extent. One of its more compelling canned responses came if you asked about sleep:


This was about the same time that the movie I, Robot (Proyas, 2004) came out, which contained another (somewhat chilling) quip about robots dreaming and the future of artificial intelligence:

Detective Spooner: Robots don’t feel fear. They don’t feel anything. They don’t get hungry, they don’t sleep-
Sonny the Robot: I do. I have even had dreams.
Spooner: Human beings have dreams. Even dogs have dreams, but not you. You are just a machine; an imitation of life. Can a robot write a symphony? Can a robot turn a… canvas into a beautiful masterpiece?
Sonny: [with genuine interest] Can you?

Over the next decade, AI began to evolve at an unprecedented pace. Nowadays, Google Assistant has a much more complex algorithmic process (see below) for decoding language than my old friend RAD 4.0, and can provide much more natural and sophisticated interaction than SmarterChild.









These virtual assistant technologies haven’t been without hiccups in their integration, such as when I first got the updated version of the iPhone with Siri included. I remember ordering at a Taco Bell drive-thru while my phone was in the cupholder of the car. My order included a “quesarito” (pronounced “K-HEY-SIRI-TOE”), and when I got home I realized that Siri had “woken up” in the drive-thru and was running searches on everything that was said on the car radio during the drive back. It’s incidents like these, and many others with far more sensitive or compromising information at stake, that have given people concerns about our virtual assistants always listening. But Apple has recognized these concerns and has taken steps to reduce them, such as two-pass detection, personalized “Hey Siri” trigger phrases, and cancellation signals for common pronunciation similarities, such as “Hey, seriously” (Siri Team, 2017).

Now, building on its popular Echo devices and the Alexa assistant, Amazon is rolling out programs like Amazon Lex, where the general public can create their own conversational voice and text interfaces for their websites, apps, and other technologies (Barr, 2017). This is a huge step for the integration of AI, machine learning, and deep neural networks into the public sphere, making them accessible on a much wider scale, beyond the computer scientists in Silicon Valley.

The big question that comes to mind, as always, is what’s next? Although most of the above evidence is anecdotal, it does show a massive progression in the field of artificial intelligence over the past 20 years. Will the evolution of virtual assistant technologies continue to accelerate on the back of rapid progress in fields like machine learning and natural language processing? Where does it end? Will we become too dependent on these technologies? If so, what if they fail? Will there eventually be a cultural backlash?

“Hey Siri, what will the future look like?”

Barr, J. (2016, November 30). Amazon Lex – Build Conversational Voice & Text Interfaces. Retrieved from
Barr, J. (2017, April 19). Amazon Lex – Now Generally Available. Retrieved from
Proyas, A. (2004). I, Robot. Retrieved from
Siri Team. (2017, October). Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant. Retrieved from
Gabriel, T. (2018, February 27). The Chatbot Revolution is Here. Retrieved from

Virtual Assistants and their Personalities

Whether or not a person has had direct experience with virtual assistants, many can describe the voice and even some of the quirky comebacks that virtual assistants can give to a user. This is largely due to the integration of virtual assistants into popular culture and media. However, how virtual assistants operate is widely unknown, apart from the need to be connected to the internet to work properly. A deeper look at how virtual assistants work shows that the voice speaking to the virtual assistant is the input, which is converted into a sequence of frames. A deep neural network then processes the input, assessing the probability that it matches existing sequence patterns, and the system produces an output. The output can be an answer to a question, an action, or a negative response indicating an error with the initial input.

It is interesting that, regardless of how advanced virtual assistants may seem, the input needs to follow an existing set of rules in order to produce a positive output. These rules are based on grammar rules as well as common phrases. The negative outputs have given developers room to creatively present the “personality” of the assistant. This can be clearly demonstrated when an iPhone user asks Siri what zero divided by zero is, or when an Amazon Echo user asks Alexa where to buy a Google Home. The designed personalities of virtual assistants seem to be the main product differentiators across brands. As we know from several classes, the actual operations of these systems are the same; it is the brand that makes each virtual assistant seem unique.

Apple Machine Learning Journal (1/9, April 2018): “Personalized ‘Hey Siri’.”

Hey Google… What Are You?

When deblackboxing a speech-activated virtual assistant application like Google Home, you begin to see some parallels between it and other virtual assistant applications (like Siri). Using a mix of structured and unstructured data, Google Home’s machine learning processes take note of the information we provide, and through machine learning/convolutional neural networks, Google Home begins to accommodate and adapt to the primary user of the virtual assistant.

The structured data can come from direct sources of information. Google Home has a functionality where users can use typed input for commands and receive visual responses (Google Assistant, Wikipedia), which can constitute direct data. Additionally, the information Google Home collects through direct verbal actions is a direct form of data, which would then be logged both for machine learning purposes and for future predictive interactions on behalf of Google Home. As for unstructured data, Google Home surely collects data from indirect forms of communication that the user conducts with any account linked to the Google Home. This could mean your email, texts, contacts, Spotify, YouTube… essentially any device or application that you link with your Google Assistant (Google Assistant, Google).

Based on the patent for an intelligent automated assistant, the two inputs – user input and other events/facts – correspond to the direct and indirect, structured and unstructured data that Google Home both listens to and records. From there, the virtual assistant application breaks the requested input/command into groups to determine what is being said, what needs to be done (in the most efficient manner based on action patterns), how it will be done, and what will be said (Intelligent Automated Assistant, Google Patents). Once the virtual assistant determines all of that within seconds, the initial requested input is turned into output in the form of words and actions. The patent application also describes the “parts” of a virtual assistant: input, output, storage, and memory – which are the four core “interactions” – followed by the overall processor that decodes and recodes the input, and lastly the overall machine itself, which is the intelligent automated assistant. It’s important to recognize that all parts of a virtual assistant work together in a network to achieve the common goal at hand.
That’s what makes it an intelligent machine learning service.

Work Cited:

“Hey Siri” – the DNN Acoustic Model

In our work de-blackboxing Google Translate, we learned that the DNN’s job is to solve a problem. In the case of Google Translate, the problem was to translate a phrase or sentence from one language (input) to another (output). In the case of a voice-triggered personal assistant, the DNN needs to decode a voice command (input) so that the assistant can perform a task or answer a question (output). Google Translate needed a Recurrent Neural Network; Apple’s Siri needs a DNN Acoustic Model.

Layers / Process of Siri Voice Trigger 

(If using Siri on an iPhone)

  1. The microphone in your phone converts the sound of your voice into waveform samples 
  2. Spectrum analysis stage converts the waveform to a sequence of frames 
  3. About 20 frames at a time are fed to the Deep Neural Network (DNN)
  4. Then, “The DNN converts each acoustic pattern into a probability distribution over a set of speech sound classes: those used in the ‘Hey Siri’ phrase, plus silence and other speech, for a total of about 20 sound classes” (Siri Team, 2017).
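Steps 1 to 3 above can be sketched roughly as follows. This is a simplified illustration under several assumptions: a synthetic sine wave stands in for microphone samples, a single log-energy value per frame stands in for the real spectrum analysis, and the frame length, hop size and function names are invented for the example.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split waveform samples into overlapping frames, advancing by `hop` samples."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]

def log_energy(frame):
    """One number per frame; a crude stand-in for real spectral features."""
    return math.log(sum(s * s for s in frame) + 1e-10)

def sliding_windows(features, window=20):
    """Group consecutive frames so that about 20 at a time can go to the DNN."""
    return [features[i:i + window] for i in range(len(features) - window + 1)]

# Step 1 (assumed): a synthetic waveform in place of microphone samples.
samples = [math.sin(2 * math.pi * 5 * n / 2000) for n in range(2000)]

# Step 2: waveform -> sequence of frames -> one feature per frame.
frames = frame_signal(samples, frame_len=100, hop=50)
features = [log_energy(f) for f in frames]

# Step 3: windows of 20 frames, each of which would be one DNN input.
windows = sliding_windows(features, window=20)
print(len(frames), len(windows))
```

Each window would then be handed to the DNN in step 4, which returns a probability for each sound class.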


(Image retrieved from:

According to Apple’s Machine Learning Journal article, an iPhone uses two networks (1. Detection, 2. Secondary Checker).

5. The phrase is detected when the outputs of the acoustic model give a high enough phonetic score for the target phrase. This process is further solidified through training: over time, the more often a phrase is detected accurately, the more valid the sequence becomes. This process is shown in the top layer of the image above as a recurrent network with connections to the same unit and the next in sequence (Siri Team, 2017).
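The two-network idea (a detector plus a secondary checker) can be sketched as a pair of thresholds. This is a deliberately simplified illustration: averaging per-frame probabilities is an assumption standing in for the real score accumulation, and the thresholds and function names are made up.

```python
def average_score(frame_probs):
    """Mean per-frame probability that the target phrase is being spoken."""
    return sum(frame_probs) / len(frame_probs)

def two_pass_trigger(first_pass_probs, run_second_pass,
                     first_threshold=0.6, second_threshold=0.85):
    """Return True only if both the cheap detector and the checker agree.

    first_pass_probs: per-frame scores from the small, always-on network.
    run_second_pass: a function that runs the larger network and returns
                     its per-frame scores; it is only called when needed.
    """
    # Pass 1: a cheap check that rejects most audio immediately.
    if average_score(first_pass_probs) < first_threshold:
        return False
    # Pass 2: pay for the larger network only after pass 1 succeeds.
    return average_score(run_second_pass()) >= second_threshold

# Hypothetical scores: the first pass is fairly sure, the second confirms.
print(two_pass_trigger([0.7, 0.8, 0.9], lambda: [0.9, 0.95, 0.9]))
# A "quesarito"-style false alarm: pass 1 fires, but pass 2 disagrees.
print(two_pass_trigger([0.7, 0.7, 0.7], lambda: [0.5, 0.6, 0.4]))
```

The design point is that the expensive check runs only on audio the cheap detector already finds suspicious.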

The “hidden” layers of the DNN are the representations it learns during training for mapping acoustic patterns (input) to sound classes (output).

To recreate Siri’s voice trigger system, the main components we would need are:

Hardware, Software and Internet Services

  1. A device with Internet connection (phone, smartwatch, bluetooth device)
  2. A microphone 
  3. Detector
  4. An Acoustic Input (voice)
  5. Server (can provide updates to acoustic models)
  6. Deep Neural Network — 2 networks : 1. Detection 2. Second Pass
  7. Training Process for the DNN
  8. Motion Coprocessor (so the device can keep listening for the voice trigger without using up battery life)
  • Note: I have further questions about whether additional components listed in the below diagrams are a part of the above main features or if they need to be included as separate entities

This image looks at the active speech input procedure as a flow chart and includes the process of ranking interpretations for semantic relevance (process mentioned above) – this was also a key feature in the Google Translate process.

(image retrieved from:

Description of Automated Assistant from Apple Patent

“The conversation interface, and the ability to obtain information and perform follow-on task, are implemented, in at least some embodiments, by coordinating various components such as language components, dialog components, task management components, information management components and/or a plurality of external services” (Gruber et al., 2012).

This quote is expressed in a useful image below – and helps to visualize the coordination of the components mentioned above.

(image retrieved from:



Siri Team. “Hey Siri: An On-Device DNN-Powered Voice Trigger for Apple’s Personal Assistant – Apple.” Apple Machine Learning Journal,
Gruber, Thomas Robert, et al. Intelligent Automated Assistant. US20120016678A1, 19 Jan. 2012,

De-blackbox the Algorithms of Netflix Recommendation

What interests me most in this week’s reading is the “recommendation system”, because it has become commonplace in a variety of areas, such as Facebook news, Instagram, music apps, etc. According to a recent survey conducted by Pew Research Center, most U.S. respondents confirmed that their social media could accurately define their key characteristics, such as hobbies and interests. In fact, I am increasingly surprised by how well my cell phone knows me, recommending new videos and music that amazingly fit my taste.

According to Wikipedia, recommendation systems typically produce a list of recommendations in one of two ways: through collaborative filtering or through content-based filtering. I would like to use Netflix as an example to explain how recommender systems work. Netflix actually combines the two methods. The website makes personal recommendations by comparing the watching and searching habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering) (Wikipedia).

The video below gives a more detailed explanation of how the Netflix recommendation system works. In the example, a huge matrix is factorized, based on 2,000 users’ previous ratings and the features of 1,000 movies, with a kind of training model, so that each user can be recommended the movies that best fit their preferences. Many calculations and error corrections are carried out continuously in the process. To some extent, Netflix might know our movie preferences better than we do.
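To show what “factorizing a rating matrix with repeated error correction” can look like, here is a minimal sketch using stochastic gradient descent on a tiny made-up dataset. It is not Netflix’s system: the ratings, the two latent factors, the learning rate and the function names are all assumptions for illustration.

```python
import random

def predict(user_vec, item_vec):
    """Predicted rating = dot product of user and movie factor vectors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

def rmse(ratings, P, Q):
    """Root mean squared error over the known ratings."""
    total = sum((r - predict(P[u], Q[i])) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5

def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.01, reg=0.02, epochs=500, seed=0):
    """Learn user factors P and movie factors Q from (user, movie, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.random() * 0.5 for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.random() * 0.5 for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P[u], Q[i])  # how wrong the current guess is
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Nudge both vectors to shrink the error (with a small penalty
                # that keeps the factors from growing too large).
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# Invented ratings: (user index, movie index, stars). Some pairs are missing.
RATINGS = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 2, 4)]

P, Q = factorize(RATINGS, n_users=4, n_items=3)
print("training error:", rmse(RATINGS, P, Q))
print("guess for user 0 on unseen movie 2:", predict(P[0], Q[2]))
```

The interesting part is the last line: once the factors are learned, the model can guess a rating for a movie the user has never rated.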



Context is Everything

The traditional voice of a computer – that electronic, crackling voice that you can almost feel the circuitry in – is a Frankenstein of phonemes that a program could piece together to create an illusion of a semi-auditory computer interface. By creating these phoneme mashups, computers could speak to the user. But, this interface feels markedly inorganic. As humans speak to each other, the intonation and tone that they use to inflect certain phonemes changes given the context of the phoneme. An “R” sound that I make when I exclaim, “Great!” sounds different from the “R” that concludes the interrogative “Can you hand over that weed-whacker?”

With a digitized catalogue of every phoneme, you might imagine that the issue of conversationally interfacing with computers would be solved. In my mind, the process of digitally “hearing” and synthesizing a voice would be the most difficult within a conversational interface. But, of course, from my human vantage point, the digitization is the hardest part. Humans have a capacity to intuitively understand language that is often taken for granted. When I read a sentence, I don’t have to sit for a second to parse through the relationships that each word has with the sentence as a whole and /then/ consider how that sentence as a whole interacts with the sentences and world around it. I just intuitively understand the sentence.

Early chatbots and vocalizers could give the illusion of “natural conversation.” But, early chatbots were simply branching algorithms whose responses were predetermined and depended on proper inputs. To achieve actual conversational interface, computers would need to understand the context and “meaning” of what was said to them.
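A branching chatbot of this kind is easy to sketch. The keywords and canned replies below are invented for illustration and are not taken from any real chatbot.

```python
# Each rule pairs a keyword with a predetermined reply (all hypothetical).
RULES = [
    ("hello", "Hi there!"),
    ("weather", "I hear it is sunny somewhere."),
    ("joke", "Why did the robot cross the road? It was programmed to."),
]
FALLBACK = "Sorry, I don't understand."

def reply(message):
    """Return the first canned response whose keyword appears in the message."""
    text = message.lower()
    for keyword, response in RULES:
        if keyword in text:
            return response
    # No branch matched: the bot has no idea what was meant.
    return FALLBACK

print(reply("Hello!"))
print(reply("What do dinosaurs dream about?"))
```

The second message shows the limitation: anything outside the predetermined branches falls through to the fallback, because nothing in the program represents meaning.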

Google uses “The Knowledge Graph” to semantically link together information across the web. Instead of “Apple” only existing as a simple string of letters, it is now, in Google’s eyes a symbol that is linked to “seeds” and “computers” and “pies.”

Image source: Digital Reasoning

As the web became an increasingly gigantic web of semantic information linked together in “meaningful relationships,” the contextual nature of conversationally transmitted data became easier and easier for a computer to understand (Crash Course Video). Learning from a dataset that links “dinosaur” with “extinction” and “triceratops” and “cute” helps give natural language processing a better grasp of how humans use language.
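The idea of a word existing as a node with linked concepts, rather than as a bare string, can be sketched with a small dictionary. This is only a toy under stated assumptions: the entries come from the examples in this post (plus one invented link, “horns”), and the structure says nothing about how Google actually stores its Knowledge Graph.

```python
# Toy graph: each entity maps to the set of concepts it is linked to.
GRAPH = {
    "apple": {"seeds", "computers", "pies"},
    "dinosaur": {"extinction", "triceratops", "cute"},
    "triceratops": {"dinosaur", "horns"},
}

def related(graph, entity):
    """Concepts directly linked to an entity (empty set if unknown)."""
    return graph.get(entity, set())

def connected(graph, a, b):
    """True if either entity lists the other as a related concept."""
    return b in related(graph, a) or a in related(graph, b)

print(related(GRAPH, "apple"))
print(connected(GRAPH, "dinosaur", "triceratops"))
print(related(GRAPH, "weed-whacker"))  # unknown entity: no context available
```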

Deblackboxing Siri as a virtual assistant

Virtual assistants are an emerging topic in the field of artificial intelligence. A virtual assistant can perform tasks for its users based on verbal commands. It is normally embedded in digital devices like smartphones, personal computers, and smart speakers. Apple’s Siri is one of them. Siri is voice-activated by a personalized “Hey, Siri” and then provides information or performs tasks as commanded. The procedure is composed of various layers, each responsible for a specific task or tasks. It is clearer to deblackbox it layer by layer.

According to Apple’s Patent Application for “An intelligent automated assistant system”, a system for operating an intelligent automated assistant includes

  • one or more processors that start with the Detector

The Deep Neural Network (DNN) is used to detect “Hey Siri.” First, the microphone turns your voice into a stream of waveform samples, and then these waveform samples are converted to a sequence of frames through spectrum analysis. DNN converts each of these acoustic patterns into a probability distribution. “Hey Siri” can be detected if the outputs of the acoustic model fit the right sequence for the target phrase. After Siri is activated, it can perform tasks as requested.

  • memory storing instructions that cause the processors to perform operations, including
  • obtaining a text string from a speech input received from a user

For example, if I want my iPhone to call my Mom while I am driving, I would say “Hey, Siri” to activate Siri, and then say “call Mom” to give a command. Through speech recognition, my speech will be turned into a text string that can be processed by the processor.

  • interpreting the received text string to derive a representation of user intent

Through NLP, the processor interprets “call Mom” as an instruction to dial the person who is listed as “Mom” in the contacts.

  • identifying at least one domain, a task, and at least one parameter for the task, based at least in part on the representation of user intent

After interpretation, this layer links my instruction to the “Phone” domain and opens the “Phone” function.

  • performing the identified task

My iPhone calls “Mom” using the phone number I saved in the contacts.

  • provide an output to the user, wherein the output is related to the performance of the task.

The procedure above is a simplified version of how Siri receives and performs our verbal instructions. Each layer, notably, contains further complicated nested layers that are still waiting to be deblackboxed.
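The “call Mom” walkthrough can be sketched as a tiny pipeline from text string to intent to action. This is a toy under loud assumptions: the regular expression, the intent dictionary, the contact list and the phone number are all hypothetical, and real intent interpretation is far more involved.

```python
import re

# Hypothetical contact list with a made-up number.
CONTACTS = {"mom": "555-0100"}

def interpret(text):
    """Derive a representation of user intent (domain, task, parameters) from text."""
    match = re.fullmatch(r"call (\w+)", text.strip().lower())
    if match:
        return {
            "domain": "Phone",
            "task": "dial",
            "parameters": {"contact": match.group(1)},
        }
    return None  # the text did not fit any known pattern

def perform(intent, contacts):
    """Carry out the identified task and return the output shown to the user."""
    if intent is None:
        return "Sorry, I didn't understand that."
    name = intent["parameters"]["contact"]
    number = contacts.get(name)
    if number is None:
        return f"I couldn't find {name} in your contacts."
    return f"Calling {name} at {number}"

intent = interpret("Call Mom")  # text string obtained from speech recognition
print(intent)
print(perform(intent, CONTACTS))
```

Each function here stands in for one of the layers above, and each would itself hide many nested layers in the real system.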



Apple Machine Learning Journal (1/9, April 2018): “Personalized ‘Hey Siri’.”

Google Assistant

Her (Jonze, 2013) installing OS1/Samantha

This week’s focus is on A.I. and specifically virtual assistants. As a fan of cinema, sci-fi, and representation of technology in the moving image, I can’t help but think of a few examples such as Her (Jonze, 2013), A.I. Artificial Intelligence (Spielberg, 2001), Ex Machina (Garland, 2015), Blade Runner (Scott, 1982), Minority Report (Spielberg, 2002), 2001: A Space Odyssey (Kubrick, 1968), and the list goes on and on.

I must confess that I’m not a big fan of voice recognition virtual assistants. I don’t have an Amazon Echo or Google Home, and I’ve deactivated the “listen for Hey Siri” option on my iPhone. Digging deeper into the reasons for my dislike, I’ve come to the conclusion that it has to be because I was exposed to all these dystopian films before being given the tools to actually understand how the technology works. These fictional representations often present these technologies as exaggerated or distorted, with some ‘truth’ at their core. Watching these films doesn’t necessarily prevent me from de-blackboxing AI or voice recognition virtual assistants, but it definitely provides a filter through which we can understand not only how they work but how users understand and interact with them.

While reading through the Google Assistant patent I was surprised at finding that, although most of the specifications are too technical for my understanding, the main description of its use and purpose was very accessible and even more clarifying than most attempts from articles to ‘unveil’ the mystery to the reader.

The patent reads:

“According to various embodiments of the present invention, intelligent automated assistant systems may be configured, designed, and/or operable to provide various different types of operations, functionalities, and/or features, and/or to combine a plurality of features, operations, and applications of an electronic device on which it is installed.”

Based on this excerpt, the patent describes the system as an intermediary between the user and many possible outcomes/actions that are already available in the devices, accessible through different modes of interaction.

If we look into the different levels/layers/steps into how Google Assistant works, the patent describes:

  • “…actively eliciting input from a user,
  • interpreting user intent,
  • disambiguating among competing interpretations,
  • requesting and receiving clarifying information as needed,
  • and performing (or initiating) actions based on the discerned intent.”

Those actions can range from activating to interacting with other applications and services already on the device or accessible through the Internet: it can perform a Google search on your question and provide answers, it can activate Google Maps or Spotify, and it can perform e-commerce interactions such as buying things on Amazon, among others.

Some of the language used through the description in the patent was interesting to me. At one point it says, “[thanks to the assistant] The user can thereby be relieved of the burden of learning what functionality may be available on the device and on web-connected services, how to interface with such services to get what he or she wants, and how to interpret the output received from such services; rather, the assistant… can act as a go-between between the user and such diverse services.”

Oh, to be relieved of the burden of learning how something works. This [insert any technology here] makes life so much easier that we shouldn’t concern ourselves with the technicalities of how it works.

I will admit that the benefits of voice recognition virtual assistants are massive for different communities and fields of work. The patent describes in detail how this serves people with disabilities and users who work handling machinery and cannot interact with devices at the same time without shifting their attention, which could be dangerous. It is not just for work: a great example is making a call or searching for something while driving.

Although all of this is true and valid, it must be acknowledged that it also opens the door to many vulnerabilities and security issues for users, as many technologies do: cases of stolen identity, e-commerce fraud, home security breaches, child protection issues, scams, etc. Last year, the New York Times published an article on research from various US and Chinese universities into malicious uses of these technologies, specifically: “Berkeley researchers published a research paper that went further, saying they could embed commands directly into recordings of music or spoken text. So while a human listener hears someone talking or an orchestra playing, Amazon’s Echo speaker might hear an instruction to add something to your shopping list.”

Therefore, there should be concern. Not the dystopian sci-fi fear of technology taking over, but concern about humans using these technologies to take advantage of users. As much as I love/hate the greatest villain in film (in my humble opinion), HAL 9000, I admit that the threat of an embedded hidden command that I cannot hear but Echo can seems exponentially more terrifying.

2001: A Space Odyssey (Kubrick, 1968). HAL 9000