Category Archives: Week 8

It is Still for the Specific Tasks

Google Assistant, like other virtual assistants, works like a shortcut to parts of an app (App Actions Overview | Google Developers, n.d.). I can activate Google Assistant by saying “Hey Google” and ask it to play a movie on my phone. I can also ask it to book a restaurant or add a memo. Viewed from outside the black box, the user activates the assistant and gives it an unstructured command. The assistant then analyzes the words and sends a request to a specific app to get the right answer or action.

Fig1. data flow outside the blackbox – Source from App Actions Overview | Google Developers, n.d.

What’s in the black box? First, the questions or commands spoken by users are transformed into text (a human-readable representation). This process is called Automatic Speech Recognition (ASR). The user’s speech is first stored in a FLAC or WAV file and transmitted to Google’s servers. There, the audio undergoes signal processing and feature extraction and is encoded into vectors. ASR then uses a trained acoustic model and a trained language model to score candidates, combines the two scores to search over candidate transcripts, and picks the best-scoring one as the recognition result. After decoding this result, we finally get the text corresponding to the speech.
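The score combination described above can be sketched roughly as follows. This is only an illustration, not Google's implementation: the candidate transcripts, the log-probability values, and the `lm_weight` parameter are all made-up assumptions.

```python
# Hypothetical candidate transcripts with made-up log-probability scores.
# "acoustic": how well the audio matches the words.
# "language": how plausible the word sequence is as a sentence.
candidates = {
    "play a movie": {"acoustic": -4.0, "language": -2.0},
    "play a moving": {"acoustic": -3.8, "language": -6.5},
    "lay a movie": {"acoustic": -4.5, "language": -7.0},
}


def combined_score(scores, lm_weight=1.0):
    """Add the acoustic log-score and the weighted language-model log-score."""
    return scores["acoustic"] + lm_weight * scores["language"]


def best_candidate(candidates, lm_weight=1.0):
    """Return the transcript with the highest combined score."""
    return max(candidates, key=lambda text: combined_score(candidates[text], lm_weight))


print(best_candidate(candidates))
```

With these invented numbers, the acoustic model alone slightly prefers "play a moving", and it is the language model score that tips the search toward "play a movie".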

Second, since the user’s query might be unstructured, the text has to be turned into a structured query and routed to the right model. Unstructured here means that people have many different ways of asking for the same thing. For example, “how’s the weather today” and “what is today’s weather forecast” both ask for the same information but are worded differently. For this step, the NLP component uses language pattern recognizers to map the text against vocabulary databases, finds semantic matches, and ranks all the candidates to find the most likely one. After that, Google Assistant can hand the result to a specific task model, such as a domain model, task flow model, or dialog flow model.

Fig2. NLP procedure –  Source from Gruber et al., 2017
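As a rough sketch of how differently worded questions can be mapped to one structured intent, the toy matcher below ranks intents by how many patterns they match. The intent names and patterns are hypothetical, and a real assistant relies on trained models rather than hand-written rules like these.

```python
import re

# Hypothetical intent patterns, for illustration only.
INTENT_PATTERNS = {
    "get_weather": [r"\bweather\b", r"\bforecast\b"],
    "play_media": [r"\bplay\b", r"\bwatch\b"],
}


def rank_intents(text):
    """Score each intent by how many of its patterns appear in the text, best first."""
    text = text.lower()
    scores = {
        intent: sum(1 for pattern in patterns if re.search(pattern, text))
        for intent, patterns in INTENT_PATTERNS.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def classify(text):
    """Return the top-ranked intent, or None if no pattern matched at all."""
    intent, score = rank_intents(text)[0]
    return intent if score > 0 else None


print(classify("how's the weather today"))
print(classify("what is today's weather forecast"))
```

Both example questions from the text end up with the same `get_weather` intent, which is the point of turning unstructured wording into a structured query.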

Third, the output depends on the models’ results. “When a user’s query matches the predefined pattern of a built-in intent, Assistant extracts query parameters into entities and generates an Android deep link URL” (App Actions Overview | Google Developers, n.d.). In other words, based on the user’s command, Google Assistant returns a result that people can understand and that meets the request. If you want to watch an adventure movie, it might open the Netflix app or just give you a list of adventure movies. Which output you get depends on whether the Netflix app integrates with Google Assistant through its API. It is worth mentioning that Google Duplex can help users book a restaurant, or make similar arrangements, by automatically talking to staff over a phone call. “At the core of Duplex is a recurrent neural network (RNN) designed to cope with these challenges, built using TensorFlow Extended (TFX)” (“Google Duplex,” n.d.).
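To make the idea of a deep link built from extracted parameters more concrete, here is a small sketch. The `exampleapp://browse` scheme and the `genre` and `type` parameter names are invented for illustration; they are not taken from Netflix or from the App Actions documentation.

```python
from urllib.parse import urlencode


def build_deep_link(base_url, parameters):
    """Append the extracted query parameters to an app's deep link base URL."""
    return f"{base_url}?{urlencode(parameters)}"


# Hypothetical parameters extracted from "I want to watch an adventure movie".
link = build_deep_link("exampleapp://browse", {"genre": "adventure", "type": "movie"})
print(link)
```

The app that registered the link would then read the parameters and open the matching screen, which is why the result differs between apps that integrate with the assistant and apps that do not.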

In short, although Google Assistant and other virtual assistants seem human in some ways (you can talk to them, and you can ask them to do things only humans could do before), they are still designed for specific tasks. They recognize and classify people’s commands and follow different models to finish the tasks.



What is the difference between BERT and Google Duplex? BERT is used for Google Search, but its effect seems similar to Duplex in some ways.



App Actions overview | Google Developers. (n.d.). Retrieved March 16, 2021, from

Conversational Actions. (n.d.). Google Developers. Retrieved March 15, 2021, from

Google Duplex: An AI System for Accomplishing Real-World Tasks Over the Phone. (n.d.). Google AI Blog. Retrieved March 16, 2021, from

Gruber, T. R., Cheyer, A. J., Kittlaus, D., Guzzoni, D. R., Brigham, C. D., Giuli, R. D., Bastea-Forte, M., & Saddler, H. J. (2017). Intelligent automated assistant (United States Patent No. US9548050B2).

Speech-to-Text basics | Cloud Speech-to-Text Documentation. (n.d.). Retrieved March 16, 2021, from

Siri’s Road to Accurate Speech Recognition

The chain of operations that transforms sound patterns into data in Siri begins with acoustic detection on a low-power processor. In Siri’s case, this is the M-series motion coprocessor, or “AOP” (Always On Processor). The significance of this processor is that it does not require the main processor to be running in order to activate Siri (on mobile devices). The AOP detects the acoustic waves associated with the activation phrase “Hey Siri” using MFCCs (mel-frequency cepstral coefficients), which turn the sound waves into coefficients that represent the sound as data. Once these coefficients are produced, they are held in a frame buffer in RAM (random-access memory) as bits. These bits are then fed into a small DNN with 32 hidden units. The output of this small DNN is combined with an HMM (hidden Markov model), a statistical model, to produce a score that decides whether or not to wake the main processor. Once the main processor is activated, a larger DNN with 192 hidden units is used, again together with an HMM, to produce a more accurate interpretation of the speech.

The road to accurate speech recognition has been long, and the early stages of Siri relied on cruder techniques. Initially, Siri required a user to activate her manually before giving commands, which allowed teams at Apple to collect data during those early phases for later use with remote activation. The early versions of Siri provided a speech corpus (a database of audio files) that later versions could draw on, and these larger audio databases made the DNNs coupled with HMMs more accurate. Supervised training with standard backpropagation is used to reduce errors, and stochastic gradient descent is used to optimize the models. Siri, like most other machine learning based programs, is a work in progress and will keep improving as more data and more efficient algorithms become available.


How do bandwidth limitations affect the accuracy of speech recognition and virtual assistants?

I understand hidden units are mathematical functions used within a DNN, but how are they separated, or are they separated at all? Why is the number of hidden units in a DNN organized in layers?

Apple’s breakdown of how Siri works glossed over the lower levels of sound wave input into the device and did not break down how the sound waves become data. It simply states “acoustic input”, but what hardware in the phone transforms the sound waves into electrical signals?

At what stage in the process is sound transformed into pixels and then into text, and does this involve interaction with NLP in conjunction with the speech recognition processes?

Lastly, it is still unclear to me what purpose the framebuffer serves in the operations leading to speech recognition.


Acoustic model. (2020). In Wikipedia.

Backpropagation algorithm: An overview | ScienceDirect Topics. (n.d.). Retrieved March 15, 2021, from

Framebuffer. (2020). In Wikipedia.

Hey siri: An on-device dnn-powered voice trigger for apple’s personal assistant. (n.d.). Apple Machine Learning Research. Retrieved March 15, 2021, from

Hidden layer. (2019, May 17). DeepAI.

Mel-frequency cepstrum. (2020). In Wikipedia.

Paramonov, P., & Sutula, N. (2016). Simplified scoring methods for HMM-based speech recognition. Soft Computing, 20(9), 3455–3460.

Stochastic gradient descent. (2021). In Wikipedia.

Ok Google

This was yet another exciting week for me because we focused on IPAs (intelligent personal assistants), which is what I have mostly focused on since coming to CCT*. I wanted to continue what I was learning and working on during my undergrad years, when I studied and analyzed IPAs. Studying IPAs while Alexa was coming out, I can’t say there wasn’t some uneasiness surrounding the topic. Having a device in the same room that can constantly hear or record you, and that of course keeps track of information and data, can sound pretty scary, especially when it is a new advancement. So I tended to stay away from having my own device (not counting Siri, which I had already used), yet I still found the concept extremely intriguing. For me, it is the closest thing we have to a human-like robot in our daily lives. When I moved to DC, a family friend gave me a Google Home as a housewarming gift so that we could share photos with each other, since it displays them on its screen. Because it was a gift, I kept it, gave it a try, and have been using it ever since, which is why I decided to focus on Google Assistant for this post.

Google Assistant comes in mobile and smart home (Google Home) versions and debuted in 2016. The assistant interacts through voice and dialogue, and it provides results or executes commands based on the user’s verbal requests.

*There is a research paper(s) that goes along with this, please feel free to reach out if you’re interested! 

How Siri Works

Natalie Guo

Q1: What is Siri?

A1: Siri is an Intelligent Virtual Assistant (IVA) or Intelligent Personal Assistant (IPA), or, in common terms, a chatbot.

Q2: What does Siri do?

A2: Siri can perform phone actions and provides a natural language interface driven by voice commands. It can also carry out remote instructions or work together with third-party apps to better satisfy users’ needs.

Q3: What techniques does Siri need to accomplish the tasks above?

A3: Speech Recognition Engine + Advanced ML tech + Convolutional Neural Network + Long Short-term Memory + Knowledge Navigator + text-to-speech voice based on deep learning technology.

Siri doesn’t “recognize our voice or understand our commands”; it translates the input into digital data or text that it can process and match against its database. One possible approach is to “recognize” the input as pieces of sound waves. Each cluster of the wave may represent a specific word, and when those clusters are combined, they form a sentence. Inside the black box, a huge database collects a massive number of “voice wave” samples so that Siri can learn which cluster represents which natural language meaning (How does Siri work?, 2011).

Then the algorithm behind Siri takes over: natural language processing driven by ML techniques (please correct me if I’m wrong). Siri was made to pick up keywords and important phrases. During the text-to-speech process, a program called Praat, which is used at Nuance, can take the waveform, turn it into a spectrogram, and create phonetic labels (which mark the vowels), stress labels, and pitch labels, and these help decide which parts get selected for the output.

The article Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant explains this in detail. First, although most of Siri runs “in the cloud”, the DNN-powered (Deep Neural Network) voice trigger runs on the device so that it can hear the user say “Hey Siri” at any moment. It then computes a confidence score to decide whether you actually want to wake Siri up. Siri uses two networks: one for detection and the other for checking.

Question: Siri seems to have trouble when the user suddenly changes the command, or when it has to punctuate a run-on sentence in which several subjects occur. Why does this happen, and what do we need to work on to make it better?


Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant. (n.d.). Apple Machine Learning Research. Retrieved March 15, 2021, from

How does Siri work ? (2011, December 20). [Video]. YouTube.

Inside Nuance: the art and science of how Siri speaks. (2013, September 17). [Video]. YouTube.

This Is The Algorithm That Lets Siri Understand Your Questions | Mach | NBC News. (2017, June 28). [Video]. YouTube.

Weekly Takeaways


The virtual assistant (VA) is now everywhere in our daily lives. “It is a software agent that can perform tasks or services for an individual based on commands or questions” (Wikipedia). The well-known VAs (Alexa, Siri, Cortana, etc.) have different focuses. This week I will concentrate on the chatbot function of Amazon Lex and attempt to de-blackbox it.

“Amazon Lex is a service for building conversational interfaces into any application using voice and text, which powers the Amazon Alexa virtual assistant” (Wikipedia). According to Amazon Web Services (AWS), “Amazon Lex is a service for building conversational interfaces into any application using voice and text.” It offers advanced deep learning functionalities: automatic speech recognition (ASR) to convert speech to text, and natural language understanding (NLU) to recognize the intent of the text, so that developers can build applications with highly immersive user interfaces and lifelike conversational interactions. It usually works with other programs, such as Echo and Alexa, to form a well-functioning application architecture.

As noted above, Lex involves ASR and NLU. For the speech (ASR) part, users speak to the software via an audio feed, and the computer creates a wave file of the words, which is cleaned by removing background noise and normalizing the volume. The filtered waveform is then broken down into small parts called phonemes. Each phoneme is like a link in a chain, and the phonemes are analyzed in sequence. The ASR algorithm (the RNN we learned about last week is one kind of model used for ASR) uses statistical probability analysis, starting from the first phoneme, to deduce whole words and, from there, complete sentences. When the program knows the sentence, it provides a reasonable response to the user based on its dataset. For the text (NLU) part, which also relies on RNNs in an encoder-decoder architecture, users input words or sentences, and the algorithm converts them into numeric vectors that the computer can process. Again, when the program knows the sentence, it provides a reasonable response based on its dataset. The dataset can be set up by developers through supervised learning and supplemented by unsupervised learning.

I recommend checking this workshop case on Amazon Lex collaborating with other APIs. It is an easy implementation of AWS’s modules.


Week 8 Reflection


Designing a virtual assistant is no simple task, but to do so, we would need to think about how virtual assistants work. They work via text (online chat, especially in an instant messaging app or another app, SMS Text, e-mail), voice (Amazon Alexa, Siri, Google Assistant), and through taking and/or uploading images (Samsung Bixby on the Samsung Galaxy S8). As a broad overview, virtual assistants use NLP to match user text or voice input to executable commands. These assistants continue to learn over time by using artificial intelligence techniques including machine learning. 

Before Apple made Siri hands-free, users activated it by first pressing the home button and then saying “Hey Siri.” This is an important step in the development of hands-free virtual assistants because it tells us how Apple trained its technology. The users’ “Hey Siri” utterances were used for the initial training set for the US English detector model. Apple also included general speech examples, as used for training the main speech recognizer. To check the initial automatic transcripts for accuracy, Apple had a team of people verify that the data that would become the foundation of the program was correct.

Apple products, like many virtual assistant products, are built with a microphone. The microphone captures audio, turning our voices into a stream of instantaneous waveform samples at a rate of 16,000 per second. These samples are accumulated and converted into a sequence of frames, each describing the sound spectrum of approximately 0.01 sec. The frames are fed into a Deep Neural Network acoustic model, which converts the acoustic patterns into a probability distribution over a set of speech sound classes. For example, the sound classes used in the phrase “Hey Siri”, plus silence, total about 20.
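The arithmetic behind these numbers can be checked with a short sketch. The sample rate and frame duration come from the text; cutting the stream into non-overlapping frames is a simplifying assumption made here, since real front-ends often use overlapping windows.

```python
SAMPLE_RATE = 16_000   # samples per second, from the text
FRAME_SECONDS = 0.01   # each frame describes about 0.01 s of sound


def samples_per_frame(sample_rate=SAMPLE_RATE, frame_seconds=FRAME_SECONDS):
    """Number of waveform samples covered by one frame."""
    return round(sample_rate * frame_seconds)


def split_into_frames(samples, frame_length):
    """Cut the sample stream into non-overlapping frames, dropping any incomplete tail."""
    usable = len(samples) - len(samples) % frame_length
    return [samples[i:i + frame_length] for i in range(0, usable, frame_length)]


one_second = [0] * SAMPLE_RATE
frames = split_into_frames(one_second, samples_per_frame())
print(samples_per_frame(), len(frames))
```

Under these assumptions, one frame covers 160 samples, and one second of audio yields 100 frames for the acoustic model to label.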

In order to keep the technology hands-free and able to activate on command, a small speech recognizer runs all the time and listens for just its wake word. In iPhones, this recognizer runs on the Always On Processor (AOP). While Apple uses “Hey Siri,” other well-known wake words include “OK Google” or “Hey Google”, “Alexa”, and “Hey Microsoft.” When the speech recognizer detects the wake word(s), the device parses the speech that follows as a command or query.

Once the acoustic patterns of our voice at each instant are converted into a probability distribution over speech sounds, a temporal integration process computes a confidence score that the phrase you uttered was in fact the wake word. If the score is high enough, the virtual assistant wakes up. It is also important to note that the threshold to decide whether to activate Siri is not a fixed value.
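A heavily simplified sketch of this idea is shown below. It is not Apple's temporal integration, which the Apple article describes in terms of an HMM-style recurrence; here the per-frame wake-phrase probabilities are simply averaged over a sliding window. The window size, the example probabilities, and the default threshold are all made-up assumptions.

```python
def confidence_score(frame_probs, window=3):
    """Highest average wake-phrase probability over any run of `window` consecutive frames."""
    if len(frame_probs) < window:
        return 0.0
    averages = [
        sum(frame_probs[i:i + window]) / window
        for i in range(len(frame_probs) - window + 1)
    ]
    return max(averages)


def should_wake(frame_probs, threshold=0.8, window=3):
    """Wake the assistant only if the integrated score reaches the threshold."""
    return confidence_score(frame_probs, window) >= threshold


sustained = [0.1, 0.2, 0.9, 0.9, 0.9, 0.3]   # the phrase is heard over several frames
spiky = [0.1, 0.9, 0.1, 0.9, 0.1]            # isolated spikes, probably noise
print(should_wake(sustained), should_wake(spiky))
```

Passing `threshold` as a parameter reflects the point above that the decision threshold is not a fixed value.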

The Deep Neural Network acoustic model, once trained with not only our wake word but also a general corpus of speech, allows virtual assistants to provide a sound class label for each frame and ultimately to estimate the probabilities of the states given the local acoustic observations. “The output of the acoustic model provides a distribution of scores over phonetic classes for every frame. A phonetic class is typically something like ‘the first part of an /s/ preceded by a high front vowel and followed by a front vowel.’”

Once the question or task has been captured as speech and processed through the DNN, Siri needs a source of answers. For fact-based questions, Apple licenses Wolfram Alpha’s knowledge base. The example given on Wikipedia is “How old was Queen Elizabeth II in 1974?” Wolfram Alpha displays its “input interpretation” of such a question, using standardized phrases such as “age | of Queen Elizabeth II (royalty) | in 1974”, followed by the answer, “Age at start of 1974: 47 years”, and a biography link.

In terms of a virtual assistant’s voice, after databases have been trained, many companies hire local voice talent and have them read books, newspapers, web articles, and more. These recordings are transcribed to match words to sounds in order to identify phonemes, the individual sounds that make up all speech. “They try to capture these phonemes spoken in every imaginable way: trailing off at the end of the word, harder at the beginning, longer before a pause, rising in a question. Each utterance has a slightly different sound wave…every sentence Siri speaks contains dozens or hundreds of these phonemes, assembled like magazine cut-outs in a ransom note. It’s likely that none of the words you hear Siri say were actually recorded the way they’re spoken” (Wired). As companies continue to hunt for the right voice talent, they run the speech of those who audition through the models they’ve built, looking for phoneme variability: “essentially, the sound-wave difference between the left and right side of each tiny utterance. More variability within a phoneme makes it hard to stitch a lot of them together in a natural-sounding way, but you’d never hear the problems listening to them speak. Only the computer sees the difference” (Wired). Once a person is found who sounds right to both human and computer, they record for weeks at a time, and that becomes the voice of the virtual assistant.


Apple Machine Learning Research. “Hey Siri: An On-Device DNN-Powered Voice Trigger for Apple’s Personal Assistant.” Accessed March 15, 2021.

“How Apple Finally Made Siri Sound More Human.” Wired. Accessed March 14, 2021.

“Virtual Assistant.” In Wikipedia, March 10, 2021.

“WolframAlpha.” In Wikipedia, March 13, 2021.


I’m having a hard time understanding the mathematical side of the DNN. Can you please explain?
“The DNN consists mostly of matrix multiplications and logistic nonlinearities. Each “hidden” layer is an intermediate representation discovered by the DNN during its training to convert the filter bank inputs to sound classes. The final nonlinearity is essentially a Softmax function (a.k.a. a general logistic or normalized exponential), but since we want log probabilities the actual math is somewhat simpler.”
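One way to read the quoted description is as the small forward pass sketched below. It is not Apple's network: the layer sizes and weights are invented, and only the structure follows the quote (matrix multiplications, logistic nonlinearities in the hidden layers, and a final softmax computed directly as log probabilities).

```python
import math


def logistic(x):
    """Logistic nonlinearity: squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """Matrix-vector product plus bias, with one row of weights per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def log_softmax(values):
    """Log of the softmax, computed stably by subtracting the maximum first."""
    peak = max(values)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_total for v in values]


def forward(inputs, hidden_layers, output_layer):
    """Run the inputs through each hidden layer, then return log probabilities per class."""
    activations = inputs
    for weights, biases in hidden_layers:
        activations = [logistic(z) for z in dense(activations, weights, biases)]
    out_weights, out_biases = output_layer
    return log_softmax(dense(activations, out_weights, out_biases))


# Made-up toy network: 2 inputs, one hidden layer of 2 units, 3 sound classes.
hidden = [([[0.5, -0.25], [0.1, 0.2]], [0.0, 0.0])]
output = ([[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0])
log_probs = forward([1.0, 2.0], hidden, output)
print(log_probs)
```

Each hidden unit is just one row of the weight matrix followed by the logistic function, and exponentiating the final log probabilities gives values that sum to 1 across the sound classes.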

Siri: She may be rigid but she works

Apple’s Siri hit the market in 2010, with full iOS integration in 2011, becoming the famous counterpart to Alexa and Google Assistant. But how does it work?

Originally created as a third-party app, Siri uses many layers to try to understand its user’s input and create a helpful reply based on that input.

It starts with the command. Siri processes the command and decides where the statement ends using two things: a limit on how long a pause can be before the statement is considered finished, and a limit on system memory, which cuts the statement off after a certain amount of input when it becomes too long.
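The two stopping rules can be sketched as a simple endpoint detector. This is a toy illustration, not Siri's code: the per-frame energy values, the silence threshold, and both limits are hypothetical.

```python
def find_endpoint(frame_energies, silence_threshold=0.1, max_pause_frames=3, max_frames=50):
    """Return the index of the frame at which the command is considered finished."""
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if index >= max_frames:
            return index  # rule 2: the input is too long, so cut it off
        if energy < silence_threshold:
            silent_run += 1
            if silent_run >= max_pause_frames:
                # rule 1: the pause is long enough; end at the first frame of that pause
                return index - max_pause_frames + 1
        else:
            silent_run = 0  # speech resumed, so short pauses are forgiven
    return len(frame_energies)


energies = [0.5, 0.6, 0.0, 0.7, 0.0, 0.0, 0.0, 0.9]
print(find_endpoint(energies))
```

In the example, the single quiet frame in the middle does not end the command, but the run of three quiet frames does.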

The spoken language then moves to the natural language processor, which compares what was spoken with what it believes certain words or phrases sound like, using a predictability matrix to determine the most likely candidate for both individual words and whole phrases. This uses a dictionary of the words and phrases that are most likely to be uttered. The process emphasizes identifying the individual words that make up the command, keeping the transcribed command as close as possible to the user’s input.

The transcription is then passed up to cloud services, which determine what to do with it and return both the full transcription of what Siri believes the user said and the response to that prompt.

This means that Siri attempts to ping certain apps based on the given statement and uses these networked services to provide an answer. Siri is usually not the one directly providing the answer; instead, it may use a text-to-speech feature to read a short prompt back to the user, and then either verify that its response is correct or ask for more or different input.

Siri is built on an external system that relies on other apps and services to return the result the user wants. This means that outside of the built-in networked apps such as Calendar and directions, Siri relies on the internet, Wolfram Alpha, and other integrated services to return the response to the user. This makes Siri rigid, because these commands are not handled directly by the NLP and require certain keyword statements to achieve the desired result. These keyword statements do not flow as well as natural speech; they work more as if you had typed a statement into a Google search bar and taken the top result.

Siri has improved since its inception, integrating better language processing and abilities, but it still uses external applications to do the bulk of the heavy lifting. Siri is mostly a virtual liaison: it hands much of the work to other platforms, so that it mostly just handles the natural language processing and then processes and summarizes the results of the inquiry.

Information retrieved from:

De-black boxing of virtual assistant (Alexa)


Week 8

Alexa is a well-known virtual assistant developed by Amazon using AI in 2014 (Wikipedia, 2021). Alexa can play music, interact with our voices, make to-do lists, set alarms, provide weather information, etc. (Wikipedia, 2021). We can use Alexa as a home automation system to control our smart devices (Wikipedia, 2021) (Amazon, 2021). Besides that, we can extend its functionality by installing add-ons called skills. Device manufacturers can integrate Alexa voice capabilities into their products using the Alexa Voice Service. In this way, any product built with this cloud-based service has access to a list of automatic speech recognition and natural language processing capabilities. Amazon uses long short-term memory (LSTM) networks for generating voices (Amazon, 2021). In 2016, Amazon released Lex, making speech recognition and natural language processing (NLP) available for developers to create their own chatbots (Barr, 2016). Less than a year later, Lex became generally available (Barr, AmazonLex–NowGenerallyAvailable, 2017). Now, web and mobile chat is available using Amazon Connect (Hunt, 2019).

The main components of a virtual assistant device such as the Echo include a light ring, a volume ring to control the sound level, a microphone array used to detect, record, and listen to our voices, a power port to power the device, and an audio output. The virtual assistant then recognizes the voice and stores the conversation in the cloud.

De-black boxing of virtual assistant (United States Patent No. US2012/0016678 A1, 2012)

Level 0:

Here, the virtual assistant is just a black box whose input is a voice command from the user and whose output is a voice response. Fig.1 shows the black box of the virtual assistant (Alexa, for example).

Fig1. Black box of Virtual Assistant

Level 1:

For level-1 de-black boxing, we can see the following components:

  • ASR (Automatic Speech Recognition): Returns speech as text.
  • NLU (Natural Language Understanding): Interprets text as a list of possible intents (commands).
  • Dialog manager: Looks at the intent and determines whether it can handle it. The specified rules define which speechlet is to be processed.
  • Data store: Includes the voice in a text response.
  • Text to speech: Translates skill outputs into an audible voice.
  • The third-party skill: Written by a third party, which is responsible for the skill’s actions and operations.

Fig.2 shows the level-1 de-black boxing of a virtual assistant (Alexa).

Fig2. Level-1 of De-black boxing of the Alexa System
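The level-1 components can be read as a chain of function calls, sketched below with stand-in functions. None of this is Alexa's actual API: every function, the `GetWeather` intent name, and the canned replies are hypothetical placeholders that only show how the output of one stage feeds the next.

```python
def asr(audio):
    """Stand-in for speech recognition: here the 'audio' already carries its transcript."""
    return audio["transcript"]


def nlu(text):
    """Stand-in for intent interpretation, using a simple keyword check."""
    if "weather" in text.lower():
        return {"intent": "GetWeather"}
    return {"intent": "Unknown"}


def weather_skill(intent):
    """Stand-in for a third-party skill that produces the text of the answer."""
    return "It is sunny today."


SKILLS = {"GetWeather": weather_skill}


def dialog_manager(intent):
    """Decide whether a skill can handle the intent and route the request to it."""
    skill = SKILLS.get(intent["intent"])
    if skill is None:
        return "Sorry, I can't help with that."
    return skill(intent)


def tts(text):
    """Stand-in for text to speech: wraps the text that would be spoken."""
    return {"speech": text}


def handle(audio):
    """Voice in, voice out: ASR, then NLU, then dialog manager and skill, then TTS."""
    return tts(dialog_manager(nlu(asr(audio))))


print(handle({"transcript": "What's the weather?"}))
```

Seen from outside, `handle` is the level-0 black box; the functions inside it are the level-1 view.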

Level 2:

De-black box the ASR

The acoustic front-end converts the speech signal into corresponding features (speech parameters) via a process called feature extraction. The parameters of the word/phone models are estimated from the acoustic vectors of the training data. The decoder searches through all possible word sequences to find the sequence of words that is most likely to have produced the observed features. In a training phase, the operator reads all the vocabulary words and the word patterns are stored. Later, in the recognition step, the incoming word pattern is compared to the stored patterns and the word that gives the best match is selected. Fig3 illustrates the de-black boxing of ASR.

Fig3. Level2 of De-black box the ASR

De-black box of NLU (Natural Language Understanding)

Intent Classification (IC) and Named Entity Recognition (NER) use machine learning to recognize natural language variation. To identify and categorize key information (entities) in text, we need the NER part of NLU. NER is a form of NLP that includes two steps: detecting the named entity and categorizing it. In step 1, NER detects a word or string of words that forms a whole entity. Each word is a token: “The Great Lakes” is a string of three tokens representing one entity. The second step requires the creation of entity categories such as person, organization, location, etc. IC labels an utterance with one of a predetermined set of intents. Domain Classification is a text classification model that determines the target domain for a given query. It is trained using many labelled queries across all domains in an application. Entity Resolution is the last part of NLU; it disambiguates records that correspond to real-world entities across and within datasets. So, for a request to play “Creedence Clearwater Revival” that is spoken as “CCR”, the NER output will be “CCR (ArtistName)”, the domain classifier output is “music”, the IC output is “PlayMusicIntent”, and the entity resolution output will be “Creedence Clearwater Revival”. Fig.4 shows the de-black boxing of the NLU.

Fig4. Level2 of De-black box the NLU
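The CCR example can be written out as the structured result an NLU stage might hand on. This is a toy sketch that only handles utterances of the form "play <artist>"; the alias table and the field names are assumptions for illustration, not Alexa's data format.

```python
# Hypothetical catalog mapping spoken aliases to canonical artist names.
ARTIST_ALIASES = {"ccr": "Creedence Clearwater Revival"}


def resolve_entity(mention, aliases=ARTIST_ALIASES):
    """Entity resolution: map a mention to its canonical name if the alias is known."""
    return aliases.get(mention.lower(), mention)


def interpret(utterance):
    """Toy NLU for utterances of the form 'play <artist>'."""
    words = utterance.split()
    if not words or words[0].lower() != "play":
        return None
    mention = " ".join(words[1:])
    return {
        "domain": "music",                # domain classification
        "intent": "PlayMusicIntent",      # intent classification
        "entities": [
            {
                "type": "ArtistName",     # named entity recognition
                "mention": mention,
                "resolved": resolve_entity(mention),
            }
        ],
    }


result = interpret("play CCR")
print(result)
```

Keeping both `mention` and `resolved` shows the difference between what NER finds in the text and what entity resolution decides it refers to.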

Dialog Manager (DM)

The DM selects what to report or say back to the user, whether to take any action, and how to handle the conversation. The DM includes two parts: dialog state tracking, which estimates the user’s goals by taking the dialog context as input, and the dialog policy, which generates the next system action. Dialog state tracking can be done using an RNN or a neural belief tracker (NBT), while the dialog policy can be learned using reinforcement learning (RL). Fig.5 shows Level 2 of de-black boxing the DM.

Fig5. Level2 of De-black box the DM
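The split between state tracking and policy can be illustrated with a rule-based sketch. It deliberately swaps the RNN, NBT, and RL techniques named above for hand-written rules, and the restaurant booking slots are a hypothetical task chosen for illustration.

```python
# Hypothetical slots that a booking task needs before it can be completed.
REQUIRED_SLOTS = ("restaurant", "time", "party_size")


def update_state(state, new_slots):
    """Dialog state tracking: merge newly heard slot values into the running state."""
    updated = dict(state)
    updated.update({key: value for key, value in new_slots.items() if value is not None})
    return updated


def next_action(state):
    """Dialog policy: ask for the first missing slot, or confirm when all are filled."""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return ("ask", slot)
    return ("confirm", None)


state = update_state({}, {"restaurant": "Luigi's"})
print(next_action(state))
state = update_state(state, {"time": "7pm", "party_size": 2})
print(next_action(state))
```

A learned tracker and policy play the same two roles, but estimate the state and choose the action from data rather than from fixed rules.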

De-black box of Text-To-speech TTS

The last part of the virtual assistant allows the computer to read text aloud. The linguistic front-end converts the input text into a sequence of features such as phonemes and sentence type. The prosody model predicts the pattern and melody that form the expressive qualities of natural speech. The acoustic model transforms the linguistic and prosodic information into frame-rate spectral features. Those features are fed into the neural vocoder and are also used to train a lighter and smaller vocoder. The neural vocoder generates a 24 kHz speech waveform. It consists of a convolutional neural network that expands the input feature vectors from the frame rate to the sample rate and a recurrent neural network that synthesizes audio samples auto-regressively at 24,000 samples per second. Fig6 shows the details of TTS.

Fig6. Level2 of De-black box the TTS
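The step from frame rate to sample rate can be illustrated with simple arithmetic. The 24,000 samples per second come from the text, but the 5 ms frame length is an assumption made here, and repeating each feature is a crude stand-in for the learned upsampling that the convolutional network performs.

```python
SAMPLE_RATE = 24_000  # samples per second, from the text


def samples_per_frame(frame_seconds, sample_rate=SAMPLE_RATE):
    """How many audio samples one frame of features has to cover."""
    return round(frame_seconds * sample_rate)


def upsample_by_repetition(frame_features, factor):
    """Repeat each frame-level feature so that there is one value per audio sample."""
    return [feature for feature in frame_features for _ in range(factor)]


factor = samples_per_frame(0.005)  # assumed frame length of 5 ms
expanded = upsample_by_repetition([0.2, 0.7], factor)
print(factor, len(expanded))
```

With the assumed 5 ms frames, each frame of spectral features has to be stretched over 120 audio samples before the recurrent network can generate the waveform sample by sample.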

Fig7 shows the de-black boxing of the Alexa Echo system.

Fig7. Level2 of De-black boxing of the Alexa System (De-black boxing of Echo system)


Amazon. (2021). Amazon Lex. Retrieved from:

(2012). United States Patent No. US2012/0016678 A1.

Jeff Barr. (2016). amazon-lex-build-conversational-voice-text-interfaces Retrieved from: AWSNewsBlog:

Jeff Barr. (2017). AmazonLex–NowGenerallyAvailable. Retrieved from: AWSNewsBlog:

Randall Hunt. (2019). Amazon-Connect. Retrieved from: AWS Contact Center:

Wikipedia. (2021). Amazon_Alexa Retrieved from: Wikipedia:

wikipedia. (2021). Virtual_assistant.  Retrieved from: wikipedia_Virtual_assistant:

SIRI: Awesome But not Intelligent- Chirin Dirani

In general, virtual assistants (VAs) are software agents that can perform tasks or services for an individual based on commands and questions. These commands and questions reach the VA through text, voice (speech recognition), or images. VA usage has increased dramatically in the last three years, and many products, especially ones using email and voice interfaces, have entered the market. Apple and Google have installed bases of users on their smartphones; Microsoft has an installed base on Windows-based personal computers, smartphones, and smart speakers; Amazon has an installed base for smart speakers only; and Conversica’s engagements are based on email and SMS. In this assignment, I will focus on one speech recognition virtual assistant service, Apple’s Siri. By analyzing how Siri works, I will try to explain how NLP can help convert human commands into tasks that machines can act on.

Siri is speech-activated virtual assistant software that can interpret human speech and respond via a synthesized voice. “The assistant uses voice queries, gesture based control, focus-tracking and a natural-language user interface to answer questions, make recommendations, and perform actions.” Siri does this by delegating requests to a set of internet services. The software adapts to users’ individual language usage, searches, and preferences with continued use. Like other speech-activated virtual assistants, Siri uses speech recognition and natural language processing (NLP) to receive, process, and answer questions or carry out requests. In what follows, I will try to analyze how this system works.

As mentioned before, NLP and speech recognition are the foundations of virtual assistant design. Four main tasks turn voice inputs into voice outputs: converting the voice input (a question or command) into text, interpreting the text, making a decision, and converting text into speech output. This cycle repeats for as long as the user keeps asking questions or giving commands. In more technical terms, the virtual assistant (Siri) receives the user’s voice input through a microphone. Speech recognition then encodes the voice input and converts it into data the computer can recognize. Linking speech recognition to complex NLP helps the software figure out what the user says, what the user means, and what the user wants to happen. The software connects with a third party to make a decision and then either carries out the user’s command (takes action) or answers the user’s question by converting the answer into speech, which is played through Siri’s speaker. The following diagram illustrates the many levels that make up Siri’s complex system.

Virtual assistants already have many useful applications and will have more in the future, especially in medicine and in supporting physically challenged individuals. However, the psychological effects of the emotional bonds that users could form with future generations of Siri and similar VAs are alarming. Watching the controversial movie Her and reading about Gatebox made me think deeply about the future social and psychological impact of virtual assistants on the human race. Raising awareness of the design principles behind VAs can help dispel the illusions and hype that companies’ marketing campaigns create around their VAs. Revealing the layers of this innovative system supports what Boris Katz said: “current AI techniques aren’t enough to make Siri or Alexa truly smart.”



Hey Google …

I chose to explore how Google Assistant works because I am an Android user. I have a Samsung Galaxy S9 and had used my virtual assistant so little that I had to Google how to turn it on. After playing around with it, I tested all the functions Google Assistant said it could do: search the Internet, schedule events and alarms, adjust hardware settings on the user’s device, show information from the user’s Google account, engage in two-way conversations, and much more!

Google Assistant can do all of this through its own natural language processing. From my understanding, it follows the same kind of logic that we’ve been learning over the last couple of weeks. The premise is this:

  1. Using a speech-to-text platform, Google Assistant first converts spoken language into text that the system can understand (Week 6; Crash Course #36). A quick rundown on speech recognition: using a spectrogram, spoken vowels and whole words are broken down into their component frequencies. Each vowel produces its own characteristic frequency pattern, and these distinct sound units are termed phonemes. By recognizing these phonemes, computers can convert speech into text. This text is then represented as data through Unicode characters.
  2. Once it has identified the command or question, Google Assistant runs the user’s input through a complex neural network with multiple hidden layers. I’m unsure what specific type of neural network Google uses, but for a quick rundown on neural networks: there is an input layer, one or more hidden layers, and an output layer, connected through links like brain neurons. Learning algorithms train on data sets to adjust those links so that the hidden layers turn the given inputs into an output (Week 5; Machine Learning 3+4).
  3. Google goes through a different process depending on whether the input is a command or a question, then produces the output and any other required actions. It uses speech synthesis, the reverse of the speech recognition process, to present the output to the user.
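The input layer, hidden layer, and output layer from step 2 can be sketched in a few lines of Python. This is only a toy: the layer sizes and weight values below are made up, nothing is trained, and it says nothing about the network Google actually uses.

```python
import math

def sigmoid(x: float) -> float:
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer: a weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Made-up weights for illustration: 2 inputs -> 3 hidden neurons -> 1 output.
# In a real network these numbers are learned from training data.
HIDDEN_W = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]
HIDDEN_B = [0.0, 0.0, 0.0]
OUTPUT_W = [[1.0, -1.0, 0.5]]
OUTPUT_B = [0.0]

def forward(inputs):
    """Pass the inputs through the hidden layer, then the output layer."""
    hidden = layer(inputs, HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)
```

`forward([0.0, 0.0])` returns a list with a single number between 0 and 1. Training would mean nudging the weights until outputs like this match the labels in a data set.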

**I appreciated the figures presented in the beginning and took time to understand them a little more. I think the best in terms of understanding are Fig. 1, Fig. 39, and Fig. 47 (I tried to paste them in my post but I don’t think it worked).


Some defining notes from Google’s patent:

  • Google Assistant has various embodiments as computing devices that can work independently or interact with each other.
  • The various embodiments Google Assistant can take on allow it to access, process, and/or otherwise utilize information from various devices, as well as store memory across these different embodiments.
  • Google Assistant adapts to its users by applying personal information, previous interactions, and physical context to provide more personalized results and improve efficiency.
  • Using an active ontology and the user adaptability mentioned above, it can predict and anticipate the next text through an active input elicitation technique.

*In short, Google Assistant knows a lot about us and is constantly gathering more data to improve its interface and understanding based on patterns.

Compared to other virtual assistants like Apple’s Siri or Amazon’s Alexa, Google Assistant is more intelligent because it uses its own servers, which are capable of searching Google’s entire knowledge base for answers. However, Google Assistant is not as smart as GPT-3, and I use the term smart loosely. GPT-3 is arguably the most advanced natural language processing system on the planet. Developed by OpenAI and released last year, it is the closest humans have come to a machine capable of producing coherent responses to any English task. It can do this because it has far more parameters to train and learn from: about 175 billion, compared with the 1.5 billion of GPT-2. It really is just a bigger version of its predecessor GPT-2, and thus has the same shortfalls that GPT-2 faced regarding comprehension and understanding.

There are a lot of metaphors out there for what GPT-3 is, and the one I like the most is that it is an improv actor. It can write articulate responses that mimic a coherent entity, but it does not understand the meaning behind the text it is writing. The lack of logic and reasoning is evident in its shortfalls regarding semantics and culture. I do not want to completely detract from this momentous step, but after further reading I agree with the scientists who say that a new approach may be warranted: there comes a point when simply going bigger will not work. The Computerphile video put it into an interesting context: if you want to get to space, you can’t just keep building bigger rockets; you have to re-approach the situation. I think this is the point we are at, especially when faced with issues arising from the amount of energy needed to run more training computations, as well as the inherent racist and sexist bias within the training data.


Computerphile. 2020. GPT3: An Even Bigger Language Model – Computerphile.
“Google Assistant.” 2021. In Wikipedia.
“GPT-3, Bloviator: OpenAI’s Language Generator Has No Idea What It’s Talking about.” n.d. MIT Technology Review. Accessed March 13, 2021.
“OpenAI’s New Language Generator GPT-3 Is Shockingly Good—and Completely Mindless | MIT Technology Review.” n.d. Accessed March 13, 2021.
“US9548050.Pdf.” n.d. Accessed March 13, 2021.
“Virtual Assistant.” 2021. In Wikipedia.
“Why GPT-3 Is the Best and Worst of AI Right Now | MIT Technology Review.” n.d. Accessed March 13, 2021.