Google Assistant, like other virtual assistants, is "like shortcuts to parts of an app" (App Actions Overview | Google Developers, n.d.). I can activate Google Assistant by saying "Hey Google" and ask it to play a movie on my phone. I can also ask it to book a restaurant or add a memo. From outside the black box of Google Assistant, we see that the user activates the assistant and gives it an unstructured command; the Assistant then analyzes the words and sends instructions to specific apps to produce the right answers or actions.
Fig 1. Data flow outside the black box. Source: App Actions Overview | Google Developers, n.d.
What's in the black box? First, the questions or commands spoken by users are transformed into text (a human-readable representation). This process is called Automatic Speech Recognition (ASR). The user's voice is first stored in FLAC or WAV files and transmitted to Google's servers. There, the audio undergoes signal processing and feature extraction and is encoded into vectors. The ASR system then scores candidate transcriptions with a trained acoustic model and a language model, combines the two scores to search among the candidates, and finally obtains the recognized result. After decoding, we get the text corresponding to the voice.
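The decoding step above can be sketched as a toy example. This is not Google's actual implementation: the candidate transcriptions, the made-up log-probability scores, and the `lm_weight` tuning knob are all invented for illustration. The idea is only that each candidate gets an acoustic score (how well it matches the audio) and a language score (how plausible the word sequence is), and the decoder keeps the best combination.

```python
def combined_score(acoustic, language, lm_weight=0.8):
    """Combine the two log-scores; lm_weight is a hypothetical tuning knob."""
    return acoustic + lm_weight * language

def best_candidate(candidates):
    """candidates: list of (text, acoustic_score, language_score) tuples."""
    return max(candidates, key=lambda c: combined_score(c[1], c[2]))[0]

candidates = [
    ("how's the weather today", -12.0, -5.0),    # plausible words, good acoustics
    ("house the weather to day", -11.5, -14.0),  # similar sound, unlikely sentence
    ("how is the whether today", -13.0, -11.0),  # "whether" is rare in this context
]
print(best_candidate(candidates))  # -> how's the weather today
```

Note how the second candidate matches the audio slightly better but is rejected because the language model finds the word sequence implausible, which is exactly why the two scores are combined.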
Second, since the user's query may be unstructured, the text has to be turned into a structured query and routed to the right model. Here, "unstructured" means that people have many different ways of asking for the same thing. For example, "how's the weather today" and "what is today's weather forecast" both ask for the same information, but for many reasons the phrasing differs. For this step, the NLP component uses language pattern recognizers to map the text against vocabulary databases, performs semantic matching, and ranks all the candidates to find the most likely match. After that, Google Assistant can route the result to a specific task model, such as a domain model, task flow model, or dialog flow model.
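A minimal sketch of this mapping step, under heavy simplification: real assistants use trained models and large vocabulary databases, while this toy version just normalizes the text, scores each intent by word overlap, and picks the highest-ranked candidate. The intent names and vocabularies are invented for illustration.

```python
# Hypothetical intents and their associated vocabulary words.
INTENT_VOCAB = {
    "GET_WEATHER": {"weather", "forecast", "temperature", "rain"},
    "PLAY_MOVIE": {"play", "watch", "movie", "film"},
    "BOOK_RESTAURANT": {"book", "table", "restaurant", "reservation"},
}

def classify(query):
    """Return (intent, score): the intent whose vocabulary overlaps the query most."""
    words = set(query.lower().replace("'s", " is").split())
    ranked = sorted(
        INTENT_VOCAB.items(),
        key=lambda item: len(words & item[1]),
        reverse=True,
    )
    intent, vocab = ranked[0]
    return intent, len(words & vocab)

# Two differently worded queries map to the same structured intent:
print(classify("how's the weather today"))           # -> ('GET_WEATHER', 1)
print(classify("what is today's weather forecast"))  # -> ('GET_WEATHER', 2)
```

This captures the key property described above: many surface phrasings collapse onto one structured intent, which can then be handed to the matching task model.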
Fig 2. NLP procedure. Source: Gruber et al., 2017.
Third, the output returned depends on the models' results. "When a user's query matches the predefined pattern of a built-in intent, Assistant extracts query parameters into schema.org entities and generates an Android deep link URL" (App Actions Overview | Google Developers, n.d.). In other words, based on the user's command, Google Assistant returns results in a form people can understand. If you want to watch an adventure movie, it might open the Netflix app or simply show you a list of adventure movies; which output you get depends on whether the Netflix app integrates with Google Assistant's API. It is worth mentioning that Google Duplex can book a restaurant and handle similar tasks for users by automatically talking to the shop assistant over a phone call. "At the core of Duplex is a recurrent neural network (RNN) designed to cope with these challenges, built using TensorFlow Extended (TFX)" ("Google Duplex," n.d.).
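The deep-link step quoted above can be illustrated with a small sketch. The URL scheme, host, and parameter names below are hypothetical (they are not Netflix's or Google's real deep links); the point is only that a matched intent plus its extracted entities determine the app URL the Assistant hands off to.

```python
from urllib.parse import urlencode

# Hypothetical mapping from matched built-in intents to app deep-link bases.
INTENT_PATHS = {
    "GET_MOVIE": "https://example-movies.app/search",
    "BOOK_RESTAURANT": "https://example-dining.app/reserve",
}

def build_deep_link(intent, params):
    """Turn a matched intent and its schema.org-style entities into a deep link."""
    base = INTENT_PATHS[intent]
    return f"{base}?{urlencode(params)}"

link = build_deep_link("GET_MOVIE", {"genre": "adventure"})
print(link)  # -> https://example-movies.app/search?genre=adventure
```

An app that registers such links (via App Actions, in Google's case) opens directly on the right screen; one that does not leaves the Assistant to fall back on something generic, like a list of search results.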
In short, although Google Assistant and other virtual assistants seem human in some ways (you can talk to them and ask them to do things only humans could do before), they are still designed for specific tasks. The Assistant simply recognizes and classifies people's commands and follows different models to finish the tasks.
What is the difference between BERT and Google Duplex? BERT is used in Google Search, but its effect seems similar to Duplex's in some ways.
App Actions Overview | Google Developers. (n.d.). Retrieved March 16, 2021, from https://developers.google.com/assistant/app/overview
Conversational Actions. (n.d.). Google Developers. Retrieved March 15, 2021, from https://developers.google.com/assistant/conversational/overview
Google Duplex: An AI System for Accomplishing Real-World Tasks Over the Phone. (n.d.). Google AI Blog. Retrieved March 16, 2021, from http://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html
Gruber, T. R., Cheyer, A. J., Kittlaus, D., Guzzoni, D. R., Brigham, C. D., Giuli, R. D., Bastea-Forte, M., & Saddler, H. J. (2017). Intelligent automated assistant (United States Patent No. US9548050B2). https://patents.google.com/patent/US9548050B2/en
Speech-to-Text basics | Cloud Speech-to-Text Documentation. (n.d.). Retrieved March 16, 2021, from https://cloud.google.com/speech-to-text/docs/basics