Week 8 Reflection

Stephanie

Designing a virtual assistant is no simple task; to do so, we would first need to think about how existing virtual assistants work. They take input via text (online chat, especially in instant messaging or other apps, SMS, e-mail), voice (Amazon Alexa, Siri, Google Assistant), and captured or uploaded images (Samsung Bixby on the Samsung Galaxy S8). As a broad overview, virtual assistants use NLP to match user text or voice input to executable commands, and they continue to learn over time using artificial intelligence techniques, including machine learning.

Before Apple made Siri fully hands-free, users invoked it by first pressing the home button and then saying “Hey Siri.” This is an important step in the development of hands-free virtual assistants because it tells us how Apple trained its technology. Those users’ “Hey Siri” utterances were used as the initial training set for the US English detector model, together with general speech examples of the kind used to train the main speech recognizer. To make sure the data that would form the foundation of the program was correct, Apple had a team of people check the initial automatic transcripts for accuracy.

Apple products, like many virtual assistant products, are built with a microphone, which is responsible for capturing audio. It turns our voices into a stream of instantaneous waveform samples at a rate of 16,000 per second. These samples are accumulated and converted into a sequence of frames, each describing the sound spectrum of approximately 0.01 seconds of audio. The frames are fed into a Deep Neural Network (DNN) acoustic model, which converts the acoustic patterns into a probability distribution over a set of speech sound classes; for the phrase “Hey Siri,” there are about 20 such classes, including one for silence.
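To make this front end more concrete for myself, here is a rough Python sketch: cut a 16 kHz waveform into roughly 0.01-second frames, describe each frame by its spectrum, and score each frame with a toy acoustic model. The single layer, the 20-class output, and all of the numbers are illustrative assumptions on my part, not Apple’s actual design.

```python
# A minimal sketch (not Apple's pipeline) of turning a 16 kHz waveform into
# ~10 ms frames and scoring each frame against a set of sound classes.
import numpy as np

SAMPLE_RATE = 16000             # samples per second, as described above
FRAME_LEN = SAMPLE_RATE // 100  # 160 samples, i.e. about 0.01 s per frame
N_CLASSES = 20                  # roughly the number of sound classes for "Hey Siri"

def frame_signal(waveform: np.ndarray) -> np.ndarray:
    """Split a 1-D waveform into non-overlapping 10 ms frames."""
    n_frames = len(waveform) // FRAME_LEN
    return waveform[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

def frame_spectrum(frames: np.ndarray) -> np.ndarray:
    """Describe each frame by the magnitude of its frequency spectrum."""
    return np.abs(np.fft.rfft(frames, axis=1))

def acoustic_model(spectra: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Toy single-layer 'acoustic model': map each frame's spectrum to a
    probability distribution over N_CLASSES sound classes via a softmax."""
    logits = spectra @ weights + bias
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Example: one second of (random) audio -> 100 frames -> 100 distributions
rng = np.random.default_rng(0)
waveform = rng.standard_normal(SAMPLE_RATE)
spectra = frame_spectrum(frame_signal(waveform))
probs = acoustic_model(spectra,
                       rng.normal(size=(spectra.shape[1], N_CLASSES)),
                       np.zeros(N_CLASSES))
print(probs.shape)  # (100, 20): one distribution over sound classes per frame
```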

In order to keep the technology hands-free and therefore able to activate on command, a small speech recognizer runs all the time and listens for just its ‘wake word’. On iPhones, this detector runs on the Always On Processor (AOP), a small, low-power auxiliary processor. While Apple uses “Hey Siri,” other well-known wake words include “OK Google” or “Hey Google”, “Alexa”, and “Hey Microsoft.” When the speech recognizer detects the wake word(s), the device parses the speech that follows as a command or query.
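Below is a hedged sketch of the two-stage idea I take away from this: a tiny detector looks at every chunk of audio, and only when it fires does a larger checker (and eventually the full recognizer) get involved. The function names, types, and thresholds are all invented for illustration.

```python
# A hedged sketch of an always-listening, two-stage wake-word gate.
# Everything here is illustrative; it is not how the AOP actually works.
from typing import Callable, Iterable, Iterator, List

AudioChunk = List[float]  # a short buffer of waveform samples (illustrative)

def always_on_loop(audio_chunks: Iterable[AudioChunk],
                   small_detector: Callable[[AudioChunk], float],
                   large_checker: Callable[[AudioChunk], float],
                   handle_query: Callable[[Iterator[AudioChunk]], None],
                   small_threshold: float = 0.5,
                   large_threshold: float = 0.8) -> None:
    """Listen continuously; only wake fully when both detectors agree."""
    chunks = iter(audio_chunks)
    for chunk in chunks:
        if small_detector(chunk) < small_threshold:
            continue  # the cheap, always-on first stage stays in control
        if large_checker(chunk) < large_threshold:
            continue  # first-stage false alarm; go back to listening
        # Wake word accepted: treat the audio that follows as a command or query.
        handle_query(chunks)
        break
```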

Once the acoustic patterns of our voice at each instant are converted into a probability distribution over speech sounds, a temporal integration process computes a confidence score that the phrase just uttered was in fact the wake word. If the score is high enough, the virtual assistant wakes up. It is also important to note that the threshold used to decide whether to activate Siri is not a fixed value.
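To make the temporal integration step concrete for myself, here is a small sketch in which the per-frame probabilities of the wake-phrase sound classes are pooled over a window of recent frames into one confidence score and compared to a threshold. The window length, the threshold, and the choice of average log-probability are made-up values; as noted above, the real threshold is not fixed.

```python
# A hedged sketch of temporal integration: pool per-frame probabilities for the
# wake-phrase sound classes into a single confidence score, then threshold it.
import numpy as np

def wake_score(frame_probs: np.ndarray, wake_classes: list[int],
               window: int = 100) -> float:
    """Average log-probability mass on the wake-phrase classes over the last
    `window` frames (about 1 second at 0.01 s per frame)."""
    recent = frame_probs[-window:]                    # shape: (frames, n_classes)
    wake_mass = recent[:, wake_classes].sum(axis=1)   # prob. mass on wake classes
    return float(np.mean(np.log(wake_mass + 1e-10)))

def should_wake(frame_probs: np.ndarray, wake_classes: list[int],
                threshold: float = -1.0) -> bool:
    """Wake only if the integrated confidence score clears the threshold."""
    return wake_score(frame_probs, wake_classes) > threshold
```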

Once trained, not only on the wake phrase but also on a larger corpus of general speech, the Deep Neural Network acoustic model allows virtual assistants to provide a sound class label for each frame and ultimately to estimate the probabilities of the states given the local acoustic observations. “The output of the acoustic model provides a distribution of scores over phonetic classes for every frame. A phonetic class is typically something like ‘the first part of an /s/ preceded by a high front vowel and followed by a front vowel’” (Apple Machine Learning Research).

Once the question or task has been captured as speech waves and processed through the DNN-based recognizer, Siri can hand fact-based questions to Wolfram Alpha’s knowledge base, which Apple licenses. Wikipedia gives the example “How old was Queen Elizabeth II in 1974?”: Wolfram Alpha displays its “input interpretation” of such a question using standardized phrases such as “age | of Queen Elizabeth II (royalty) | in 1974”, the answer to which is “Age at start of 1974: 47 years”, along with a biography link.
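Out of curiosity, here is a hedged sketch of what asking a Wolfram Alpha-style knowledge base a fact-based question can look like. It uses Wolfram’s public Short Answers web API as I understand it from their developer documentation, not whatever private integration Apple uses, and the “DEMO-APP-ID” placeholder would need to be replaced with a real API key.

```python
# A hedged sketch of querying Wolfram|Alpha's public Short Answers API.
# The endpoint and parameters reflect the public developer API, not Siri's
# internal integration; "DEMO-APP-ID" is a placeholder, not a working key.
import requests

def ask_wolfram(question: str, app_id: str = "DEMO-APP-ID") -> str:
    """Return a short plain-text answer to a fact-based question."""
    response = requests.get(
        "https://api.wolframalpha.com/v1/result",
        params={"appid": app_id, "i": question},
        timeout=10,
    )
    response.raise_for_status()
    return response.text  # e.g. a short phrase answering the question

# Example usage:
# print(ask_wolfram("How old was Queen Elizabeth II in 1974?"))
```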

In terms of a virtual assistant’s voice, after the databases have been trained, many companies hire local voice talent and have them read books, newspapers, web articles, and more. These recordings are transcribed to match words to sounds and to identify phonemes, the individual sounds that make up all speech. “They try to capture these phonemes spoken in every imaginable way: trailing off at the end of the word, harder at the beginning, longer before a pause, rising in a question. Each utterance has a slightly different sound wave…every sentence Siri speaks contains dozens or hundreds of these phonemes, assembled like magazine cut-outs in a ransom note. It’s likely that none of the words you hear Siri say were actually recorded the way they’re spoken” (Wired). As companies continue to hunt for the right voice talent, they run the speech of those who audition through the models they’ve built, looking for phoneme variability: “essentially, the sound-wave difference between the left and right side of each tiny utterance. More variability within a phoneme makes it hard to stitch a lot of them together in a natural-sounding way, but you’d never hear the problems listening to them speak. Only the computer sees the difference” (Wired). Once someone is found who sounds right to both human and computer, they are recorded for weeks at a time, and those recordings become the voice of the virtual assistant.
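To convince myself I understood the “ransom note” idea, here is a very rough sketch of concatenative synthesis: for each phoneme in a target sentence, pick one of many recorded variants, preferring the one that joins most smoothly onto what has already been assembled. The data structures, the greedy selection, and the join cost are illustrative simplifications on my part, not Apple’s method.

```python
# A hedged sketch of "ransom note" concatenative synthesis: stitch together
# recorded phoneme units, preferring units whose waveforms join smoothly.
import numpy as np

def join_cost(prev_unit: np.ndarray, next_unit: np.ndarray) -> float:
    """Penalize a large jump in the waveform where two units are stitched."""
    return abs(float(prev_unit[-1]) - float(next_unit[0]))

def synthesize(phoneme_sequence: list[str],
               recorded_units: dict[str, list[np.ndarray]]) -> np.ndarray:
    """Greedily pick, for each phoneme, the recorded variant that joins most
    smoothly onto the audio assembled so far, and concatenate them."""
    output = np.zeros(1)  # start from silence
    for phoneme in phoneme_sequence:
        candidates = recorded_units[phoneme]
        best = min(candidates, key=lambda unit: join_cost(output, unit))
        output = np.concatenate([output, best])
    return output
```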

References

Apple Machine Learning Research. “Hey Siri: An On-Device DNN-Powered Voice Trigger for Apple’s Personal Assistant.” Accessed March 15, 2021. https://machinelearning.apple.com/research/hey-siri.

“How Apple Finally Made Siri Sound More Human.” Wired. Accessed March 14, 2021. https://www.wired.com/story/how-apple-finally-made-siri-sound-more-human/.

“Virtual Assistant.” In Wikipedia, March 10, 2021. https://en.wikipedia.org/w/index.php?title=Virtual_assistant&oldid=1011400052.

“WolframAlpha.” In Wikipedia, March 13, 2021. https://en.wikipedia.org/w/index.php?title=WolframAlpha&oldid=1011910358.

Questions: 

I’m having a hard time understanding the mathematical side of the DNN. Can you please explain?
“The DNN consists mostly of matrix multiplications and logistic nonlinearities. Each ‘hidden’ layer is an intermediate representation discovered by the DNN during its training to convert the filter bank inputs to sound classes. The final nonlinearity is essentially a Softmax function (a.k.a. a general logistic or normalized exponential), but since we want log probabilities the actual math is somewhat simpler.” (Apple Machine Learning Research)
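Trying to answer part of my own question, here is a minimal numeric sketch of the quoted description: each hidden layer is a matrix multiplication followed by a logistic (sigmoid) nonlinearity, and the output layer is a softmax, returned here as log probabilities. The layer sizes are arbitrary choices for illustration, not Apple’s.

```python
# A minimal sketch of the quoted math: matrix multiplications, logistic
# nonlinearities for hidden layers, and a log-softmax output. Layer sizes
# are illustrative assumptions, not Apple's configuration.
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def log_softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max()  # numerical stability
    return x - np.log(np.exp(x).sum())

def dnn_forward(features: np.ndarray, weights: list[np.ndarray],
                biases: list[np.ndarray]) -> np.ndarray:
    """Hidden layers: matrix multiply + sigmoid. Output layer: log-softmax."""
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)
    return log_softmax(h @ weights[-1] + biases[-1])

# Example: 40 filter-bank inputs -> two hidden layers of 64 units -> 20 classes
rng = np.random.default_rng(0)
sizes = [40, 64, 64, 20]
weights = [rng.normal(size=(a, b)) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
log_probs = dnn_forward(rng.normal(size=40), weights, biases)
print(np.exp(log_probs).sum())  # ~1.0: a probability distribution over 20 classes
```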