“Hey Siri” – the DNN Acoustic Model

In our work de-blackboxing Google Translate, we learned that a DNN’s job is to solve a specific problem. In the case of Google Translate, the problem was to translate a phrase or sentence from one language (input) into another (output). In the case of a voice-triggered personal assistant, the DNN needs to decode a voice command (input) and perform a task or answer a question (output). Google Translate required a Recurrent Neural Network; Apple’s Siri voice trigger relies on a DNN acoustic model.

Layers / Process of Siri Voice Trigger 

(If using Siri on an iPhone)

  1. The microphone in your phone converts the sound of your voice into waveform samples.
  2. A spectrum analysis stage converts the waveform into a sequence of frames.
  3. About 20 frames at a time are fed to the Deep Neural Network (DNN).
  4. Then, “The DNN converts each acoustic pattern into a probability distribution over a set of speech sound classes: those used in the ‘Hey Siri’ phrase, plus silence and other speech, for a total of about 20 sound classes” (Siri Team, 2017). A minimal sketch of this pipeline follows the list below.
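
To make steps 1 through 4 concrete, here is a minimal sketch of the pipeline in Python. Everything in it is an assumption made for illustration: the frame length, the 13 features per frame, the window of 20 frames, and the untrained one-layer “DNN” all stand in for Apple’s real spectrum analysis and acoustic model, which are not public.

```python
import numpy as np

NUM_CLASSES = 20   # ~20 sound classes: "Hey Siri" phones, silence, other speech
FRAME_DIM = 13     # assumption: 13 spectral features per frame
WINDOW = 20        # ~20 frames are fed to the DNN at a time

def spectrum_analysis(waveform, frame_len=400, hop=160):
    """Toy stand-in for the spectrum analysis stage: slice the waveform into
    overlapping frames and reduce each one to FRAME_DIM features."""
    frames = []
    for start in range(0, len(waveform) - frame_len, hop):
        spectrum = np.abs(np.fft.rfft(waveform[start:start + frame_len]))
        frames.append(np.log1p(spectrum[:FRAME_DIM]))   # crude fixed-size features
    return np.array(frames)

def sound_class_probs(frame_window, weights, bias):
    """Toy one-layer 'DNN': map a window of frames to a probability
    distribution over the NUM_CLASSES sound classes with a softmax."""
    x = frame_window.reshape(-1)            # flatten WINDOW x FRAME_DIM features
    logits = weights @ x + bias
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Fake microphone input and untrained random weights, purely for illustration.
waveform = np.random.randn(16000)           # step 1: one second of samples at 16 kHz
frames = spectrum_analysis(waveform)        # step 2: waveform -> sequence of frames
W = 0.01 * np.random.randn(NUM_CLASSES, WINDOW * FRAME_DIM)
b = np.zeros(NUM_CLASSES)
probs = sound_class_probs(frames[:WINDOW], W, b)   # steps 3-4
print(round(probs.sum(), 3))                # 1.0: a distribution over sound classes
```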

 

(Image retrieved from: https://machinelearning.apple.com/2017/10/01/hey-siri.html)

According to Apple’s Machine Learning Journal article, an iPhone uses two networks (1. Detection, 2. Secondary Checker).

5. The acoustic pattern counts as detected when the outputs of the acoustic model accumulate a high enough phonetic score for the target phrase. Training further solidifies this process: over time, the more often the phrase is detected accurately, the more valid the sequence becomes. The accumulation is shown in the top layer of the image above as a recurrent network with connections to the same unit and to the next unit in the sequence (Siri Team, 2017). A hedged sketch of this scoring step follows below.
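
Here is that scoring step sketched under the assumption that detection amounts to accumulating per-frame evidence for the expected sequence of sound classes and firing when the total clears a threshold. The threshold and class ids below are made up, and Apple’s actual scoring is more involved than this.

```python
import numpy as np

THRESHOLD = -5.0   # illustrative trigger threshold, not Apple's value

def phrase_score(per_frame_probs, target_classes):
    """Accumulate phonetic evidence for the target phrase: sum the log-probability
    of the expected sound class at each frame. A higher (less negative) score
    means the audio matches the phrase better."""
    return sum(np.log(probs[c] + 1e-9)
               for probs, c in zip(per_frame_probs, target_classes))

def detect(per_frame_probs, target_classes):
    """Fire the voice trigger only when the accumulated score is high enough."""
    return phrase_score(per_frame_probs, target_classes) > THRESHOLD

# Made-up data: 6 frames of class probabilities and a hypothetical class sequence.
frames = np.random.dirichlet(np.ones(20), size=6)
target = [3, 3, 7, 7, 12, 12]
print(detect(frames, target))
```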

The “hidden” layers of the DNN hold representations learned during training, as the network is taught to map acoustic patterns (input) to sound classes (output).
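
As an illustration of what those hidden layers compute, here is a small fully connected network in the same spirit. The layer sizes, the sigmoid units, and the random (untrained) weights are all assumptions; in a trained model the weight matrices would hold the learned representations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    """One flattened window of acoustic features passes through the hidden
    layers (the learned representations) and ends in a softmax over sound classes."""
    for W, b in layers[:-1]:
        x = sigmoid(W @ x + b)          # hidden layer
    W, b = layers[-1]
    logits = W @ x + b                  # output layer
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Hypothetical sizes: 260 inputs (20 frames x 13 features), two hidden layers
# of 128 units, 20 output sound classes.
sizes = [260, 128, 128, 20]
rng = np.random.default_rng(0)
layers = [(rng.normal(0, 0.05, (o, i)), np.zeros(o)) for i, o in zip(sizes, sizes[1:])]
print(forward(rng.normal(size=260), layers).shape)   # (20,)
```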

In order to recreate Siri’s voice trigger system, these are the main components we would need:

Hardware, Software and Internet Services

  1. A device with Internet connection (phone, smartwatch, bluetooth device)
  2. A microphone 
  3. Detector
  4. An Acoustic Input (voice)
  5. Server (can provide updates to acoustic models)
  6. Deep Neural Network: two networks, 1. Detection and 2. Second Pass (see the sketch after this list)
  7. Training Process for the DNN
  8. Motion Coprocessor (to avoid using up battery life at all times the voice trigger is not being used)
  • Note: I have further questions about whether the additional components listed in the diagrams below are part of the main features above or whether they need to be included as separate entities.
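
To show how components 3, 6, and 8 fit together, here is a hedged sketch of the two-stage check. The scores and thresholds are invented; the point is only that the small, always-listening detector gates the larger second-pass network so that battery is not wasted on audio that clearly is not “Hey Siri”.

```python
import numpy as np

FIRST_PASS_THRESHOLD = 0.4    # illustrative values only
SECOND_PASS_THRESHOLD = 0.7

def small_detector_score(frame_window):
    """Stand-in for the small, low-power first network that is always listening."""
    return float(np.clip(frame_window.mean() + 0.5, 0.0, 1.0))

def large_checker_score(frame_window):
    """Stand-in for the larger second-pass network that re-scores the same audio."""
    return float(np.clip(frame_window.mean() + 0.5, 0.0, 1.0))

def voice_trigger(frame_window):
    """Two-stage check: the bigger network only runs when the small one fires,
    so most audio is rejected cheaply and the device stays asleep."""
    if small_detector_score(frame_window) < FIRST_PASS_THRESHOLD:
        return False
    return large_checker_score(frame_window) >= SECOND_PASS_THRESHOLD

print(voice_trigger(0.1 * np.random.randn(20, 13)))   # made-up window of frames
```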

This image presents the active speech input procedure as a flow chart and includes the process of ranking interpretations for semantic relevance (the process mentioned above), which was also a key feature of the Google Translate process.

(image retrieved from: https://patentimages.storage.googleapis.com/5d/2b/0e/08f5a9dd745178/US20120016678A1.pdf)
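
As a purely illustrative aside, the ranking step in that flow chart can be thought of as ordering candidate interpretations of an utterance by a semantic relevance score. The candidates and scores below are invented.

```python
def rank_interpretations(candidates):
    """Order candidate interpretations so the most semantically relevant comes first."""
    return sorted(candidates, key=lambda c: c["relevance"], reverse=True)

candidates = [
    {"text": "set a timer for ten minutes", "relevance": 0.92},
    {"text": "set a time for ten minutes",  "relevance": 0.35},
]
print(rank_interpretations(candidates)[0]["text"])   # "set a timer for ten minutes"
```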

Description of Automated Assistant from Apple Patent

“The conversation interface, and the ability to obtain information and perform follow-on task, are implemented, in at least some embodiments, by coordinating various components such as language components, dialog components, task management components, information management components and/or a plurality of external services” (Gruber et al., 2012).

This quote is illustrated by a useful image below, which helps to visualize the coordination of the components mentioned above.

(image retrieved from: https://patentimages.storage.googleapis.com/5d/2b/0e/08f5a9dd745178/US20120016678A1.pdf)
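
To make that coordination concrete in code form, here is a purely hypothetical sketch in which an utterance moves through language, dialog, information, and task steps before an external service is called. None of these function names or behaviors come from Apple’s implementation; they only mirror the kinds of components the patent names.

```python
def language_component(utterance):
    """Hypothetical parse: guess an intent and any slots mentioned in the utterance."""
    return {"intent": "weather" if "weather" in utterance else "unknown",
            "slots": {"city": "Paris"} if "Paris" in utterance else {}}

def dialog_component(parse, context):
    """Fill missing slots from conversational context (dialog management)."""
    return {"intent": parse["intent"], "slots": {**context, **parse["slots"]}}

def information_component(request):
    """Decide whether enough information is available to act (information management)."""
    request["ready"] = "city" in request["slots"]
    return request

def task_component(request, external_services):
    """Either ask a follow-up question or carry out the task via an external service."""
    if not request["ready"]:
        return "Which city do you mean?"
    return external_services[request["intent"]](request["slots"]["city"])

external_services = {"weather": lambda city: f"It is sunny in {city}."}
context = {"city": "Paris"}   # pretend an earlier turn mentioned Paris

parse = language_component("what's the weather")
request = information_component(dialog_component(parse, context))
print(task_component(request, external_services))   # "It is sunny in Paris."
```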

 

References

Siri Team. “Hey Siri: An On-Device DNN-Powered Voice Trigger for Apple’s Personal Assistant.” Apple Machine Learning Journal, 2017, https://machinelearning.apple.com/2017/10/01/hey-siri.html.
Gruber, Thomas Robert, et al. Intelligent Automated Assistant. US20120016678A1, 19 Jan. 2012, https://patents.google.com/patent/US20120016678A1/en.