Siri, launched by Apple Inc. in 2011, is by now a familiar voice assistant. It simplifies navigating an iPhone and carrying out requests by listening to and recognizing the user’s voice. For example, Siri can report the weather forecast, call one of the user’s contacts, or even tell a joke. The main technologies behind Siri are speech recognition and natural language processing, two significant application areas of machine learning.
Speech Recognition and Speaker Recognition
Speech recognition converts the acoustic signal of human speech into its corresponding text. It primarily examines “what the user says.” In addition to speech recognition, Siri uses speaker recognition, which focuses on “who is speaking,” to achieve personalization. For instance, the user can simply say “Hey Siri” to invoke Siri, but Siri is designed not to respond when someone other than the user says the same words. Speaker recognition involves two processes: enrollment and recognition. Enrollment occurs when the user follows the setup guidance on a new iPhone. The user is asked to say several sample phrases, from which a statistical model of the user’s voice is created. The five sample phrases are requested in the following order:
- “Hey Siri”
- “Hey Siri”
- “Hey Siri”
- “Hey Siri, how is the weather today?”
- “Hey Siri, it’s me.”
Figure 1. Block diagram of Personalized Hey Siri
The figure shows how Personalized Hey Siri proceeds. During Feature Extraction, the acoustic input is converted into a fixed-length speaker vector that captures phonetic information, background information about the environment, and the user’s identity. The speaker’s characteristics are then emphasized and the other factors, such as phonetic and environmental ones, are deemphasized so that recognition stays accurate under varied conditions. The five sample phrases therefore generate five speaker vectors, which are stored in the user profile on each Siri-enabled device.
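The enrollment and recognition steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (enroll, score, is_enrolled_user), the use of average cosine similarity as the comparison, the threshold of 0.8, and the toy three-dimensional vectors are all hypothetical and are not taken from the text, which does not say how the stored vectors are compared.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def enroll(speaker_vectors):
    """Store one speaker vector per enrollment phrase as the user profile."""
    return [list(v) for v in speaker_vectors]

def score(profile, test_vector):
    """Average similarity between a new utterance and the stored vectors.

    Averaging cosine similarities is an assumption made for this sketch.
    """
    return sum(cosine_similarity(v, test_vector) for v in profile) / len(profile)

def is_enrolled_user(profile, test_vector, threshold=0.8):
    """Accept the utterance only if the score reaches the threshold.

    The threshold value is hypothetical and chosen only for the toy data.
    """
    return score(profile, test_vector) >= threshold

# Toy 3-dimensional vectors standing in for real fixed-length speaker vectors.
# Five vectors, one for each of the five enrollment phrases.
profile = enroll([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [1.0, 0.0, 0.1],
    [0.9, 0.2, 0.0],
    [0.8, 0.1, 0.1],
])

same_speaker = [0.9, 0.1, 0.1]   # points in roughly the same direction
other_speaker = [0.1, 0.9, 0.3]  # points in a clearly different direction

print(is_enrolled_user(profile, same_speaker))
print(is_enrolled_user(profile, other_speaker))
```

With these toy values the first utterance is accepted and the second is rejected. A real system would derive its vectors from audio and tune the threshold on data.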
Natural Language Processing
Figure 2. Deep Neural Network in Siri
After Siri recognizes what the user is saying, the converted text is sent to Apple servers, where natural language processing algorithms examine the intent behind the user’s words. Figure 2 shows how a Deep Neural Network (DNN) works in Siri. The DNN “consists mostly of matrix multiplications and logistic nonlinearities. Each ‘hidden’ layer is an intermediate representation discovered by the DNN during its training to convert the filter bank inputs to sound classes. The final nonlinearity is essentially a Softmax function” (Siri Team, 2017).
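The structure in the quotation (matrix multiplications, logistic nonlinearities in the hidden layers, and a final Softmax) can be illustrated with a tiny forward pass. This is a minimal sketch, not Siri’s actual network: the layer sizes, weights, bias terms, and input values are made up for the example, and the function names are hypothetical.

```python
import math

def logistic(x):
    """Logistic (sigmoid) nonlinearity used in the hidden layers."""
    return 1.0 / (1.0 + math.exp(-x))

def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(layers, features):
    """Run the input through each (weights, biases) layer in turn.

    Hidden layers apply the logistic function; the last layer applies softmax.
    Bias terms are an assumption of this sketch.
    """
    activation = features
    for index, (weights, biases) in enumerate(layers):
        pre = [z + b for z, b in zip(matvec(weights, activation), biases)]
        if index < len(layers) - 1:
            activation = [logistic(z) for z in pre]
        else:
            activation = softmax(pre)
    return activation

# Toy network: 3 filter bank features -> 2 hidden units -> 3 sound classes.
# All numbers below are invented for illustration.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [-0.5, 0.5], [0.2, 0.2]], [0.0, 0.0, 0.0]),
]

probabilities = forward(layers, [0.2, 0.7, 0.1])
print(probabilities)
```

The output is one probability per sound class. In a trained network, the weights would be learned from data rather than written by hand.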
Alpaydin, Ethem. Machine Learning: The New AI. The MIT Press, 2017.
Siri Team. “Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant.” October 2017. https://machinelearning.apple.com/2017/10/01/hey-siri.html
Siri Team. “Personalized Hey Siri.” April 2018.
Goel, Aman. “How Does Siri Work? The Science Behind Siri.” Magoosh. Feb. 2, 2018.