Deblackboxing Siri as a virtual assistant

Virtual assistants are an emerging topic in the field of artificial intelligence. A virtual assistant performs tasks for its users based on verbal commands, and it is normally built into digital devices such as smartphones, personal computers, and smart speakers. Apple’s Siri is one of them. Siri is voice-activated by a personalized “Hey, Siri” trigger, after which it provides information or performs tasks as commanded. The procedure is composed of several layers, each responsible for a specific task or tasks, so it is clearer to deblackbox it layer by layer.

According to Apple’s patent application for “An intelligent automated assistant system”, a system for operating an intelligent automated assistant includes:

  • one or more processors that start with the Detector

A deep neural network (DNN) is used to detect “Hey Siri.” First, the microphone turns your voice into a stream of waveform samples, and spectrum analysis converts these samples into a sequence of frames, each describing the sound spectrum over a short window of time. The DNN then converts each of these acoustic patterns into a probability distribution over speech sound classes. “Hey Siri” is detected when the outputs of the acoustic model, taken over time, match the sequence expected for the target phrase with high enough confidence. Once Siri is activated, it can perform tasks as requested.
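
The detection step described above can be sketched roughly in Python. This is only a toy illustration under stated assumptions, not Apple's implementation: the two sound classes "HEY" and "SIRI", the threshold of -0.5, and the hand-written per-frame probabilities are all made up, and the spectrum analysis and the DNN itself are left out. The sketch only shows how a waveform can be cut into frames and how per-frame probabilities can be scored against an expected sequence.

```python
import math

def split_into_frames(samples, frame_len, hop):
    """Cut a waveform (a list of numbers) into overlapping fixed-size frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def phrase_score(frame_probs, target):
    """Best average log-probability of passing through `target` in order.

    frame_probs: one dict per frame, mapping a sound class to its probability
                 (in a real system these would come from the acoustic model).
    target:      sound classes that must occur in this order, each lasting
                 at least one frame.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * len(target)  # best[i]: best score ending in target[i]
    for t, probs in enumerate(frame_probs):
        new = [neg_inf] * len(target)
        for i, label in enumerate(target):
            stay = best[i]                              # remain in same class
            advance = best[i - 1] if i > 0 else neg_inf  # move to next class
            start = 0.0 if (i == 0 and t == 0) else neg_inf
            prev = max(stay, advance, start)
            if prev > neg_inf:
                new[i] = prev + math.log(max(probs.get(label, 0.0), 1e-12))
        best = new
    if best[-1] == neg_inf:
        return neg_inf
    return best[-1] / len(frame_probs)

def is_trigger(frame_probs, target, threshold=-0.5):
    """Report a detection when the phrase score exceeds the threshold."""
    return phrase_score(frame_probs, target) > threshold

TARGET = ["HEY", "SIRI"]  # hypothetical stand-ins for real sound classes
matching = [{"HEY": 0.9, "SIRI": 0.1},
            {"HEY": 0.8, "SIRI": 0.2},
            {"HEY": 0.1, "SIRI": 0.9}]
mismatch = [{"HEY": 0.1, "SIRI": 0.9},
            {"HEY": 0.1, "SIRI": 0.9},
            {"HEY": 0.9, "SIRI": 0.1}]

print(is_trigger(matching, TARGET))  # True
print(is_trigger(mismatch, TARGET))  # False
```

Averaging the log-probabilities over the frames keeps the score comparable between short and long utterances, which is why a single fixed threshold can be used in this sketch.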

  • memory storing instructions that cause the processors to perform operations, including
  • obtaining a text string from a speech input received from a user

For example, if I want my iPhone to call my mom while I am driving, I say “Hey, Siri” to activate Siri and then say “call Mom” to give the command. Through speech recognition, my speech is turned into a text string that can be processed by the processor.

  • interpreting the received text string to derive a representation of user intent

Through natural language processing (NLP), the processor interprets “call Mom” as an instruction to dial the person saved as “Mom” in the contacts.

  • identifying at least one domain, a task, and at least one parameter for the task, based at least in part on the representation of user intent

After interpretation, this layer links my instruction to the “Phone” domain, identifies the task (placing a call) and its parameter (the contact “Mom”), and opens the “Phone” function.
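
To make the idea of a domain, a task, and parameters more concrete, here is a minimal sketch in Python. It is only an illustration: the `Intent` class, the `RULES` table, and the task names `dial` and `send_message` are hypothetical, and a real assistant would rely on far richer language models than two regular expressions.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Intent:
    """A toy representation of user intent."""
    domain: str
    task: str
    parameters: dict = field(default_factory=dict)

# Hypothetical rules: (pattern, domain, task). Named groups become parameters.
RULES = [
    (re.compile(r"^call (?P<contact>.+)$", re.IGNORECASE),
     "Phone", "dial"),
    (re.compile(r"^text (?P<contact>\S+) (?P<message>.+)$", re.IGNORECASE),
     "Messages", "send_message"),
]

def parse_intent(text):
    """Map a recognized text string to an Intent, or None if no rule fits."""
    cleaned = text.strip().rstrip(".!?")
    for pattern, domain, task in RULES:
        match = pattern.match(cleaned)
        if match:
            return Intent(domain, task, match.groupdict())
    return None

print(parse_intent("call Mom"))
```

With this sketch, “call Mom” maps to the “Phone” domain, the task `dial`, and the single parameter `contact = "Mom"`, while an unrecognized sentence yields `None`.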

  • performing the identified task

My iPhone calls “Mom” using the phone number I saved in the contacts.

  • providing an output to the user, wherein the output is related to the performance of the task.
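
The last two steps, performing the task and reporting back to the user, might look like the following Python sketch. Everything here is assumed for illustration: the `CONTACTS` table and its phone number are made up, the handler names are hypothetical, and the actual call is replaced by a returned confirmation sentence.

```python
# Made-up contact list standing in for the phone's address book.
CONTACTS = {"Mom": "555-0100"}

def dial(parameters, contacts):
    """Look up the contact and return the output shown or spoken to the user."""
    name = parameters["contact"]
    number = contacts.get(name)
    if number is None:
        return f"I couldn't find {name} in your contacts."
    # A real assistant would hand the number to the phone app at this point.
    return f"Calling {name} at {number}."

# Hypothetical dispatch table: (domain, task) -> function that performs it.
HANDLERS = {("Phone", "dial"): dial}

def perform(domain, task, parameters, contacts):
    """Run the identified task and return an output related to its result."""
    handler = HANDLERS.get((domain, task))
    if handler is None:
        return "Sorry, I can't help with that yet."
    return handler(parameters, contacts)

print(perform("Phone", "dial", {"contact": "Mom"}, CONTACTS))
# prints: Calling Mom at 555-0100.
```

Keeping the handlers in a lookup table means a new domain can be added without changing `perform`, which mirrors the layered structure of the procedure.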

The procedure above is a simplified version of how Siri receives and performs our verbal instructions. Each of these layers in turn contains further nested, complicated layers that are still waiting to be deblackboxed.



Apple Machine Learning Journal (1/9, April 2018): “Personalized ‘Hey Siri’.”