Siri’s Road to Accurate Speech Recognition

The chain of operations that eventually transforms sound patterns into data for Siri begins with an acoustic wave detector; on mobile devices this is the Always On Processor (AOP), a low-power auxiliary processor embedded in the M-series motion coprocessor. The significance of this processor is that it does not require the main processor to be running in order to listen for Siri's activation phrase. The AOP detects the acoustic waves associated with the phrase "Hey Siri" by converting the incoming sound into Mel Frequency Cepstrum Coefficients (MFCCs), a compact numerical representation of the audio. These coefficients are collected into frames and held in a buffer in RAM. A window of frames is then fed into a small DNN with 32 hidden units per layer, which, combined with a Hidden Markov Model (HMM), a statistical model, produces a score that decides whether or not to activate the main processor. Once the main processor is activated, a larger DNN with 192 hidden units per layer is used, and this DNN also relies on HMM-based scoring to produce the most accurate interpretation of the speech.
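The wake-word step described above can be sketched in a few lines: a window of MFCC frames is flattened and passed through a tiny network whose output score is compared against a threshold. This is a minimal illustration only, not Apple's actual model; the dimensions, the single hidden layer, the random weights, and the threshold value are all assumptions for demonstration (the on-device network has several layers, trained weights, and a tuned threshold).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 13 MFCCs per frame, a 20-frame window,
# and one hidden layer of 32 units (the real network stacks several
# such layers; these numbers are illustrative).
N_MFCC, N_FRAMES, N_HIDDEN = 13, 20, 32

# Randomly initialised weights stand in for a trained model.
W1 = rng.standard_normal((N_HIDDEN, N_MFCC * N_FRAMES)) * 0.1
b1 = np.zeros(N_HIDDEN)
w2 = rng.standard_normal(N_HIDDEN) * 0.1
b2 = 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def trigger_score(mfcc_window):
    """Score one buffered window of MFCC frames; returns a value in [0, 1]."""
    x = mfcc_window.reshape(-1)        # flatten the frames into one vector
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer of 32 units (ReLU)
    return sigmoid(w2 @ h + b2)        # probability-like wake score

THRESHOLD = 0.9  # illustrative; the real cutoff is tuned against false triggers

window = rng.standard_normal((N_FRAMES, N_MFCC))  # stand-in for real MFCCs
score = trigger_score(window)
wake_main_processor = score > THRESHOLD
```

If the score clears the threshold, the main processor wakes and re-checks the audio with the larger network; otherwise the AOP keeps buffering frames and scoring new windows.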

The road to accurate speech recognition has been a long process that required cruder techniques in Siri's early stages. Initially, Siri required the user to activate it manually before giving commands; this allowed teams at Apple to collect data from those early interactions for later use with voice activation. The early stages of Siri thus provided a speech corpus (a database of audio files) for later versions of Siri to draw on, and these larger audio databases made the DNNs coupled with HMMs more accurate. Supervised training with the standard backpropagation algorithm is used to reduce errors, and stochastic gradient descent is used to optimize the models. Siri, like most machine-learning-based programs, is a work in progress, and can only improve as more data and more efficient algorithms become available.
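The training loop mentioned above can be illustrated on a toy problem: stochastic gradient descent repeatedly nudges the model's weights in the direction that reduces the error on one example at a time. This is a minimal sketch on a made-up linear task, not Apple's acoustic-model training pipeline; the data, learning rate, and epoch count are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy supervised task standing in for acoustic-model training:
# learn weights w so that X @ w approximates the labels y.
X = rng.standard_normal((200, 5))
true_w = rng.standard_normal(5)
y = X @ true_w

w = np.zeros(5)   # model parameters, initialised at zero
lr = 0.05         # learning rate (step size), chosen for this toy example

def loss(w):
    """Mean squared error over the whole dataset (the supervised error signal)."""
    return np.mean((X @ w - y) ** 2)

initial_loss = loss(w)
for epoch in range(20):
    for i in rng.permutation(len(X)):           # "stochastic": one example at a time
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]   # gradient of this example's squared error
        w -= lr * grad                          # SGD update: step against the gradient
final_loss = loss(w)
```

In a real DNN the per-example gradient is computed by backpropagation through all of the hidden layers rather than by the one-line formula above, but the update rule, subtracting the learning rate times the gradient, is the same.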


How do bandwidth limitations affect the accuracy of speech recognition and virtual assistants?

I understand that hidden units are mathematical functions used within a DNN, but how are they separated, or are they separated at all? Why are the hidden units in a DNN arranged in layers?

In Apple’s breakdown of how Siri works, it glosses over the lower levels of sound-wave input into the device and does not break down how the sound waves become data; it simply says “acoustic input.” What hardware in the phone transforms the sound waves into electrical signals?

At what stage in the process is sound transformed into text, and does this involve interaction with NLP in conjunction with the speech-recognition process?

Lastly, it is still unclear to me what purpose the framebuffer serves in the operations leading to speech recognition.


Acoustic model. (2020). In Wikipedia.

Backpropagation algorithm—An overview | ScienceDirect Topics. (n.d.). Retrieved March 15, 2021.

Framebuffer. (2020). In Wikipedia.

Hey Siri: An on-device DNN-powered voice trigger for Apple’s personal assistant. (n.d.). Apple Machine Learning Research. Retrieved March 15, 2021.

Hidden layer. (2019, May 17). DeepAI.

Mel-frequency cepstrum. (2020). In Wikipedia.

Paramonov, P., & Sutula, N. (2016). Simplified scoring methods for HMM-based speech recognition. Soft Computing, 20(9), 3455–3460.

Stochastic gradient descent. (2021). In Wikipedia.