De-black boxing of virtual assistant (Alexa)


Week 8

Alexa is a well-known AI-based virtual assistant developed by Amazon and first released in 2014 (Wikipedia, 2021). Alexa can play music, interact by voice, make to-do lists, set alarms, provide weather information, etc. (Wikipedia, 2021). We can use Alexa as a home automation system to control our smart devices (Wikipedia, 2021) (Amazon, 2021). Besides that, we can extend Alexa's functionality by installing add-ons called skills. Device manufacturers can integrate Alexa voice capabilities into their products using the Alexa Voice Service. In this way, any product built with this cloud-based service has access to a set of automatic speech recognition and natural language processing capabilities. Amazon uses long short-term memory (LSTM) networks for generating voices (Amazon, 2021). In 2016, Amazon released Lex, making speech recognition and natural language processing (NLP) available for developers to create their own chatbots (Barr, 2016). Less than a year later, Lex became generally available (Barr, 2017). Now, web and mobile chat is available using Amazon Connect (Hunt, 2019).

The main hardware components of a virtual assistant device include a light ring, a volume ring to control the voice level, a microphone array used to detect, listen to, and record our voices, a power port to charge the device, and an audio output. The virtual assistant then recognizes the voice and stores the conversation in the cloud.

De-black boxing of virtual assistant (United States Patent No. US2012/0016678 A1, 2012)

Level 0:

Here, the virtual assistant is just a black box whose input is a voice command from the user and whose output is a voice response. Fig.1 shows the black box of the virtual assistant (Alexa, for example).

Fig1. Black box of Virtual Assistant

Level 1:

For level-1 de-black boxing, we can see the following components:

  • ASR (Automatic Speech Recognition): Converts speech to text.
  • NLU (Natural Language Understanding): Interprets the text as a list of possible intents (commands).
  • Dialog manager: Looks at each intent and determines whether it can handle it. Predefined rules determine which speechlet processes the request.
  • Data store: Stores the text of the response.
  • Text to speech: Translates skill outputs into an audible voice.
  • The third-party skill: Written by a third party, who is responsible for the skill's actions and operations.

Fig.2 shows the level-1 de-black boxing of a virtual assistant (Alexa).

Fig2. Level-1 of De-black boxing of the Alexa System
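The level-1 components can be sketched as a chain of functions. This is only a toy illustration, not Alexa's actual implementation: every function below is a hypothetical stand-in (the "audio" is already a string, the NLU is a single rule, and the data store is left out), and names such as `music_skill` and `SKILL_RULES` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Intent:
    name: str
    slots: Dict[str, str] = field(default_factory=dict)


def asr(audio: str) -> str:
    """Stand-in for ASR: the 'audio' here is already a transcript."""
    return audio.lower().strip()


def nlu(text: str) -> List[Intent]:
    """Stand-in for NLU: map the text to a list of candidate intents."""
    if text.startswith("play "):
        return [Intent("PlayMusicIntent", {"query": text[len("play "):]})]
    return [Intent("FallbackIntent")]


def music_skill(intent: Intent) -> str:
    """Hypothetical third-party skill that handles PlayMusicIntent."""
    return f"Playing {intent.slots['query']}"


# Dialog-manager rules (assumed): which skill handles which intent.
SKILL_RULES = {"PlayMusicIntent": music_skill}


def dialog_manager(intents: List[Intent]) -> str:
    """Return the reply of the first intent that some skill can handle."""
    for intent in intents:
        skill = SKILL_RULES.get(intent.name)
        if skill is not None:
            return skill(intent)
    return "Sorry, I can't help with that."


def tts(text: str) -> str:
    """Stand-in for TTS: tag the text instead of synthesizing audio."""
    return f"<speech>{text}</speech>"


def handle(audio: str) -> str:
    """Run one request through ASR, NLU, dialog manager/skill, and TTS."""
    return tts(dialog_manager(nlu(asr(audio))))
```

Calling `handle("Play jazz")` walks the request through every stage in order; an utterance with no matching rule falls through to the dialog manager's default reply.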

Level 2:

De-black box the ASR

The acoustic front-end converts the speech signal into corresponding features (speech parameters) via a process called feature extraction. The parameters of the word/phone models are estimated from the acoustic vectors of the training data. The decoder searches through all possible word sequences to find the sequence of words that is most likely to have generated the observed features. In the training phase, the operator reads all the vocabulary words and the resulting word patterns are stored. Later, in the recognition step, the incoming word pattern is compared to the stored patterns, and the word that gives the best match is selected. Fig3 illustrates the de-black box of ASR.

Fig3. Level2 of De-black box the ASR
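The "compare with stored patterns and pick the best match" step can be illustrated with a small sketch. It assumes each word pattern is a short sequence of one-dimensional feature values and uses dynamic time warping (DTW) as the distance; DTW is a classic choice for template matching but is an assumption here, not something the source specifies. The templates in `TEMPLATES` are made-up numbers.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(features, templates):
    """Return the vocabulary word whose stored pattern matches best."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))


# Patterns stored during the training phase (hypothetical values).
TEMPLATES = {
    "yes": [1.0, 3.0, 2.0],
    "no": [4.0, 4.0, 1.0, 0.0],
}
```

Because DTW allows the two sequences to be stretched against each other, a slower utterance such as `[1.0, 1.0, 3.0, 3.0, 2.0]` still aligns with the stored "yes" pattern at zero cost.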

De-black box of NLU (Natural Language Understanding)

Intent Classification (IC) and Named Entity Recognition (NER) use machine learning to recognize natural language variation. To identify and categorize key information (entities) in text, we need the NER part of NLU. NER is a form of NLP that involves two steps: detecting the named entity and categorizing it. In step 1, NER detects a word or string of words that forms a whole entity. Each word is a token: “The Great Lakes” is a string of three tokens representing one entity. The second step requires the creation of entity categories such as person, organization, location, etc. IC labels each utterance with one intent from a predetermined set of intents. Domain Classification is a text classification model that determines the target domain for a given query. It is trained using many labelled queries across all domains in an application. Entity Resolution is the last part of NLU; it disambiguates records that correspond to real-world entities across and within datasets. So, for a request to play Creedence Clearwater Revival in which the user says “CCR”, the NER output will be “CCR (ArtistName)”, the domain classifier output is “music”, the IC output is “PlayMusicIntent”, and the entity resolution output will be “Creedence Clearwater Revival”. Fig.4 includes the de-black box of the NLU.

Fig4. Level2 of De-black box the NLU
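The four NLU steps can be traced on the "CCR" example with a toy sketch. Real systems use trained models for each step; here every step is a hand-written rule, and the `ARTIST_CATALOG` lookup table and all function names are assumptions made for illustration.

```python
# Hypothetical catalog used for both entity detection and resolution.
ARTIST_CATALOG = {
    "ccr": "Creedence Clearwater Revival",
    "creedence clearwater revival": "Creedence Clearwater Revival",
}


def classify_domain(text):
    """Domain classification stand-in: a keyword rule instead of a model."""
    return "music" if "play" in text.split() else "unknown"


def classify_intent(domain):
    """Intent classification stand-in: one intent per known domain."""
    return "PlayMusicIntent" if domain == "music" else "UnknownIntent"


def recognize_entities(text):
    """NER stand-in: detect known mentions, then label their category."""
    padded = f" {text} "
    return [(mention, "ArtistName")
            for mention in ARTIST_CATALOG
            if f" {mention} " in padded]


def understand(utterance):
    """Run domain classification, IC, NER, and entity resolution."""
    text = utterance.lower().strip()
    domain = classify_domain(text)
    entities = recognize_entities(text)
    return {
        "domain": domain,
        "intent": classify_intent(domain),
        "entities": entities,
        # Entity resolution: map each mention to its canonical record.
        "resolved": [ARTIST_CATALOG[mention] for mention, _ in entities],
    }
```

For the utterance "Play CCR", `understand` reports the music domain, `PlayMusicIntent`, the mention `ccr` tagged as `ArtistName`, and the resolved name "Creedence Clearwater Revival".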

Dialog Manager (DM)

DM selects what to report or say back to the user, whether to take any action, and how to handle the conversation. DM includes two parts: dialog state tracking, which estimates the user’s goals by taking the dialog context as input, and the dialog policy, which generates the next system action. Dialog state tracking can be done using a recurrent neural network (RNN) or a neural belief tracker (NBT), while the dialog policy can be learned using reinforcement learning (RL). Fig.5 shows Level2 of De-black box the DM.

Fig5. Level2 of De-black box the DM
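The split between state tracking and policy can be shown with a rule-based sketch. It deliberately replaces the RNN/NBT tracker and the RL policy with plain dictionaries and if-statements, so it only illustrates the interface between the two parts; `REQUIRED_SLOTS` and the action names are hypothetical.

```python
# Assumed configuration: slots each intent needs before it can run.
REQUIRED_SLOTS = {"PlayMusicIntent": ["artist"]}


def track_state(state, nlu_result):
    """State tracking stand-in: merge the new turn into the dialog context."""
    return {
        "intent": nlu_result.get("intent") or state.get("intent"),
        "slots": {**state.get("slots", {}), **nlu_result.get("slots", {})},
    }


def policy(state):
    """Policy stand-in: choose the next system action from the state."""
    intent = state["intent"]
    if intent not in REQUIRED_SLOTS:
        return ("reject", None)
    for slot in REQUIRED_SLOTS[intent]:
        if slot not in state["slots"]:
            return ("ask_slot", slot)  # keep the conversation going
    return ("execute", intent)


# Turn 1: the user asks for music without naming an artist.
state = track_state({}, {"intent": "PlayMusicIntent"})
first_action = policy(state)

# Turn 2: the user supplies the artist; the intent is carried over.
state = track_state(state, {"slots": {"artist": "Creedence Clearwater Revival"}})
second_action = policy(state)
```

In this two-turn exchange the policy first asks for the missing `artist` slot and then, once the tracker has merged the second turn into the context, decides to execute the intent.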

De-black box of Text-to-Speech (TTS)

The last part of the virtual assistant allows computers to read text aloud. The linguistic front-end converts the input text into a sequence of features such as phonemes and sentence type. The prosody model predicts the patterns and melody that form the expressive qualities of natural speech. The acoustic model transforms the linguistic and prosodic information into frame-rate spectral features. Those features are fed into the neural vocoder and used to train a lighter and smaller vocoder. The neural vocoder generates a 24 kHz speech waveform. It consists of a convolutional neural network that expands the input feature vectors from the frame rate to the sample rate, and a recurrent neural network that synthesizes audio samples auto-regressively at 24,000 samples per second. Fig6 includes the details of TTS.

Fig6. Level2 of De-black box the TTS
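The step from frame rate to sample rate is mostly arithmetic, which a short sketch can make concrete. The 24 kHz output rate comes from the text; the 5 ms frame shift is an assumption chosen for illustration, and simple repetition of each frame stands in for the convolutional upsampling network.

```python
SAMPLE_RATE_HZ = 24_000  # from the text: 24,000 samples per second
FRAME_SHIFT_MS = 5       # assumption: the frame shift is not given in the text


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_shift_ms=FRAME_SHIFT_MS):
    """Number of audio samples covered by one feature frame."""
    return sample_rate_hz * frame_shift_ms // 1000


def upsample(frames, factor):
    """Repeat each frame-level feature vector `factor` times.

    This nearest-neighbour repetition only illustrates the change of rate;
    a neural vocoder would learn the expansion instead.
    """
    return [frame for frame in frames for _ in range(factor)]


# Two frames of hypothetical two-dimensional spectral features.
frames = [[0.1, 0.2], [0.3, 0.4]]
conditioning = upsample(frames, samples_per_frame())
```

Under the assumed 5 ms frame shift, each frame covers 120 samples, so 200 frames per second expand to the 24,000 conditioning vectors per second that the sample-rate network consumes.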

Fig7 shows the De-black boxing of the Alexa Echo system.

Fig7. Level2 of De-black boxing of the Alexa System (De-black boxing of Echo system)


Amazon. (2021). Amazon Lex.

United States Patent No. US2012/0016678 A1. (2012).

Jeff Barr. (2016). Amazon Lex: Build Conversational Voice & Text Interfaces. Retrieved from AWS News Blog.

Jeff Barr. (2017). Amazon Lex – Now Generally Available. Retrieved from AWS News Blog.

Randall Hunt. (2019). Amazon Connect. Retrieved from AWS Contact Center.

Wikipedia. (2021). Amazon Alexa. Retrieved from Wikipedia.

Wikipedia. (2021). Virtual assistant. Retrieved from Wikipedia.


About Heba Khashogji
