Deblackboxing “Translation” Paper on the Amazon Echo Plus
Heba Khashogji
Abstract
Smart speakers have gained wide popularity among many users because they offer convenience. Still, some users feel that this type of device violates their privacy and that there is no point in using it. In this paper, we discuss the Amazon Echo Plus, its main components, and how it works step by step, following the “deblackboxing” method.
Introduction
At present, smart speakers have become widely used. According to Jakob and Wilhelm (2020), Amazon dominates the smart speaker market along with Google, Xiaomi, and others; among its speakers is the Amazon Echo family.
In this paper, we will talk specifically about the Amazon Echo Plus smart speaker. This smart speaker is powered by Amazon’s cloud-based voice service, known as Alexa. Smart speakers have many uses, including healthcare for the elderly. Ries and Sugihara (2018) and Robinson et al. (2014) claimed that the technology has proven able to provide healthcare thanks to its existing capabilities; for example, the Amazon Echo Plus has been used as an alternative to human caregivers in the early stages of dementia. These devices also rely on the Internet of Things (IoT), which makes it possible to control home appliances by voice recognition. Through such devices, one can also listen to on-demand music by any artist or genre from platforms such as Spotify, Amazon Music, and Apple Music.
- Systems Thinking
Amazon Echo Plus starts working when it hears the word “Alexa” from a user. “Alexa” refers to Amazon’s virtual assistant. This wake word can later be changed to “Echo,” “Amazon,” or “Computer.”
When the virtual assistant hears the wake word, it starts working, and the light ring at the top glows blue. Echo Plus can then be asked any question, for example about the weather, and it answers with a summary of what the weather will be like during the day.
Echo Plus has a built-in hub that supports and controls ZigBee smart devices, such as light bulbs and door locks; these can be bound to the home assistant by asking Alexa to “discover devices.” When the user asks Amazon Echo Plus a question by voice command, Echo Plus records the audio and sends it to Amazon’s cloud servers. These servers convert the recorded voice into text, which is analyzed so that Alexa can find the best way to answer it. The answer is converted back into audio, which is sent to the Echo Plus smart speaker to play the response (Rak et al., 2020).
Amazon Echo Plus features local voice control, allowing us to control home devices without any internet connection. However, to listen to music from Spotify or Amazon Music, an internet connection is required.
- Design Thinking and Semiotic Thinking
Below is a simple example that shows how the Amazon Echo Plus works. We will assume in this example that the user invokes a “Hello World” skill for the purpose of illustration:
First, the user speaks to the device; when it hears the wake word “Alexa,” it starts to listen. Second, the Amazon Echo Plus sends the speech to the Alexa service in the cloud for speech recognition, where it is converted into text, and natural language processing is performed to identify the purpose of the request. Third, Alexa sends a JSON file containing the request to a Lambda function, which handles it. Lambda is an Amazon Web Services offering that runs a user’s code only when needed, so there is no need to run servers continuously. In our example, the Lambda function returns “Welcome to the Hello World” and sends it to the Alexa service. Fourth, Alexa receives the JSON response and converts the resulting text into an audio file. Finally, the Amazon Echo Plus receives the audio and plays it for the user. Figure 1 below shows how the user interacts with the Amazon Echo Plus device (Amazon Alexa, n.d.).
Figure 1: User Interaction with Amazon Echo Plus (Alexa Developer, n.d.)
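To make this flow concrete, below is a minimal sketch of such a Lambda function, written with the ASK SDK for Python. The handler class name is illustrative, and the response string follows the example above; this is a sketch, not Amazon's tutorial code.

```python
# A minimal sketch of the Lambda function from the example above, using
# the ASK SDK for Python. The handler class name is illustrative.
from ask_sdk_core.skill_builder import SkillBuilder
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_intent_name

class HelloWorldIntentHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        # Fires when the Alexa service maps the utterance to HelloWorldIntent.
        return is_intent_name("HelloWorldIntent")(handler_input)

    def handle(self, handler_input):
        # This text goes back to the Alexa service, which converts it to audio.
        return (handler_input.response_builder
                .speak("Welcome to the Hello World")
                .response)

sb = SkillBuilder()
sb.add_request_handler(HelloWorldIntentHandler())

# Lambda entry point: the Alexa service POSTs the intent JSON here.
lambda_handler = sb.lambda_handler()
```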
- JSON (Intent/Response)
“JavaScript Object Notation” (JSON) is a data format that structures data and is used chiefly by web applications for communication. JSON syntax is based on JavaScript object notation syntax (Wazeed, 2018):
- Data is in name/value pairs. Example: {"fruit": "Banana"}.
- Data is separated by commas. Example: {"fruit": "Banana", "color": "yellow"}.
- Curly braces hold objects.
Figure 2 shows an example of JSON code; inside the intents array, there is a HelloWorldIntent and one of the built-in intents, AMAZON.HelpIntent. AMAZON.HelpIntent responds to utterances that contain words or phrases indicating that the user needs help, such as “help.” Alexa creates such an intent JSON file after it converts speech to text.
Figure 2: An Example of JSON Code (Ralevic, 2018)
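As a concrete counterpart to figure 2, the snippet below sketches what such an interaction model might look like, built as a Python dictionary and serialized with json.dumps. Only the two intent names come from the figure; the invocation name and sample utterances are assumptions.

```python
import json

# A sketch of an interaction model like the one in figure 2. The sample
# utterances are illustrative; AMAZON.HelpIntent is built in, so it
# needs no samples of its own.
interaction_model = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "hello world",
            "intents": [
                {"name": "HelloWorldIntent",
                 "samples": ["hello world", "say hello"]},
                {"name": "AMAZON.HelpIntent", "samples": []},
            ],
        }
    }
}

print(json.dumps(interaction_model, indent=2))
```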
- Text to Speech System
Text-to-speech conversion is done in several stages. The input to a text-to-speech (TTS) system is text, which is analyzed and converted into a phonetic description, after which prosody is generated. The main units of the TTS architecture are as follows (Isewon et al., 2014); figure 3 shows the system:
Figure 3: Text to Speech System (Isewon et al., 2014)
- Natural Language Processing (NLP) unit: it produces a phonetic transcription of the input text. The primary operations of the NLP unit are as follows:
- Text analysis: first, the text is decomposed into tokens. Token-to-word conversion creates the orthographic form of each token; for example, the token “Mr” is expanded to “Mister.”
- Application of pronunciation rules: after the first stage is complete, pronunciation rules are applied. In some cases a letter corresponds to no sound (for example, the “g” in “sign”), or several letters correspond to a single phoneme (for example, the “ch” in “teacher”). There are two approaches to determining pronunciation:
- Dictionary-based with morphological components: as many words as possible are stored in a dictionary. Pronunciation rules determine the pronunciation of words not found in the dictionary.
- Rule-based: pronunciations are created from phonological knowledge; only words whose pronunciation is an exception are included in the dictionary.
If the dictionary-based method has a sufficiently large phonetic dictionary, it will be more exact than the rule-based method (a small sketch of this idea follows this list).
- Prosody generation: after the pronunciation is specified, the prosody is created. Prosody is essential for conveying an affective state: if a person says, “It is a delicious pizza,” the intonation can reveal whether that person likes the pizza or not. A TTS system models many factors, such as intonation (phrasing and accentuation), amplitude, and duration (including sound length and pauses, which determine syllable length and speech tempo) (Isewon et al., 2014).
- Digital Signal Processing (DSP) unit: it converts the symbolic information received from the NLP unit into intelligible speech.
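As a toy illustration of the dictionary-based approach with a rule-based fallback, the sketch below uses a hand-made mini-lexicon; the phoneme symbols and the fallback rule are invented for illustration, not a real TTS front end.

```python
# A toy sketch of dictionary-based pronunciation with a rule fallback.
# The mini-lexicon and the one-phoneme-per-letter rule are illustrative.
PHONETIC_DICT = {
    "mister": ["M", "IH", "S", "T", "ER"],
    "teacher": ["T", "IY", "CH", "ER"],   # "ch" maps to a single phoneme
    "sign": ["S", "AY", "N"],             # the "g" maps to no sound
}

ABBREVIATIONS = {"mr": "mister"}  # token-to-word expansion ("Mr" -> "Mister")

def letter_to_sound(word: str) -> list[str]:
    # Crude rule-based fallback: one phoneme per letter.
    return [ch.upper() for ch in word]

def pronounce(token: str) -> list[str]:
    word = ABBREVIATIONS.get(token.lower(), token.lower())
    # Dictionary lookup first; fall back to rules for unknown words.
    return PHONETIC_DICT.get(word, letter_to_sound(word))

print(pronounce("Mr"))       # ['M', 'IH', 'S', 'T', 'ER']
print(pronounce("sign"))     # ['S', 'AY', 'N']
print(pronounce("hello"))    # fallback: ['H', 'E', 'L', 'L', 'O']
```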
- Convert Text to Tokens
Alexa divides speech into tokens according to the following (Gonfalonieri, 2018; Trivedi et al., 2018):
- The wake-up word: the wake-up word “Alexa” tells the Amazon Echo Plus to start listening to the user’s commands.
- Launch word: a launch word is a transitional action word indicating to Alexa that a skill invocation will likely follow. Typical launch words include “tell,” “ask,” and “open.”
- Invocation name: to initiate an interaction with a skill, the user says the skill’s invocation name. For example, to use the weather skill, a user could say, “Alexa, what’s the weather today?”
- Utterance: simply put, an utterance is the user’s spoken request. Utterances can invoke a skill and provide input to it.
- Prompt: a string of text that is spoken to the user to request information. Prompt text is included in the response to a user request.
- Intent: an action that fulfills the user’s spoken request. Intents can optionally take arguments called slots.
- Slot value: slots are input values provided in the user’s spoken request. These values help Alexa understand the user’s intent.
Figure 4 shows the user giving input information, a travel date of Friday. This value fills an intent slot, which Alexa passes to Lambda so the skill code can process it (a toy sketch of this decomposition follows figure 4).
Figure 4: Dividing Words Into Tokens (Amazon Alexa, n.d.)
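As a toy illustration of this decomposition (not Alexa’s actual parser, which is statistical rather than rule-based), the sketch below labels the token types in a request shaped like the one in figure 4; the invocation name “Plan My Trip” and the slot pattern are assumptions.

```python
import re

# A toy labeler for the token types above. The invocation name and the
# slot-filling pattern are illustrative; Alexa's real NLU is not regex-based.
REQUEST = "Alexa ask Plan My Trip to book a trip on Friday"

pattern = re.compile(
    r"(?P<wake>Alexa)\s+"                         # wake-up word
    r"(?P<launch>ask|tell|open)\s+"               # launch word
    r"(?P<invocation>Plan My Trip)\s+"            # invocation name
    r"(?P<utterance>.*on (?P<travel_date>\w+))$"  # utterance with a slot
)

match = pattern.match(REQUEST)
if match:
    print("wake word:      ", match["wake"])         # Alexa
    print("launch word:    ", match["launch"])       # ask
    print("invocation name:", match["invocation"])   # Plan My Trip
    print("utterance:      ", match["utterance"])    # to book a trip on Friday
    print("travelDate slot:", match["travel_date"])  # Friday
```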
- Speech Recognition
Speech recognition is the machine’s ability to identify words and phrases in spoken language and convert them into text that the machine can handle (Trivedi et al., 2018). There are three approaches by which a computer matches speech with stored phonetics:
- Acoustic-phonetic approach: the Hidden Markov Model (HMM) is used in this approach. An HMM is a non-deterministic probabilistic model for speech recognition. It involves two kinds of variables: the hidden states, which are the phonemes stored in computer memory, and the visible observations, the frequency segments of the digital signal. Each phoneme has a probability, and each segment is matched with a phoneme according to that probability. The matched phonemes are then assembled into correct words according to the language’s previously stored grammar rules (a small sketch follows figure 5).
- Pattern recognition approach: speech recognition is one of the areas of pattern recognition and falls under what is known as supervised learning. In supervised learning, we have a dataset in which both the input (the audio signal) and the output (the text corresponding to the audio signal) are known. The dataset is divided into two sets, a training set and a testing set, and learning proceeds in two phases. In the training phase, the training set is fed into a specified model and trained for a certain number of iterations to produce the trained model. The trained model is then tested on the test set to ensure that it operates properly. At recognition time, the user’s voice is matched against the previously trained patterns until the recognized sentence is produced as text (Trivedi et al., 2018).
- Artificial intelligence approach: it is based on the use of main knowledge sources, such as acoustic knowledge from spectral measurements, semantic knowledge of meanings, and syntactic knowledge of words.
Figure 5 shows a typical speech recognition system.
Figure 5: Typical Speech Recognition System (Samudravijaya, 2002)
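Since the acoustic-phonetic approach above rests on the HMM, the following is a minimal sketch of Viterbi decoding over two made-up phoneme states; all probabilities and observation classes are invented for illustration, and a real recognizer has thousands of states trained from data.

```python
import numpy as np

# Toy Viterbi decoding over two hidden phoneme states. All numbers
# are invented for illustration.
states = ["AH", "B"]                  # hidden phoneme states
obs = [0, 1, 1]                       # observed frequency-segment classes
start = np.array([0.6, 0.4])          # P(first phoneme)
trans = np.array([[0.7, 0.3],         # P(next phoneme | current phoneme)
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],          # P(observation | phoneme)
                 [0.2, 0.8]])

# Dynamic programming: best path probability ending in each state.
v = start * emit[:, obs[0]]
backpointers = []
for o in obs[1:]:
    scores = v[:, None] * trans            # scores[i, j]: from state i to j
    backpointers.append(scores.argmax(axis=0))
    v = scores.max(axis=0) * emit[:, o]

# Trace back the most likely phoneme sequence.
path = [int(v.argmax())]
for bp in reversed(backpointers):
    path.append(int(bp[path[-1]]))
path.reverse()
print([states[i] for i in path])           # ['AH', 'B', 'B']
```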
- User-Speaker Interaction
Amazon Echo Plus has powerful microphones. Before the device can be activated, the microphone is always on, waiting for the wake word “Alexa” (Jakob and Wilhelm, 2020). Figure 6 shows the voice processing system. The microphones in the Echo Plus convert the voice signal, which is a continuous (analog) signal, into a digital signal. Converting an analog signal to a digital one involves three stages (a small sketch follows figure 6):
- Sampling: samples are taken at equal time intervals, at a rate called the sampling frequency. By Nyquist’s theorem, the sampling frequency must be at least twice the maximum frequency of the input signal.
- Quantization: the second step assigns a numerical value to each voltage level. The process searches, among a specific number of possible values covering the whole amplitude range, for the value closest to the signal amplitude. The number of quantization levels must be a power of 2 (such as 128, 256, ...).
- Coding: after the quantizer identifies the closest discrete value, a binary numerical value is assigned to each discrete value. Quantization and encoding cannot be entirely exact and only provide an approximation of the real values; the higher the quantizer’s resolution, the closer this approximation is to the real value of the signal (Pandey, 2019).
Figure 6: Voice Processing System (Abdullah et al., 2019)
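A small sketch of the three stages on a synthetic tone is shown below; the tone frequency, sampling rate, and 8-bit resolution are illustrative choices, not Echo hardware parameters.

```python
import numpy as np

# Sketch of the three ADC stages on a synthetic 100 Hz tone.
f_signal = 100.0   # highest frequency in the "analog" input (Hz)
fs = 1000.0        # sampling rate; satisfies Nyquist: fs >= 2 * f_signal
bits = 8           # quantizer resolution: 2**8 = 256 levels

# 1) Sampling: take values at equal time intervals of 1/fs seconds.
t = np.arange(0, 0.01, 1 / fs)
analog = np.sin(2 * np.pi * f_signal * t)      # amplitudes in [-1, 1]

# 2) Quantization: snap each sample to the nearest of the 256 levels.
levels = 2 ** bits
quantized = np.round((analog + 1) / 2 * (levels - 1)).astype(int)

# 3) Coding: represent each level as a binary number.
encoded = [format(q, "08b") for q in quantized]
print(encoded[:2])  # ['10000000', '11001010']
```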
According to Abdullah et al. (2019), the audio is first preprocessed to remove noise and then passed to the signal processing phase. Preprocessing involves applying a low-pass filter to remove noise from the voice background. A low-pass filter is a frequency filter that passes signals below a cutoff frequency and blocks signals above it, as shown in figure 7.
Signal processing, on the other hand, is a major component of voice processing: it captures the most important part of the input signal. Its core operation is the Fast Fourier Transform (FFT). The Fourier transform takes a signal from the time domain to the frequency domain, and the FFT is an algorithm that computes this discrete transform faster than computing it directly (Maklin, 2019). Taking the FFT and its magnitude produces a frequency-domain representation of the audio called the magnitude spectrum (a short sketch follows figure 7).
Figure 7: Ideal Low Pass Filter (Obeid et al., 2017)
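The following is a short sketch of that step: computing the magnitude spectrum of a noisy tone with NumPy’s FFT. The 50 Hz tone and the noise level are invented for illustration.

```python
import numpy as np

# Magnitude spectrum of a noisy 50 Hz tone; parameters are illustrative.
fs = 1000                                  # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 50 * t) + 0.2 * rng.standard_normal(t.size)

spectrum = np.fft.rfft(signal)             # time domain -> frequency domain
magnitude = np.abs(spectrum)               # the magnitude spectrum
freqs = np.fft.rfftfreq(signal.size, 1 / fs)

print(freqs[magnitude.argmax()])           # 50.0, the tone frequency
```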
- Ethics and Policy
Intelligent systems, including Internet of Things (IoT) systems, manage very large amounts of personal data, something unknown to many users with limited experience. These devices also control most home appliances, such as air conditioning, lighting, and washing machines, which makes this type of system questionable in terms of security and privacy (Rak et al., 2020). One of the main reasons that keep users from adopting IoT systems more widely is that they collect, process, and share personal user data with other parties; many IoT systems collect user data without the users’ knowledge or consent (Thorburn et al., 2019).
A woman from Oregon discovered that her smart assistant had recorded a conversation between her and her husband and sent the recording to one of her phone contacts. Many such violations have led to the adoption of privacy regulations such as the European General Data Protection Regulation (GDPR). The GDPR is a European Union law whose main aim is to give people control over their personal data and to prevent that data from being shared without their consent. It consists of provisions and requirements relating to the processing of the personal data of people located in the EU (Thorburn et al., 2019).
To this end, Echo Plus constantly listens for its wake word “Alexa” and starts working when it thinks it has heard it; only then does it begin recording the voice and receiving commands, which is indicated by the blue light of the ring at the top of the device. Otherwise, it records nothing while waiting for the wake word. Amazon encrypts the audio recordings that Alexa uploads, and users can delete these audio files at any time. Amazon Echo Plus also lets the user switch the microphone off by pressing the mute button, preventing the device from hearing anything, even the wake word; the ring then turns red (Crist and Gebhart, 2018).
Conclusion
This paper discussed one of the most popular smart speakers, the Amazon Echo Plus. It explained how the device works and what its main components are. The main discussion points and concepts were tackled, including natural language processing, converting speech to text, and converting text to speech. Finally, the paper elaborated on ethics and how the device tries to provide more privacy for users.
Bibliography
- Abdullah, H., Garcia, W., Peeters, C., Traynor, P., Butler, K. R., & Wilson, J. (2019). Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems. arXiv:1904.05734v1.
- Amazon Alexa (n.d.). Build an Engaging Alexa Skill Tutorial. Retrieved from https://developer.amazon.com/en-US/alexa/alexa-skills-kit/get-deeper/tutorials-code-samples/build-an-engaging-alexa-skill/module-2.
- Crist, R., & Gebhart, A. (2018, September 21). Retrieved from https://www.cnet.com/home/smart-home/amazon-echo-alexa-everything-you-need-to-know/.
- Gonfalonieri, A. (2018, November 21). How Amazon Alexa works? Your guide to Natural Language Processing (AI). Retrieved from towards data science: https://towardsdatascience.com/how-amazon-alexa-works-your-guide-to-natural-language-processing-ai-7506004709d3.
- Isewon, I., Oyelade, J., & Oladipupo, O. (2014). Design and Implementation of Text To Speech Conversion for Visually Impaired People. International Journal of Applied Information Systems (IJAIS).
- Alexa Developer (n.d.). Build an Engaging Alexa Skill Tutorial. Retrieved from https://developer.amazon.com/en-US/alexa/alexa-skills-kit/get-deeper/tutorials-code-samples/build-an-engaging-alexa-skill/module-1.
- Jakob, D., & Wilhelm, S. (2020). Amazon Echo: A Benchmarking Model Review. Retrieved from https://www.researchgate.net/profile/Sebastian-Wilhelm/publication/343280283_Amazon_Echo_A_Benchmarking_Model_Review/links/5f21125ba6fdcc9626bc9691/Amazon-Echo-A-Benchmarking-Model-Review.pdf.
- Maklin, C. (2019, December 19). Fast Fourier Transform. Retrieved from https://towardsdatascience.com/fast-fourier-transform-937926e591cb.
- Obeid, H., Khettab, H., Marais, L., & Hallab, M. (2017). Evaluation of Arterial Stiffness by Finger-Toe Pulse Wave Velocity: Optimization of Signal Processing and Clinical Validation. Journal of Hypertension. DOI:10.1097/HJH.0000000000001371.
- Pandey, H. (2019, November 25). Analog to Digital Conversion. Retrieved from https://www.geeksforgeeks.org/analog-to-digital-conversion/.
- Rak, M., Salzillo, G., & Romeo, C. (2020). Systematic IoT Penetration Testing: Alexa Case Study. Italian Conference on Cyber Security, (pp. 190-200). Ancona.
- Ralevic, U. (2018, July 24). How To Build A Custom Amazon Alexa Skill, Step-By-Step: My Favorite Chess Player. Retrieved from https://medium.com/crowdbotics/how-to-build-a-custom-amazon-alexa-skill-step-by-step-my-favorite-chess-player-dcc0edae53fb.
- Ries, N., & Sugihara (2018, December 10). Robot revolution: Why technology for older people must be designed with care and respect. Retrieved from https://theconversation.com/robot-revolution-why-technology-for-older-people-must-be-designed-with-care-and-respect-71082.
- Robinson, H., MacDonald, B., & Broadbent, E. (2014). The role of healthcare robots for older people at home: A review. International Journal of Social Robotics, 6(4), 575-591.
- Samudravijaya, K. (2002). Automatic Speech Recognition. Tata Institute of Fundamental Research. Retrieved from http://www.iitg.ac.in/samudravijaya/tutorials/asrTutorial.pdf.
- Thorburn, R., Margheri, A., & Paci, F. (2019). Towards an integrated privacy protection framework for IoT: contextualising regulatory requirements with industry best practices. DOI:10.1049/cp.2019.0170.
- Trivedi, A., Pant, N., Pinal, P., Sonik, S., & Agrawal, S. (2018). Speech to text and text to speech recognition systems-A review. IOSR Journal of Computer Engineering (IOSR-JCE), 36-43.
- Wazeed (2018, June 6). JavaScript JSON. Retrieved from https://www.geeksforgeeks.org/javascript-json/.