“Alexa, what’s the weather like today?”
As Intelligent Personal Assistants begin to play a more significant role in our daily lives, conversation with a machine is no longer science fiction. But few have bothered to ask how we arrived here. Are the Intelligent Personal Assistants we know – Siri, Cortana, Alexa – inevitable, or did they simply happen to turn out this way? And, in the end, what enables us to communicate with a machine?
Any Intelligent Personal Assistant can be considered a complex system. From the software layer to the hardware layer, a working Intelligent Personal Assistant is the collective effort of many components, both tangible and intangible.
Though a functioning Intelligent Personal Assistant is the product of a larger structure, the most intuitive part, from a user’s perspective, is the back-and-forth of “human-machine interaction”. At present, most technology companies that offer Intelligent Personal Assistant services are trying to make their products more “human-like”. This, again, is an entire project that draws on big data, machine learning (including deep learning), neural networks and other disciplines related to or beyond Artificial Intelligence. But on the user-facing end, there is one subsystem we need to talk about: natural language processing (NLP).
What is NLP?
A conversation between two people can be decomposed into three steps. The first step is to receive the information: generally, our ears pick up sound waves that are generated by some kind of vibration and transmitted through the air.
The second step is to process the information. The acoustic signal that was received is matched against existing patterns in the brain so that it can be assigned a corresponding meaning.
The third step is the output of information. The speaker conveys a message by generating an acoustic signal with their own transducers (the vocal organs) so that it can be picked up at the other end and the conversation can continue.
In “human-machine interaction”, NLP follows a similar pattern, imitating the three-step procedure of human-to-human communication. NLP has been described as “a field of study that encompasses a lot of different moving parts, which culminates in the 10 or so seconds it takes to ask and receive an answer from Alexa. You can think of it as a process of roughly 3 stages: listening, understanding, and responding.”
To handle the different stages of this procedure, Alexa was designed as a system with multiple modules. For the listening part, one of the “front-end” modules picks up the acoustic signal with its sensors when it hears a voice command or “activation phrase”.
This module is connected to the internet wirelessly so that it can send information to the back-end for further processing.
Understanding can also be referred to as the processing part: speech recognition software takes over and helps the computer transcribe the user’s spoken English (or other supported language) into the corresponding text. This procedure is, in effect, the tokenization of the “acoustic wave”, which is not a self-contained medium. Through this transformation, the waves are turned into tokens and strings that a machine can handle. The ultimate goal of this analysis is to turn the text into data. Here comes one of the hardest parts of NLP: natural language understanding (NLU). Because of “all the varying and imprecise ways people speak, and how meanings change with context” (Kim, 2018), this brings in the entire linguistic side of NLP: NLU “entails teaching computers to understand semantics with techniques like part-of-speech tagging and intent classification — how words make up phrases that convey ideas and meaning.” (Kim, 2018)
All of this happens in the cloud, which plays a role loosely analogous to that of the human brain when it deals with natural language.
Once a result is reached, the final stage begins: responding. This is roughly the inverse of natural language understanding, since the data is turned back into text. Now that the machine has a result, two more tasks remain. One is prioritizing, which means choosing the data that is most relevant to the user’s query. This leads to the second task, reasoning, which refers to the process of translating the response concept into a form that humans can understand. Lastly, “Once the natural-language response is generated, speech synthesis technology turns the text back into speech.” (Kim, 2018)
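To make the three stages concrete, here is a minimal sketch in Python. It is not how Alexa is implemented: the audio is assumed to be transcribed already, the keyword lists and canned responses are invented for illustration, and intent classification is reduced to counting keyword matches.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical keyword lists and responses, invented for this illustration.
INTENT_KEYWORDS = {
    "get_weather": {"weather", "forecast", "rain", "temperature"},
    "set_alarm": {"alarm", "wake", "timer"},
}
RESPONSES = {
    "get_weather": "Here is today's weather forecast.",
    "set_alarm": "Okay, your alarm is set.",
}
FALLBACK = "Sorry, I did not understand that."


@dataclass
class Candidate:
    intent: str
    score: int


def listen(utterance: str) -> str:
    """Listening stage (stand-in): the speech is assumed to be text already."""
    return utterance.lower().strip()


def understand(text: str) -> List[Candidate]:
    """Understanding stage: toy intent classification by keyword overlap."""
    tokens = {word.strip(".,?!\"'") for word in text.split()}
    candidates = []
    for intent, keywords in INTENT_KEYWORDS.items():
        score = len(tokens & keywords)
        if score > 0:
            candidates.append(Candidate(intent, score))
    return candidates


def prioritize(candidates: List[Candidate]) -> Optional[Candidate]:
    """Prioritizing: keep the candidate most relevant to the query."""
    return max(candidates, key=lambda c: c.score, default=None)


def respond(best: Optional[Candidate]) -> str:
    """Responding stage: turn the chosen data back into text."""
    if best is None:
        return FALLBACK
    return RESPONSES[best.intent]


def assistant(utterance: str) -> str:
    return respond(prioritize(understand(listen(utterance))))


print(assistant("Alexa, what's the weather like today?"))
```

In a real assistant each of these functions would be a large subsystem of its own, and the final text would be passed on to speech synthesis rather than printed.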
Now that we have a basic understanding of the NLP procedure, we can return to the questions raised at the beginning: why is the NLP part of an Intelligent Personal Assistant designed this way?
We can start with the transducer part of the system. This might seem quite intuitive at first glance: a sensor acting as a transducer is the equivalent of the human ear, picking up acoustic waves as needed. But design questions are involved here: what would be the ideal form for the housing of an Intelligent Personal Assistant?
Because Siri was introduced to the world as a built-in function of the iPhone, it had to fit in a compact mobile device with a screen and only two microphones. This increased portability and flexibility at the cost of reliability.
It is natural for a human to distinguish useful information from background noise. In everyday conversation, this means that we consciously pick up the acoustic waves that are relevant to our own conversation and ignore the others.
When this is applied to the human-machine interaction scenario, error prevention is the direction to go: “rather than just help users recover from errors, systems should prevent errors from occurring in the first place.” (Whitenton, 2017) With the development of speech recognition technology, errors in NLU have dropped dramatically. “But there’s one clear type of error that is quite common with smartphone-based voice interaction: the complete failure to detect the activation phrase. This problem is especially common when there are multiple sound streams in the environment” (Whitenton, 2017)
To tackle this problem, Amazon built dedicated hardware for Alexa: the Echo, which makes voice interaction its top priority. “It includes seven microphones and a primary emphasis on distinguishing voice commands from background noise” (Whitenton, 2017)
NLP and Linguistics
Why is this so important? “Meaning is an event, it happens in the process of using symbols collectively in communities of meaning-making – the meaning contexts, the semantic networks and social functions of digitally encoded content are not present as properties of the data, because they are everywhere systematically presupposed by information users” (Irvine, 2014)
As the very first step in human-machine interaction, the primary condition on the machine side is the ability to properly receive the message from the human side. At the same time, context is very important in discussing human-machine interaction. The purpose of NLP is to generate an experience that is as close as possible to human-to-human communication. As every conversation needs a starting point, a responsive Intelligent Personal Assistant “requires continuous listening for the activation phrase” (Whitenton, 2017) so that it can be less intrusive: in the case of Alexa, one does not need to carry it around or follow any fixed steps to “wake up” the system. The only thing required is a natural verbal signal (“Alexa”) to trigger the conversation.
After the assistant has acquired the information it needs, the whole “black box” that lies beneath the surface starts functioning. As mentioned above, an Intelligent Personal Assistant first sends all the data to the “back-end”. Language is about coding “information into the exact sequences of hisses and hums and squeaks and pops that are made” (Pinker, 2012), so machines then need the ability to recover the information from the corresponding stream of noises.
We can look at one possible methodology that machines could use to decode natural language:
Part-of-speech tagging, or syntax. Before any tagging can happen, the speech must first be transcribed. A statistical speech recognition model could be used here, one that “converts your speech into a text with the help of prebuilt mathematical techniques and try to infer what you said verbally.” (Chandrayan, 2017)
This approach takes the acoustic data and breaks it down into short intervals, e.g. 10–20 ms. “These datasets are further compared to pre-fed speech to decode what you said in each unit of your speech … to find phoneme (the smallest unit of speech). Then machine looks at the series of such phonemes and statistically determine the most likely words and sentences to spoke.” (Chandrayan, 2017)
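The slicing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the audio is assumed to be a plain list of samples, the 16 kHz sample rate and the function name are chosen for the example, and the frames do not overlap (real recognizers commonly use overlapping windows, which this sketch leaves out).

```python
def split_into_frames(samples, sample_rate, frame_ms=20):
    """Cut a sequence of audio samples into consecutive frames of frame_ms milliseconds.

    A trailing partial frame, if any, is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


# One second of placeholder "audio" at an assumed 16 kHz sample rate.
one_second = [0.0] * 16000
frames = split_into_frames(one_second, sample_rate=16000, frame_ms=20)
print(len(frames), len(frames[0]))
```

Each frame would then be compared against stored acoustic patterns to estimate which phoneme it most likely belongs to; that statistical matching is the hard part and is not shown here.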
Next, the machine looks at each individual word and tries to determine its word class, tense and so on, since “NLP has an inbuilt lexicon and a set of protocols related to grammar pre-coded into their system which is employed while processing the set of natural language data sets and decode what was said when NLP system processed the human speech.” (Chandrayan, 2017)
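A toy version of the “inbuilt lexicon” plus grammar rules might look like the following sketch. The lexicon entries, the tag names and the single past-tense rule are all assumptions made for this example; production taggers are statistical and use context to resolve ambiguous words.

```python
# Hypothetical mini-lexicon mapping words to word classes.
LEXICON = {
    "she": "PRON",
    "the": "DET",
    "to": "ADP",
    "station": "NOUN",
    "walk": "VERB",
    "play": "VERB",
}


def tag_word(word):
    """Return (word, word_class, tense) using the lexicon and one suffix rule."""
    if word in LEXICON:
        word_class = LEXICON[word]
        tense = "base" if word_class == "VERB" else None
        return (word, word_class, tense)
    # Grammar rule: a known verb followed by "-ed" is treated as past tense.
    if word.endswith("ed") and LEXICON.get(word[:-2]) == "VERB":
        return (word, "VERB", "past")
    return (word, "UNKNOWN", None)


def tag_sentence(sentence):
    """Lowercase, split on whitespace, strip punctuation, and tag each word."""
    words = [w.strip(".,?!") for w in sentence.lower().split()]
    return [tag_word(w) for w in words if w]


for item in tag_sentence("She walked to the station."):
    print(item)
```

Even this small example shows why the lexicon alone is not enough: any word outside it falls through to UNKNOWN, which is where statistical models trained on large corpora take over.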
Now that we have the foundation for decoding the language by breaking it down, what is the next step? Extracting the meaning. Again, meaning is not a property but an event. In that sense, meaning is not fixed; it changes all the time.
In interpersonal communication, it feels natural to refer constantly to the context and to spot subtle differences.
But at present, an Intelligent Personal Assistant “is primarily an additional route to information gathering and can complete simple tasks within set criteria” (Charlton, 2017). This means that assistants do not yet fully understand users and their intentions.
For instance, when we ask someone for the price of a flight ticket, the response, besides the actual price, could include a follow-up such as asking whether we are going to a certain place or whether we need a price alert for that flight. But we cannot really expect this kind of follow-up from an Intelligent Personal Assistant.
So, let’s go back to interpersonal communication: how do we come up with follow-up responses in the first place? We infer and deduce from experience, connecting things that could be relevant, such as the intention to go somewhere and the action of asking the price of certain flight tickets. If we have a similar expectation of machines, then on the one hand they have to carry out a reasoning process similar to ours to draw the conclusion. On the other hand, they need a pool with an adequate amount of empirical data to draw the conclusion from. The point is that the empirical part differs between individuals, which means the interaction pattern needs to be personalized on top of some general reasoning.
This is not something to be built overnight but rather a long-term initiative: “The technology is there to support further improvements; however, it relies heavily on user adoption … The most natural improvement we expect to see is more personalization and pro-active responses and suggestions.” (Charlton, 2017)
Now that the machine has the “artificial language” in hand, the next step is to translate this language into “meaningful text which can further be converted to audible speech using text-to-speech conversion”. (Charlton, 2017)
This seems to be relatively easy work compared to the natural language understanding part of NLP: “The text-to-speech engine analyzes the text using a prosody model, which determines breaks, duration, and pitch. Then, using a speech database, the engine puts together all the recorded phonemes to form one coherent string of speech.” (Charlton, 2017)
Intelligent Personal Assistant as Metamedium
But if you look into the way many answers are generated, the computer (in the case of an Intelligent Personal Assistant, this would be cloud computing) functions as a metamedium. This is significant in at least two ways.
To begin with, as a metamedium, the Intelligent Personal Assistant “can represent most other media while augmenting them with many new properties” (Manovich, 2013). In the specific case of Alexa, the integration of hardware and software, as well as the synergy that this integration produces, is significant.
Sensors, speakers, wireless modules, the cloud: each of these elements can fulfill specific tasks by itself. But by combining them, the new architecture not only achieves goals that could never have been accomplished by any individual component; the components, in turn, gain new possibilities. For example, sensors empowered by software are able to distinguish specific sounds from ordinary sounds.
Another important aspect is the chemistry generated by the interplay of all the individual components. In the case of the Intelligent Personal Assistant, one of the possibilities is data fusion. In “Software Takes Command”, Manovich offers the following description: “another important type of software epistemology is data fusion – using data from different sources to create new knowledge that is not explicitly contained in any of them.” (Manovich, 2013)
This could be a very powerful tool in the evolution of the Intelligent Personal Assistant: “using the web sources, it is possible to create a comprehensive description of an individual by combining pieces of information from his/her various social media profiles making deductions from them” (Manovich, 2013) This idea is in line with the vision of a more personalized and proactive Intelligent Personal Assistant. If an Intelligent Personal Assistant were granted proper access to user information, and the user were willing to communicate with it, the system could advance rapidly. So the advantage of an Intelligent Personal Assistant with NLP capability as a metamedium is its ability to combine information from both ends (users and social media platforms) in order to reach better decisions.
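As a rough illustration of data fusion in this setting, the sketch below combines two sources that say nothing on their own about price alerts and derives a proactive suggestion from them. Everything in it is hypothetical: the record fields, the sample data and the single fusion rule are invented for the example and do not describe any real assistant.

```python
# Source 1 (hypothetical): what the user asked the assistant by voice.
voice_queries = [
    {"user": "u1", "topic": "flight_price", "destination": "Tokyo"},
    {"user": "u2", "topic": "weather", "destination": None},
]

# Source 2 (hypothetical): what the user shared on a social media profile.
social_profiles = {
    "u1": {"upcoming_leave": "2018-03"},
}


def fuse(user_id, queries, profiles):
    """Derive suggestions that neither source contains explicitly."""
    profile = profiles.get(user_id, {})
    suggestions = []
    for query in queries:
        if query["user"] != user_id:
            continue
        # Fusion rule: a flight price query plus planned time off
        # suggests that the user may want a price alert for that period.
        if query["topic"] == "flight_price" and "upcoming_leave" in profile:
            suggestions.append(
                f"Set a price alert for flights to {query['destination']} "
                f"in {profile['upcoming_leave']}"
            )
    return suggestions


print(fuse("u1", voice_queries, social_profiles))
```

The suggestion only appears when both sources are available for the same user, which is why access to user information and the user’s willingness to share it matter so much for this kind of personalization.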
At the same time, as users become one of the media sources that depict the big picture of user personas, they also benefit from this procedure: “combining separate media sources could also give additional meanings to each of the sources. Considering the technique of the automatic stitching of a number of separate photos into a single panorama” (Manovich, 2013)
The Intelligent Personal Assistant, upon getting the input from users via NLP, could be a mirror and a dictionary to the users at the same time. It both reflects users’ characteristics and enhances the user experience, owing to its nature as a metamedium.
Another question that the metamedium side of the Intelligent Personal Assistant can answer is “why would we need such a system?”. Looking back at the trajectory of technological development, we notice that the evolution of HCI and the “metamedium” ecology around the computer is very much a history of the mutual education of computers and humans.
Before we got used to smartphones with built-in cameras, people questioned the necessity of the idea: why would I need a phone that can take pictures? But now we are used to phones as our primary tools for photography and even handle a great part of media production on them. Again, using smartphones for photo and video editing did not happen until the smartphone as a platform absorbed the camera as an integral unit and hardware development gave the platform the capability to do so. And this trend may have, to a great extent, led to the popularity of social networking services like Instagram and Snapchat.
Similar stories apply to the Intelligent Personal Assistant. When Siri, as the first mainstream Intelligent Personal Assistant, was released back in 2011, the criticisms it received ranged from requiring stiff user commands and lacking flexibility to lacking information on certain nearby places and being unable to understand certain English accents. People doubted the necessity of having such a service on their phones draining the battery. Now, after seven years of progress, not only do we see a boom in Intelligent Personal Assistants, we have grown used to them as well, especially in certain scenarios: when you are cooking and want to set an alarm or pull up a recipe, or when you are driving and want to set the navigation app. An Intelligent Personal Assistant with NLP capability is probably the best solution so far to these former dilemmas.
According to market research conducted by Tractica, “unique active consumer VDA users will grow from 390 million in 2015 to 1.8 billion worldwide by the end of 2021. During the same period, unique active enterprise VDA users will rise from 155 million in 2015 to 843 million by 2021. The market intelligence firm forecasts that total VDA revenue will grow from $1.6 billion in 2015 to $15.8 billion in 2021.” (Tractica, 2016)
(VDA refers to Virtual Digital Assistants)
After this brief discussion of the Intelligent Personal Assistant with a focus on NLP, it is a good time to touch upon an important principle for dealing with Intelligent Personal Assistants. We spent most of the paper talking about NLP and barely touched a fraction of what NLP really is. Yet NLP is only a subsystem in the Intelligent Personal Assistant architecture, which itself is only one representation of a larger discipline: Artificial Intelligence.
So, when talking about the Intelligent Personal Assistant or NLP, we cannot regard them as isolated entities, because doing so ignores the universal connections among systems and subsystems as well as their interdependence: “systems thinking is non-reductionist and non-totalizing in the methods used for developing explanations for causality and agency: nothing in a system can be reduced to single, independent entities or to other constituents in a system.” (Irvine, 2018)
This requires us to put both the Intelligent Personal Assistant and NLP into context. The Intelligent Personal Assistant is the result of the joint work of many subsystems like NLP, and NLP itself is built on the foundation of its own subsystems. None of these units would have achieved what we have now on its own.
After all, graphite and diamond both consist of carbon, just arranged in different structural patterns, yet they end up with totally different characters. When we look at a single point, we simply miss the whole picture.
The Intelligent Personal Assistant is a great representation of Artificial Intelligence in the sense that it creates a tangible platform for humans to interact with. In this setting, NLP as a subsystem provides the Intelligent Personal Assistant with the tools to communicate naturally with its users.
In de-blackboxing NLP, we looked at both the software and hardware layers of NLP, following a step-by-step pattern of listening, understanding, and responding. Across the different layers and steps, all the components, including transducers, the cloud, and voice recognition software, work both independently and collectively to generate the “natural communication” that we experience in real life.
For the methodology part, we regarded the Intelligent Personal Assistant as a metamedium in analyzing its ability and potential to evolve and transform. We also touched upon the basic linguistic elements used in designing the processes of NLP. Finally, complexity and systems thinking were brought in to emphasize that the Intelligent Personal Assistant and NLP are each both a self-contained entity and a part of a larger architecture.
1: Kim, Jessica. “Alexa, Google Assistant, and the Rise of Natural Language Processing.” Lighthouse Blog, 23 Jan. 2018, blog.light.house/home/2018/1/23/natural-language-processing-alexa-google-nlp.
2: Whitenton, Kathryn. “The Most Important Design Principles Of Voice UX.” Co.Design, Co.Design, 28 Apr. 2017, www.fastcodesign.com/3056701/the-most-important-design-principles-of-voice-ux.
3: Irvine, Martin. “Key Concepts in Technology, Week 4: Information and Communication.” YouTube, YouTube, 14 Sept. 2014, www.youtube.com/watch?v=-6JqGst9Bkk&feature=youtu.be.
4: Pinker, Steven. “Steven Pinker: Linguistics as a Window to Understanding the Brain.” YouTube, YouTube, 6 Oct. 2012, www.youtube.com/watch?v=Q-B_ONJIEcE.
5: Chandrayan, Promod. “A Guide To NLP : A Confluence Of AI And Linguistics.” Codeburst, Codeburst, 22 Oct. 2017, codeburst.io/a-guide-to-nlp-a-confluence-of-ai-and-linguistics-2786c56c0749.
6: Charlton, Alistair. “Alexa vs Siri vs Google Assistant: What Does the Future of AI Look like?” Gearbrain, Gearbrain, 27 Nov. 2017, www.gearbrain.com/alex-siri-ai-virtual-assistant-2510997337.html.
7: Manovich, Lev. Software Takes Command. Vol. 5, Bloomsbury, London; New York, 2013.
8: Tractica. “The Virtual Digital Assistant Market Will Reach $15.8 Billion Worldwide by 2021.” Tractica, 3 Aug. 2016, www.tractica.com/newsroom/press-releases/the-virtual-digital-assistant-market-will-reach-15-8-billion-worldwide-by-2021/.
9: Irvine, Martin. “Media, Mediation, and Sociotechnical Artefacts: Methods for De-Blackboxing.” 2018.