In general, virtual assistants (VAs) are software agents that can perform tasks or services for an individual based on commands and questions. A VA receives these commands and questions through text, voice (speech recognition), or images. The use of VAs has increased dramatically in the last three years, and many products, particularly those using email and voice interfaces, have entered the market. Apple and Google have installed bases of users on their smartphones; Microsoft has an installed base across Windows-based personal computers, smartphones, and smart speakers; Amazon has an installed base for smart speakers only; and Conversica's engagements are based on email and SMS. In this assignment, I will focus on Apple's speech-based virtual assistant service, branded as Siri. By analyzing how Siri works, I will try to explain how NLP can help convert human commands into tasks that machines can act on.
Siri is speech-activated virtual assistant software that can interpret human speech and respond with a synthesized voice. The assistant uses voice queries, gesture-based control, focus-tracking, and a natural-language user interface to answer questions, make recommendations, and perform actions. Siri does this by delegating requests to a set of internet services. With continued use, the software adapts to each user's individual language usage, searches, and preferences. Like other speech-activated virtual assistants, Siri uses speech recognition and natural language processing (NLP) to receive, process, and answer questions or carry out commands. In what follows, I will try to analyze how this system works.
As mentioned before, NLP and speech recognition are the foundations of virtual assistant design. Four main tasks turn a voice input into a voice output: converting the voice input (a question or command) into text, interpreting the text, making a decision, and converting text back into speech. This cycle repeats for as long as the user keeps asking questions or giving commands. In more technical terms, the virtual assistant (Siri) receives the user's voice input through a microphone. Speech recognition then encodes the voice input and converts it into data the computer can process. Linking speech recognition to complex NLP helps the software work out what the user says, what the user means, and what the user wants to happen. The software then connects with a third party to make a decision, and either carries out the user's command (takes an action) or answers the user's question. The answer is encoded as computer data and sent out as synthesized speech through the device's speaker. The following diagram illustrates the many levels that Siri's complex system consists of.
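The four-step cycle described above can also be sketched in a few lines of Python. This is only a toy illustration under loud assumptions, not Siri's actual implementation: every name in it (speech_to_text, interpret, decide, text_to_speech, handle_utterance, SERVICES) is hypothetical, the "audio" is simply text encoded as bytes, the interpretation step is plain keyword matching rather than real NLP, and the "third-party services" are local stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Intent:
    name: str      # what the user wants done, e.g. "call"
    argument: str  # the rest of the request, e.g. "mom"


def speech_to_text(audio: bytes) -> str:
    """Step 1 (stand-in): pretend the audio is already UTF-8 text."""
    return audio.decode("utf-8").strip().lower()


def interpret(text: str) -> Intent:
    """Step 2 (toy): keyword matching in place of real language understanding."""
    patterns = (
        ("call ", "call"),
        ("what is the weather in ", "weather"),
    )
    for prefix, name in patterns:
        if text.startswith(prefix):
            return Intent(name, text[len(prefix):].rstrip("?"))
    return Intent("unknown", text)


def decide(intent: Intent, services: Dict[str, Callable[[str], str]]) -> str:
    """Step 3: delegate the request to a service, or fall back to an apology."""
    handler = services.get(intent.name)
    if handler is None:
        return "Sorry, I did not understand that."
    return handler(intent.argument)


def text_to_speech(reply: str) -> bytes:
    """Step 4 (stand-in): return bytes instead of synthesizing real audio."""
    return reply.encode("utf-8")


def handle_utterance(audio: bytes, services: Dict[str, Callable[[str], str]]) -> bytes:
    """Run one full cycle: voice in, text, intent, decision, voice out."""
    text = speech_to_text(audio)
    intent = interpret(text)
    reply = decide(intent, services)
    return text_to_speech(reply)


# Hypothetical local functions standing in for third-party internet services.
SERVICES: Dict[str, Callable[[str], str]] = {
    "call": lambda who: f"Calling {who}.",
    "weather": lambda city: f"Looking up the weather in {city}.",
}

if __name__ == "__main__":
    print(handle_utterance(b"Call mom", SERVICES).decode("utf-8"))
```

Each function corresponds to one of the four tasks, so the sketch shows where the real system would place a speech recognizer, an NLP model, a service call, and a speech synthesizer.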
Virtual assistants already have many useful applications and will have more in the future, especially in medicine and in supporting people with physical disabilities. However, the psychological effects of the emotional bonds that users could form with future generations of Siri and similar VAs are alarming. Watching the controversial movie Her and reading about Gatebox made me think deeply about the future social and psychological impact of virtual assistants on the human race. Raising awareness of the design principles behind VAs can help dispel the illusions and hype that companies' marketing campaigns create around their products. Revealing the layers of this innovative system supports Boris Katz's remark that “current AI techniques aren’t enough to make Siri or Alexa truly smart.”