The traditional voice of a computer – that electronic, crackling voice in which you can almost hear the circuitry – is a Frankenstein of phonemes that a program pieces together to create the illusion of speech. By stitching together these phoneme mashups, computers could speak to the user. But this interface feels markedly inorganic. When humans speak to each other, the intonation and tone they use to inflect a phoneme change with the phoneme’s context. The “R” sound I make when I exclaim “Great!” sounds different from the “R” that concludes the interrogative “Can you hand over that weed-whacker?”
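The piecing-together described above can be sketched in a few lines. This is a toy illustration, not any real speech engine: the "phoneme bank" is hypothetical, and each clip is faked as a short sine tone standing in for a recorded sound. The point is that the same stored “R” gets reused everywhere, context be damned.

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second

def fake_phoneme(freq_hz, dur_s=0.1):
    """Stand-in for a pre-recorded phoneme clip: a short sine tone."""
    t = np.linspace(0, dur_s, int(SAMPLE_RATE * dur_s), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t)

# Hypothetical phoneme bank: one fixed clip per phoneme symbol.
PHONEME_BANK = {
    "G": fake_phoneme(200),
    "R": fake_phoneme(300),
    "EY": fake_phoneme(400),
    "T": fake_phoneme(500),
}

def speak(phonemes):
    """Concatenate the stored clips end to end -- the same 'R' every time,
    which is exactly why the result sounds robotic."""
    return np.concatenate([PHONEME_BANK[p] for p in phonemes])

audio = speak(["G", "R", "EY", "T"])  # "great"
print(len(audio))  # 4 clips x 0.1 s x 16,000 samples/s = 6400 samples
```

Because `speak` just looks clips up and pastes them together, the “R” in “Great!” is bit-for-bit identical to the “R” in “weed-whacker” – no intonation, no context.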
With a digitized catalogue of every phoneme, you might imagine that the problem of conversationally interfacing with computers would be solved. In my mind, digitally “hearing” and synthesizing a voice would be the most difficult part of a conversational interface. But of course, that is only how it looks from my human vantage point. Humans have a capacity to intuitively understand language that is often taken for granted. When I read a sentence, I don’t have to sit for a second to parse the relationship each word has with the sentence as a whole, and then consider how that sentence interacts with the sentences and world around it. I just intuitively understand the sentence.
Early chatbots and vocalizers could give the illusion of “natural conversation.” But early chatbots were simply branching algorithms whose responses were predetermined and depended on properly formed inputs. To achieve a true conversational interface, computers would need to understand the context and “meaning” of what was said to them.
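The “branching algorithm” style of early chatbots can be sketched as a list of keyword patterns with canned replies – the ELIZA trick, more or less. The rules below are invented for illustration; the machine never understands anything, it only matches strings.

```python
import re

# Toy ELIZA-style chatbot: each rule is a regex plus a canned reply
# template. No meaning, no context -- just pattern matching.
RULES = [
    (re.compile(r"\bhello\b", re.I), "Hello! How are you today?"),
    (re.compile(r"\bi feel (.+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"\bweather\b", re.I), "I cannot see outside, I'm afraid."),
]

def respond(user_input):
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            # Echo captured words back into the template.
            return template.format(*match.groups())
    return "Please, go on."  # fallback when no rule fires

print(respond("I feel tired"))       # -> Why do you feel tired?
print(respond("What is an apple?"))  # -> Please, go on.
```

The illusion holds only as long as the input happens to hit a rule; anything outside the predetermined branches gets a shrug.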
Google uses “The Knowledge Graph” to semantically link information across the web. Instead of existing only as a simple string of letters, “Apple” is now, in Google’s eyes, a symbol linked to “seeds” and “computers” and “pies.”
Image source: Digital Reasoning
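The core idea can be sketched as a graph of entities joined by typed edges. This is a made-up miniature, not Google’s actual data model: “Apple” the string points at two distinct entity nodes, and each node carries its own relationships.

```python
# Toy knowledge graph: entities are nodes, relationships are typed edges.
# The structure is invented for illustration.
knowledge_graph = {
    "Apple (fruit)": {"grows_from": "seeds", "used_in": "pies"},
    "Apple (company)": {"makes": "computers"},
}

# The bare string "Apple" is ambiguous: it maps to multiple entities.
surface_forms = {"Apple": ["Apple (fruit)", "Apple (company)"]}

def related_entities(entity):
    """Everything one hop away from an entity node."""
    return set(knowledge_graph.get(entity, {}).values())

print(related_entities("Apple (fruit)"))    # seeds, pies
print(surface_forms["Apple"])               # both candidate entities
```

The payoff is that a query mentioning pies can steer disambiguation toward the fruit node, because “pies” is reachable from one candidate entity and not the other.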
As the web grew into an increasingly gigantic network of semantic information joined by “meaningful relationships,” the contextual nature of conversationally transmitted data became easier and easier for a computer to understand (Crash Course Video). Learning from a dataset that links “dinosaur” with “extinction” and “triceratops” and “cute” gives natural language processing a better grasp of how humans use language.
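One crude way such links can emerge from raw text is co-occurrence counting: words that keep showing up in the same sentences get associated. The three-sentence corpus below is invented, and real systems are far more sophisticated, but it shows how “dinosaur” ends up linked to “extinction” and “triceratops” with no hand-written rules at all.

```python
from collections import Counter
from itertools import combinations

# Toy corpus -- invented sentences for illustration.
corpus = [
    "the dinosaur faced extinction",
    "the triceratops is a dinosaur",
    "that baby triceratops is cute",
]

# Count how often each pair of words shares a sentence.
cooccur = Counter()
for sentence in corpus:
    words = set(sentence.split())
    for a, b in combinations(sorted(words), 2):
        cooccur[(a, b)] += 1

def linked(word):
    """All words that co-occurred with `word` at least once."""
    return {b if a == word else a
            for (a, b), n in cooccur.items() if word in (a, b)}

print(sorted(linked("dinosaur")))
```

Even this crude count recovers the associations the essay describes: “dinosaur” links to “extinction” and “triceratops,” while “cute” attaches to “triceratops” via the third sentence.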