Do Robots Speak In Electric Beeps?: Artificial Intelligence & Natural Language Processing (Alexander MacGregor)


When we think of the term “artificial intelligence”, a certain array of images often comes to mind. Be it humanoid robots doing housework, or sinister machines ruthlessly stepping over humans on their path to dominance, much of the public discourse surrounding this term has been driven by our media and arts, and all the artistic license that comes with them. But if we explore the history of artificial intelligence and its applications, we see that the tasks we have traditionally attempted to offload onto AI are less whimsical, but perhaps just as fundamental to our experience as semiotically capable cognitive beings. In this paper, I will trace this history while focusing on the specific task of natural language processing (NLP), examining the models we use to offload the linguistic capabilities that we, as humans running OS Alpha, obtain at birth.

Brief History of Artificial Intelligence

Although artificial intelligence only became a formal academic discipline at the Dartmouth Conference of 1956, the ideas and concepts that came to shape the field were present as far back as Ancient Greece. It was Aristotle’s attempts to codify “right-thinking” that first laid the groundwork for much of the logic-based framework that AI philosophy resides within (Russell & Norvig, 17). In what was perhaps the first step in the history of AI-related cognitive offloading, the 13th-century Catalan philosopher Ramon Llull conceived of mechanizing the act of reasoning in his Ars generalis ultima. Around 1500, Leonardo da Vinci sketched the design for a mechanical calculator, which was eventually realized in 1623 by the German scientist Wilhelm Schickard, although it is Blaise Pascal’s “Pascaline” calculator, built some twenty years later, that is more widely recognized (Russell & Norvig, 5).

Leonardo da Vinci’s sketch of a mechanical calculator

Around the same time, English philosopher Thomas Hobbes, German philosopher Gottfried Leibniz, and French philosopher Rene Descartes were each advancing the discourse on this topic. In the introduction to his seminal book, Leviathan, Hobbes asks the reader “For seeing life is but a motion of limbs, the beginning whereof is in some principal part within, why may we not say that all automata (engines that move themselves by springs and wheels as doth a watch) have an artificial life? For what is the heart, but a spring; and the nerves, but so many strings; and the joints, but so many wheels, giving motion to the whole body” (Hobbes, 7). Leibniz was attempting to discover a “characteristica universalis”, a formal and universal language of reasoning that would allow all debate and argument to be unambiguously reduced to mechanical operation (Levesque, 257). It is impossible to ignore the impact of Rene Descartes on the formation of this field. While he is perhaps best known for his theory of mind-body dualism, he also developed ideas bearing more directly on automation, such as conceptualizing animals as machines (Russell & Norvig, 1041).

In the 19th and 20th Centuries, we began to see attempts to build machines capable of executing on the ideas promoted by previous philosophers. Charles Babbage’s Difference Engine was an attempt to mechanize computational work previously done by “human computers”. Babbage also designed, but was never able to build, what he called an “Analytical Engine”, which was really the first design for what we now know as a “general purpose computer” (Dasgupta, 27). The code-breaking frenzy of the Second World War provided an environment in which many computational advances were made, and there was perhaps no more influential figure to emerge from this era than Alan Turing. Considered by many to be the father of modern computing, Turing’s work during this era was crucial to the prismatic explosion of AI and computing advancements that we saw in the latter half of the 20th century.

London Science Museum’s model of Charles Babbage’s Difference Engine

Brief History of Natural Language Processing

In 1950, Alan Turing published his paper “Computing Machinery and Intelligence”, which proved to be pivotal in the yet-to-exist field of artificial intelligence. In the paper, Turing contemplated the possibility of building machines that are capable of “thinking”, and introduced the concept of the Turing Test as a way to determine whether a machine was exhibiting such traits to the extent they were indistinguishable from a human being (Dennett, 3-4). When we, as humans, engage in the exchange of ideas, symbols and messages that assure us of our respective intelligence and “personhood”, we do it through the interface of language. It is truly one of the key enablers of our semiotic skill-set. So if we are to create artificial intelligence, then the ability to communicate signs, symbols and meaning through language is a top priority.

Four years after Alan Turing published his seminal paper, IBM and Georgetown University held a joint exhibition to demonstrate their fully automatic machine translation capabilities. Using an IBM 701 mainframe computer, the operator was able to translate Russian sentences into English, drawing on a vocabulary of 250 words and just six grammatical rules (Hutchins, 240). This “breakthrough” was highly publicized, and led the authors and press to make bold predictions about the immediate future of artificial intelligence and machine translation, but the reality of the situation was much less grandiose. The program was only able to seem successful by severely restricting the grammatical, syntactic and lexical possibilities, falling far short of any realistic conception of a truly artificial intelligence.

Newspaper clipping detailing the IBM-Georgetown Machine Translation Experiment

Successes & Limitations

This was, in fact, the story of most of the NLP attempts made during the early days of AI. Although successful within their narrow domains, programs like Daniel Bobrow’s STUDENT, designed to solve high-school-level algebraic word problems, and Joseph Weizenbaum’s ELIZA, famously used to simulate the conversation of a psychotherapist, were still operating within very strict linguistic constraints. ELIZA, for example, was not capable of analyzing the syntactic structure of a sentence or deriving its meaning, two elements that are crucial for true language comprehension. Nor was it able to extrapolate from its linguistic inputs to explore its environment. ELIZA was, in fact, the first chatterbot, designed only to respond to certain keyword inputs with a pre-set answer. (Bermudez, 31)

The limitations of these early NLP systems gave rise to the micro-world approach of MIT AI researchers Marvin Minsky and Seymour Papert, most famously exhibited in fellow MIT researcher Terry Winograd’s SHRDLU program. Micro-worlds were problems that would require real intelligence to solve, but were relatively simple and limited in scope (Russell & Norvig, 19-20). Papert initially saw micro-worlds as a way to connect computing to the hard sciences, where simple models were often used to derive fundamental scientific principles (Murray, 430). Winograd’s SHRDLU program put this approach to the test, and was one of the earliest attempts to get machines to do true natural language processing, meaning the system would “report on its environment, plan actions, and reason about the implications of what is being said to it” (Bermudez, 32). SHRDLU was a success and prompted a great deal of excitement around the field of NLP, but because it depended so heavily on syntactic analysis and operated within a micro-world, many of the same limitations that beset the early machine translation attempts were present in SHRDLU (Russell & Norvig, 23). The simplicity of the micro-world constraints meant SHRDLU’s language was correspondingly simple: it could only talk about the events and environments of the micro-world it inhabited.

The micro-world of Terry Winograd’s SHRDLU program

Even with these constraints, SHRDLU did contribute to three major advancements in the evolution of NLP. Firstly, it displayed that the conceptual and theoretical rules of grammar could be practically executed in an NLP program. Secondly, it showcased the approach of breaking down cognitive systems into distinct components, each executing a specific information-processing task. Thirdly, it was built on the notion of language as an algorithmic process (Bermudez, 33). These factors would set the stage for how NLP programs would be built moving forward.

In many ways, machine translation and NLP followed a similar historical trajectory as speech recognition attempts. The early excitement prompted by the information theory and word-sequencing models of the 1950s would be tempered in favour of the highly knowledge-intensive and specific micro-world approach of the 1960s. The 1970s and 1980s saw a push for commercialization of these previously academically restricted programs, but also, more importantly, the rise of the neural network approach. (Russell & Norvig, 26)

This prompted a civil war of sorts between the “connectionist” camp advocating for approaches like neural networks, the “symbolic” camp advocating symbol manipulation as the best frame to understand and explore human cognition, and the “logicist” camp advocating a more mathematical approach. Even to this day there has been no truly definitive resolution to this conflict, but the modern view is that the connectionist and symbolic frameworks are complementary, not incompatible. (Russell & Norvig, 25)

Enter Stage Right: Artificial Neural Networks

When looking at the modern natural language processing landscape, one sees that artificial neural networks (ANNs), particularly recurrent neural networks, are the en vogue computational approach (Conneau, Schwenk, Barrault & LeCun, 1). The approach was originally inspired by the mid-20th-century neuroscience discovery that mental processes consist of electrochemical activity in networks of brain cells called neurons, and artificial intelligence researchers aimed to model their systems after this architecture (Russell & Norvig, 727). Two of the most prominent early advocates of this method were Warren McCulloch and Walter Pitts, who in 1943 designed a mathematical and algorithmic computational model for these neural networks. Yet a lack of research funding, together with an influential 1969 book in which Minsky and Papert detailed the limitations of the machines then being used to run neural networks, sidelined the approach until the late 1980s, when the back-propagation learning algorithm first discovered in 1969 by Arthur Bryson and Yu-Chi Ho was reinvented (Russell & Norvig, 761). The very influential textbook “Parallel Distributed Processing: Explorations in the Microstructure of Cognition” by David Rumelhart and James McClelland also helped to reinvigorate the AI community’s interest in this approach.

Model of a feedforward neural network

Yet a lack of computational power meant other machine learning methods, such as linear classifiers or support vector machines, would take precedence over neural networks. That remained the case until the computational hardware landscape evolved to a state where technologies such as GPUs, and approaches like distributed computing, made it possible to deploy neural networks on the scale necessary to handle tasks like natural language processing.

So How Exactly Do These Neural Networks Work?

In a nutshell, a neural network is a collection of connected computational units, each of which “fires” an output when its inputs cross a predefined hard or soft threshold. (Russell & Norvig, 727-728) The earliest models were designed with only one or two layers, but they ran into limitations when it came to approximating basic cognitive tasks. Later models would solve this problem by adding layers of “hidden” units and giving the nets the ability to adjust the connection weights.
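
To make this concrete, here is a minimal sketch in Python of a hard-threshold unit and of the classic problem that motivated hidden layers. The weights and thresholds are illustrative choices of mine, not drawn from any particular historical system:

```python
# A single hard-threshold unit: it "fires" (outputs 1) only when the
# weighted sum of its inputs reaches the threshold.
def unit(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# With suitable weights, a single unit can compute AND or OR...
AND = lambda a, b: unit([a, b], [1, 1], 2)
OR = lambda a, b: unit([a, b], [1, 1], 1)

# ...but no single unit can compute XOR (the kind of limitation Minsky
# and Papert highlighted). A layer of "hidden" units solves it:
def XOR(a, b):
    h1 = OR(a, b)                       # hidden unit 1
    h2 = AND(a, b)                      # hidden unit 2
    return unit([h1, h2], [1, -1], 1)   # fires for "OR but not AND"
```

Learning, in this picture, is the process of adjusting the weights (here hand-picked) until the network’s firing pattern matches the desired behaviour.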

What Does Linguistics Have To Say About All This? 

Because the brain is a massively parallel organ whose neurons apparently work independently of one another, artificial neural networks have been used as an approach to computationally offload many of the cognitive functions the brain performs, such as pattern recognition, action planning, processing and learning new information, and using feedback to improve performance (Baars, 68). Language is an inherently symbolic activity, so if we are to offload the task of natural language processing to artificial intelligence, the capability of neural nets to be translated into symbolic form, and of symbolic forms to be translated back into neural nets, makes this approach very attractive.

In addition to being symbolic, language is also a practical, rule-governed activity. It was Noam Chomsky, often considered the father of modern linguistics, who first attempted to discover why language operates in the manner it does (Bermudez, 16). In his groundbreaking book Syntactic Structures, Chomsky draws a distinction between the deep structure and the surface structure of a sentence. The former refers to how the basic framework of a sentence is governed by phrase-structure rules operating at the level of syntactic elements such as verbs, adjectives and nouns. The latter refers to the organization of the words in a sentence, which must abide by the sentence’s deep structure. The important point to note here is that language is conceived of as hierarchical, algorithmic, and rule-based. The rules extend not only to grammar and syntax, but also to individual words and contextual meaning.
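
As a toy illustration of phrase-structure rules, a handful of rewrite rules can generate sentences whose surface word order is constrained by an underlying syntactic skeleton. The grammar below is invented for this sketch, not taken from Chomsky:

```python
import random

# Toy phrase-structure grammar (illustrative only). Each rule rewrites a
# syntactic category; words appear only at the leaves, so the sentence's
# structural skeleton is fixed before any particular word is chosen.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Adj", "Adj", "N"]],
    "VP":  [["V", "Adv"]],
    "Adj": [["colorless"], ["green"]],
    "N":   [["ideas"]],
    "V":   [["sleep"]],
    "Adv": [["furiously"]],
}

def generate(symbol):
    """Expand a category top-down into a list of words."""
    if symbol not in GRAMMAR:            # a terminal word
        return [symbol]
    words = []
    for part in random.choice(GRAMMAR[symbol]):
        words.extend(generate(part))
    return words
```

Every sentence this grammar produces is grammatically well-formed, yet, as with “colorless green ideas sleep furiously”, nothing in the rules guarantees that it means anything: syntax and semantics come apart.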

Chomsky’s famous grammatically correct, yet semantically unintelligible sentence.

Adding onto Chomsky’s insights was his student at MIT in the 1960s, Ray Jackendoff, whose “parallel architecture” linguistic model sought to debunk the syntactocentric models of old and promote a framework positing the independent generativity of the semantic, phonological and syntactic elements of language (Jackendoff, 107). From Jackendoff, we can conceptualize language as a combinatorial structure in which elements work in a parallel fashion to produce expression and meaning. Again, a processing architecture is at the basis of this framework.

Jackendoff’s model of Parallel Architecture

ANNs and NLP, Live Together in Perfect Harmony? 

While artificial neural networks do not have linguistic rules inherently built into them, as the human brain is thought to, they have been shown to be capable of modeling complex linguistic skills. The simple recurrent neural networks designed by Jeff Elman have been successfully trained to predict the next letter in a series of letters, and the next word in a series of words. Studies by developmental psychologists and psycholinguists examining the patterns children display when they learn languages have shown that in many features of language acquisition, human beings follow a very archetypal trajectory. One example is making similar types of grammatical construction mistakes at similar learning stages. When artificial neural network researchers analyzed the degree to which their models could reproduce these features of language processing, they found similarities between how the neural networks learn and how children learn. (Bermudez, 246)
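
The flavour of Elman’s next-letter task can be sketched in a few dozen lines. This is a hedged, minimal reconstruction, not Elman’s actual architecture or data: the sequence, layer sizes, and the one-step gradient truncation are all simplifying assumptions of mine.

```python
import numpy as np

# A minimal Elman-style simple recurrent network (SRN) trained to predict
# the next letter. The hidden state feeds back into itself, acting as the
# "context" layer that lets the net use history, not just the current letter.
rng = np.random.default_rng(0)
letters = "abc"
seq = "abacabacabac"       # the letter after "a" depends on context
V, H = len(letters), 8     # vocabulary size, number of hidden units

W_xh = rng.normal(0, 0.5, (H, V))   # input  -> hidden
W_hh = rng.normal(0, 0.5, (H, H))   # hidden -> hidden (context loop)
W_hy = rng.normal(0, 0.5, (V, H))   # hidden -> output

def one_hot(c):
    v = np.zeros(V)
    v[letters.index(c)] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def epoch(lr=0.1):
    """One pass over the sequence; gradients truncated to one time step."""
    global W_xh, W_hh, W_hy
    h = np.zeros(H)
    loss = 0.0
    for cur, nxt in zip(seq, seq[1:]):
        x, t = one_hot(cur), one_hot(nxt)
        h_prev = h
        h = np.tanh(W_xh @ x + W_hh @ h_prev)   # context feeds back in
        p = softmax(W_hy @ h)                   # next-letter probabilities
        loss -= np.log(p[letters.index(nxt)])
        dy = p - t                              # cross-entropy gradient
        dh = (W_hy.T @ dy) * (1 - h * h)        # through the tanh
        W_hy -= lr * np.outer(dy, h)
        W_xh -= lr * np.outer(dh, x)
        W_hh -= lr * np.outer(dh, h_prev)       # h_prev treated as constant
    return loss / (len(seq) - 1)

first = epoch()
for _ in range(200):
    last = epoch()
```

No linguistic rule is coded anywhere; with repeated passes the average prediction error falls, and because the context layer carries history, the net can in principle learn that what follows “a” depends on what came before it.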

Verb tense is another specific area in which much research has been conducted testing the natural language processing capabilities of artificial neural networks. While the actual computational process is quite complex, it essentially boils down to the theory that children learn the past tense in three distinct stages. In the first stage, they use only a small number of verbs, primarily in irregular past tenses. In the second stage, the number of verbs in use expands and they formulate the past tense in the “standard stem + -ed” format. In the third stage, as they learn more verbs, they correct their “over-regularization errors”. Where artificial neural nets come in is in their ability to develop a similar learning pathway without needing to have linguistic rules explicitly coded into them. (Bermudez, 247)

*Record Scratch* Let’s Pump The Brakes A Bit

It is important to note at this juncture that artificial neural nets are nowhere close to mirroring the brain’s ability to perform these tasks, nor is that the goal. The aim is to enable machines to engage in natural language processing, regardless of how closely the method resembles the way humans do it. There is no imperative to follow the same rule-based framework for language that humans use: artificial neural networks are not attempts to reconstruct the human brain or even mirror its intricacies, but rather to behave in accordance with the rule-governed aspects of linguistic understanding without explicitly representing those rules. They are simply an approach inspired by this one element of how our brains process information. Compared to the massively complex brain, most of the simulations run through artificial neural nets are relatively small-scale and limited. But for certain cognitive tasks, neural nets have proven more successful than programs using logic and standard mathematics (Baars, 68). The neural network approach provides certain affordances that make computation of this scale and nature more effective, such as the ability to handle noisy inputs, to execute distributed and parallel computation, and to learn. It is not imperative to resolve any conflicts between the way we believe the brain operates and the way neural networks are architected. As Texas A&M University Professor of Philosophy Jose Bermudez states:

 “The aim of neural network modeling is not to provide a model that faithfully reflects every aspect of neural functioning, but rather to explore alternatives to dominant conceptions of how the mind works. If, for example, we can devise artificial neural networks that reproduce certain aspects of the typical trajectory of language learning without having encoded into them explicit representations of linguistic rules, then that at the very least suggests that we cannot automatically assume that language learning is a matter of forming and testing hypotheses about linguistic rules. We should look at artificial neural networks not as attempts faithfully to reproduce the mechanics of cognition, but rather as tools for opening up new ways of thinking about how information processing might work.” (Bermudez, 253-254)

What Does The Future Hold?

The future of natural language processing and artificial intelligence is sure to be shaped by the tech giants currently absorbing research talent at a voracious rate. Companies like Google, Facebook, Microsoft, Amazon, and Twitter have all identified business uses for this technology. For Facebook, it’s their DeepText engine that filters unwanted content from their users’ newsfeeds. Google’s uses for this technology are varied, but include user experience in apps, search, ads, translate and mobile. Microsoft’s research team is looking to this technology to design and build software.

This corporate takeover has not been without concern. For the majority of the history of AI research, universities and public research institutions have been the incubation chambers for breakthroughs, and they have a far more transparent culture than corporations driven by profit maximization and incentivized to harbour trade secrets. In order to assuage this concern, many of these companies have embraced an open-source culture when it comes to their findings. They have encouraged their researchers to publish and share their work (to an extent) with the broader community, under the rationale that a collegial atmosphere will create gains everyone can utilize. Bell Labs and Xerox PARC have become the aspirational models, as it was precisely the accessibility and open environment of those institutions that allowed innovation to thrive.

Xerox PARC’s infamous beanbag chair meetings

This is surely one of the main reasons we’ve witnessed an exodus of academic researchers into these companies. Two of the most prominent names in the field right now are Geoffrey Hinton and Yann LeCun. Hinton, a former University of Toronto professor considered to be the godfather of deep learning, was scooped up by Google to help design their machine learning algorithms. LeCun, a former New York University professor, is now the Director of AI Research at Facebook. Undoubtedly, the extremely large data sets these companies have collected are also a powerful draw, as they allow for training bigger and better models. When asked what he perceives the future of NLP and artificial neural nets to be, Hinton answered:

“For me, the wall of fog starts at about 5 years. (Progress is exponential and so is the effect of fog so its a very good model for the fact that the next few years are pretty clear and a few years after that things become totally opaque). I think that the most exciting areas over the next five years will be really understanding videos and text. I will be disappointed if in five years time we do not have something that can watch a YouTube video and tell a story about what happened” (Hinton).

A similar question was posed to University of Montreal professor of Computer Science Yoshua Bengio, also considered to be one of the preeminent figures in the field right now, to which he responded:

“I believe that the really interesting challenge in NLP, which will be the key to actual “natural language understanding”, is the design of learning algorithms that will be able to learn to represent meaning” (Bengio).

Where Does “Meaning” Fit Into The Equation? 

If meaning-making is the ultimate purpose of language, then the true holy grail of natural language processing through artificial neural networks is unsupervised learning. The majority of current models employ a supervised learning technique, meaning the network is “told” by its designers what mistakes and errors it is making (Bermudez, 220). With unsupervised learning, the training wheels come off and the network receives no supervisory external feedback, learning on its own (Arbib, 1183). According to University of California, Berkeley Professor Michael I. Jordan, one of the leading researchers in the fields of machine learning and artificial intelligence, unsupervised learning is “presumably what the brain excels at and what’s really going to be needed to build real ‘brain-inspired computers’” (Jordan).
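
The contrast between the two regimes can be made concrete with a toy sketch. Everything here is an illustrative assumption of mine, not drawn from the sources above: the supervised learner is told the right label for every point and picks the cut that minimizes its errors, while the unsupervised learner (a one-dimensional k-means) must find the structure in the data with no labels at all.

```python
import random

random.seed(0)
# Two well-separated clusters of toy 1-D data points.
data = [random.gauss(0, 1) for _ in range(50)] + \
       [random.gauss(10, 1) for _ in range(50)]
labels = [0] * 50 + [1] * 50   # available only to the supervised learner

def train_threshold(xs, ys):
    """Supervised: use the labels to pick the cut with the fewest errors."""
    best, best_err = None, float("inf")
    for t in xs:
        err = sum((x > t) != y for x, y in zip(xs, ys))
        if err < best_err:
            best, best_err = t, err
    return best

def kmeans2(xs, iters=10):
    """Unsupervised: find two cluster centers from the data alone."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return sorted((c0, c1))
```

Both procedures end up separating the two groups, but only the first was ever told what the groups were; the brain, on Jordan’s view, is remarkable precisely because it mostly works in the second mode.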


Journeying through the history of artificial intelligence, we saw just how broad and deep the philosophical roots of this field are. From canonical figures like Aristotle and Descartes to modern heavyweights like Turing and Chomsky, the scope of thinkers contributing to artificial intelligence advancements is proof positive of its interdisciplinary nature. The problems posed by the quest to cognitively offload key human faculties require answers drawing from such diverse fields as computer science, neurology, linguistics, mathematics, physics, and engineering. Out of all the cognitive tasks we have attempted to offload to AI, natural language processing is perhaps the most important. As the renowned cognitive scientist, linguist and psychologist Steven Pinker has stated:

“For someone like me, language is eternally fascinating because it speaks to such fundamental questions of the human condition. Language is really at the center of a number of different concerns of thought, of social relationships, of human biology, of human evolution, that all speak to what’s special about the human species. Language is the most distinctively human talent. Language is a window into human nature, and most significantly, language is one of the wonders of the natural world.” (Big Think)

It is only natural that in the quest to technologically mediate this uniquely human skill, we looked to our own brain for inspiration. But while certain neurological features have surely inspired artificial neural networks, the dominant natural language processing model, AI designers, researchers and architects are not bound by them. The goal is to get computational machines to process natural language. How one gets there is relatively inconsequential. Due to the exponential increase in size and quality of the data sets used to train artificial neural nets, we are sure to see some exciting advances in natural language processing over the next few years, but as of now, the ultimate goal of a “strong AI” capable of dealing with the concept of linguistic meaning remains behind the “wall of fog”.

Works Referenced

  1. Russell, Stuart J., and Peter Norvig. Artificial Intelligence: A Modern Approach. Third ed. Upper Saddle River, NJ: Prentice Hall, 2010.
  2. Hobbes, Thomas. Leviathan. Urbana, Illinois: Project Gutenberg, 2002. Web. 2 December, 2016.
  3. Levesque, Hector J. “Knowledge Representation and Reasoning.” Annual Review of Computer Science 1 (1986): 255-87. Web. 6 Dec. 2016.
  4. Dasgupta, Subrata. It Began With Babbage: The Genesis of Computer Science. Oxford: Oxford UP, 2014.
  5. Dennett, Daniel C. Brainchildren: Essays On Designing Minds. Cambridge, MA: MIT, 1998.
  6. Hutchins, John. “From First Conception to First Demonstration: the Nascent Years of Machine Translation, 1947-1954. A Chronology.” Machine Translation, vol. 12, no. 3, 1997, pp. 195–252. Web. 5 Dec. 2016.
  7. Bermúdez, José Luis. Cognitive Science: An Introduction to the Science of the Mind. Cambridge: Cambridge UP, 2010.
  8. Murray, Janet. Inventing the Medium. Cambridge, MA: MIT, 2012.
  9. Conneau, Alexis, Holger Schwenk, Loïc Barrault, and Yann LeCun. “Very Deep Convolutional Networks for Natural Language Processing.” ArXiv: Computation and Language, 2016. Web. 15 Dec. 2016.
  10. Jackendoff, Ray. Foundations of Language: Brain, Meaning, Grammar, Evolution. Oxford: Oxford UP, 2002.
  11. Baars, Bernard J., and Nicole M. Gage. Cognition, Brain, and Consciousness: Introduction to Cognitive Neuroscience. Burlington, MA: Academic/Elsevier, 2010.
  12. geoffhinton [Geoffrey Hinton]. “AMA Geoffrey Hinton.” Reddit, 10 Nov. 2014, Accessed 13 Dec. 2016.
  13. yoshua_bengio [Yoshua Bengio]. “AMA: Yoshua Bengio.” Reddit, 24 Feb. 2014. Accessed 13 Dec. 2016.
  14. michaelijordan [Michael I. Jordan]. “AMA: Michael I. Jordan.” Reddit, 11 Sep. 2014, Accessed 13 Dec. 2016.
  15. Big Think. “Steven Pinker: Linguistics as a Window to Understanding the Brain.” Online video clip. YouTube, 6 October 2012. Web. 15 Dec. 2016.
  16. Arbib, Michael A. Handbook of Brain Theory and Neural Networks. 2nd ed. Cambridge, MA: MIT, 2003.
  17. Wilson, Robert A., and Frank C. Keil. The MIT Encyclopedia Of The Cognitive Sciences. Cambridge, MA: MIT, 1999.
  18. Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT, 2012.
  19. Marcus, Gary F. The Algebraic Mind: Integrating Connectionism and Cognitive Science. Cambridge, MA: MIT, 2001.
  20. Frankish, Keith, and William M. Ramsey. The Cambridge Handbook of Cognitive Science. Cambridge: Cambridge UP, 2012.