One thing I have been interested in for a while is how devices like Amazon Alexa, Google Home, or Siri take in our words, process them into text, and then provide us with answers. The Computer Science Crash Course video explained that the acoustic signals of words are captured by a computer’s microphone. This signal is the magnitude of displacement of a diaphragm inside the microphone as sound waves cause it to oscillate. That gives us graphable data: the horizontal axis represents time and the vertical axis is the magnitude of displacement (amplitude). The sound pieces that make up words are called phonemes. Speech recognition software knows what all these phonemes look like because English has only roughly 44 of them, so the software essentially tries to pattern match. To separate words from one another, figure out when sentences begin and end, and convert speech into text, techniques include labeling words with parts of speech and constructing a parse tree (which not only tags every word with a likely part of speech, but also reveals how the sentence is constructed).
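To make the parse-tree idea concrete, here is a minimal sketch using NLTK's chart parser. The toy grammar and sentence are my own invention (real systems learn far richer grammars than this), but it shows how a parse both tags each word and reveals how the sentence is put together:

```python
# A tiny hand-written grammar, just to illustrate what a parse tree captures.
import nltk

grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> Det N
  VP  -> V NP
  Det -> 'the'
  N   -> 'dog' | 'ball'
  V   -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the ball".split()):
    # Each word gets a part-of-speech label (Det, N, V), and the tree shows
    # how those words group into noun phrases and a verb phrase.
    tree.pretty_print()
```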
“You shall know a word by the company it keeps,” as the linguist J.R. Firth put it. But to make computers understand distributional semantics, we have to express the concept in math. One simple technique is to use count vectors. A count vector records the number of times a word appears in the same article or sentence as other common words. But an issue with count vectors is that we have to store a LOT of data, like a massive list of every word we’ve ever seen in the same sentence, and that’s unmanageable. To try to solve this problem, we use an encoder-decoder model: the encoder tells us what we should think and remember about what we just read, and the decoder uses that thought to decide what we want to say or do. To define the encoder, we need a model that can read in any input we give it, e.g. a sentence. To do this, a type of neural network called a Recurrent Neural Network (RNN) was devised. RNNs have a loop in them that lets them reuse a single hidden layer, which gets updated as the model reads one word at a time. Slowly, the model builds up an understanding of the whole sentence, including which words came first or last, which words are modifying other words, and other grammatical properties that are linked to meaning.
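As a rough sketch of what those count vectors look like, here is co-occurrence counting over a tiny corpus I made up (real systems do this over huge text collections, which is exactly where the storage problem comes from):

```python
# Count how often each word appears in the same sentence as every other word.
from collections import defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for w in words:
        for other in words:
            if other != w:
                counts[w][other] += 1  # times `other` co-occurs with `w` in a sentence

# The "count vector" for a word is its row of co-occurrence counts.
print(dict(counts["cat"]))
# {'the': 4, 'sat': 1, 'on': 1, 'mat': 1, 'chased': 1, 'dog': 1}
```

Even on this toy corpus, every new vocabulary word adds another row and another possible column, which is why storing full count vectors quickly becomes unmanageable.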
Stepping away from the more technical side of NLP and the devices we currently use, I wanted to note that I love the idea of a positive feedback loop. Because people say words in slightly different ways due to things like accents and mispronunciations, transcription accuracy is greatly improved when the acoustic analysis is combined with a language model, which captures statistics about which sequences of words are likely. The more we use these devices that try to recognize speech, and the more new accents, mispronunciations, etc. they hear, the better we can train our devices to understand what we are saying. Scary? Maybe. But also cool.
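To show what “statistics about sequences of words” can mean in practice, here is a toy bigram model; the little corpus and the ambiguous-guess example are made up for illustration:

```python
# A bigram language model: count which word tends to follow which.
from collections import Counter, defaultdict

corpus = "turn on the lights please turn off the lights turn up the music".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_prob(prev, nxt):
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

# Given two acoustically similar candidates, the language model prefers
# the sequence it has actually seen before.
print(next_word_prob("the", "lights"))  # 0.67
print(next_word_prob("the", "likes"))   # 0.0
```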
I’m extremely excited to be reading about natural language processing this week, as I loved the intro NLP course I took last semester. One of our later assignments that reminded me of the Crash Course videos and some of the reading was called “Read training data for the Viterbi tagger.” For context, the Viterbi algorithm is essential for POS tagging but is also used in signal processing (decoding cell phone signals), DNA sequencing, and Wi-Fi error correction. Here were the instructions for the assignment:
- Read the training data.
- Split the training file into a list of lines.
- For each line that contains a tab (“\t”), split it by tab to collect the word and part-of-speech tag.
- Use a dictionary to track frequencies for:
  - Each word as each tag
  - Each transition from the last tag to the next tag
  - Sentence-starting probabilities for each tag
- Divide by the total number of words to make probabilities and put them into the same nested dictionary structure used by the Viterbi tagger.
- Now test the tagger:
  - Read the test file.
  - Tag each sequence of words using the Viterbi code.
  - Report in a comment: for how many tokens did the tagger find the right solution?
  - Add an evaluation by sentences: for how many sentences is the tagger 100% correct? (Include code to calculate this and report the accuracy in a comment.)
Here is what my code looked like:
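In rough outline (a simplified sketch with placeholder names and an assumed one-“word tab tag”-per-line file format, not the exact course scaffolding or data), the training pass counts emissions, transitions, and sentence starts, then divides to turn the counts into probabilities:

```python
# Read Viterbi training data: count how often each word appears with each tag,
# how often each tag follows another, and how often each tag starts a sentence,
# then normalize the counts into probabilities.
from collections import defaultdict

def read_training(path):
    emissions = defaultdict(lambda: defaultdict(int))    # emissions[tag][word]
    transitions = defaultdict(lambda: defaultdict(int))  # transitions[prev_tag][tag]
    starts = defaultdict(int)                            # starts[tag]
    prev_tag = None
    total_words = 0

    with open(path, encoding="utf-8") as f:
        for line in f.read().split("\n"):
            if "\t" in line:
                word, tag = line.split("\t")[:2]
                emissions[tag][word] += 1
                if prev_tag is None:
                    starts[tag] += 1              # this tag begins a sentence
                else:
                    transitions[prev_tag][tag] += 1
                prev_tag = tag
                total_words += 1
            else:
                prev_tag = None                   # blank line = sentence boundary

    # Divide by the total number of words, as the instructions describe,
    # keeping the same nested dictionary structure.
    def normalize(nested):
        return {outer: {inner: count / total_words for inner, count in row.items()}
                for outer, row in nested.items()}

    return (normalize(emissions),
            normalize(transitions),
            {tag: count / total_words for tag, count in starts.items()})
```

The test step then runs the Viterbi tagger over each test sentence and compares its output to the gold tags, both token by token and sentence by sentence, to report the two accuracy numbers the assignment asks for.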
Questions:
Google’s version of this is called Knowledge Graph. At the end of 2016, it contained roughly 70 billion facts about, and relations between, different entities… Can you speak more about knowledge graphs, what is necessary to create one, and how they are stored? How does Google use this?
Citations