A quick analysis of Google’s Ngram Viewer

For this week, I selected one of Google’s services that I have used frequently, both out of curiosity and as a way to get general information: Ngram Viewer, which allows users to search for words in books published between 1500 and 2008.

Ngram is one of Google’s most impactful endeavors. It is the result of the digitization of millions of books and the creation of a search tool that scans the material as a whole. According to an article published in Science by researchers involved in the project (Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011), the “corpus” was formed from publications held by over 40 university libraries, with more than 15 million books digitized, corresponding to about 12% of all the books ever published. The researchers then selected 5 million publications (about 4% of all the books ever published) based on the quality of both the metadata providing date and place, which is supplied by publishers and libraries, and the optical character recognition (OCR) results, which indicate how accurately the digitization system recognizes the printed letters and symbols.

[Figure 1 from Michel et al. (2011)]

Source: Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011

To properly interpret the results the tool shows, it is necessary to understand how the platform works. A “gram” is a group of characters, including letters, symbols or numbers, uninterrupted by a space. A gram can be a word, a typo or a numerical representation (bag, bagg, 9.593.040). For instance, “bag” is a 1-gram, while “small bag” is a 2-gram. An n-gram is a sequence composed of “n” such grams. According to the Ngram information page, word search results are circumscribed to the type of gram one is searching for: if the user types a 1-gram, the search is conducted only among 1-grams; the same goes for a 2-gram, and so on.
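To make the distinction concrete, here is a minimal Python sketch (my own illustration, not Google’s code) that splits a sentence into 1-grams and 2-grams and counts each group separately, mirroring the way the platform keeps the gram classes apart:

```python
from collections import Counter

def extract_ngrams(text, n):
    """Split text on whitespace into grams, then group consecutive
    grams into n-grams (e.g. n=2 gives bigrams like "small bag")."""
    grams = text.split()
    return [" ".join(grams[i:i + n]) for i in range(len(grams) - n + 1)]

sample = "the small bag and the big bag"
unigrams = Counter(extract_ngrams(sample, 1))  # counted only against other 1-grams
bigrams = Counter(extract_ngrams(sample, 2))   # counted only against other 2-grams

print(unigrams)  # Counter({'the': 2, 'bag': 2, 'small': 1, 'and': 1, 'big': 1})
print(bigrams)   # Counter({'the small': 1, 'small bag': 1, 'bag and': 1, ...})
```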

In the example given by the Ngram programmers, they search for two 2-grams and one 1-gram at the same time: “nursery school”, “child care”, and “kindergarten”, respectively. What the platform answers is: “… of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are “nursery school” or “child care”? Of all the unigrams, what percentage of them are “kindergarten”?” (Please see the first chart at https://books.google.com/ngrams/info).

Thus, the results depend on the classification of the gram one is searching for. In the case above, the dataset in which “kindergarten” is searched is different from the dataset in which “nursery school” and “child care” are searched.
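As a rough sketch of that calculation, with entirely made-up counts standing in for one year of the corpus, the percentages the chart plots would be computed along these lines:

```python
# Illustrative only: invented counts for a single hypothetical year of the corpus.
total_unigrams_in_year = 1_000_000_000
total_bigrams_in_year = 900_000_000

count_kindergarten = 120_000      # a 1-gram, compared against all 1-grams
count_nursery_school = 45_000     # a 2-gram, compared against all 2-grams

pct_kindergarten = 100 * count_kindergarten / total_unigrams_in_year
pct_nursery_school = 100 * count_nursery_school / total_bigrams_in_year

print(f"kindergarten: {pct_kindergarten:.6f}% of all 1-grams")
print(f"nursery school: {pct_nursery_school:.6f}% of all 2-grams")
```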

On the other hand, beyond the fact that the platform makes it easy to search for classes of words such as adjectives, verbs, nouns, pronouns and adverbs, allowing linguistic comparisons, its capacity to scan books far exceeds human capacity. As Michel et al. explain, “If you tried to read only English-language entries from the year 2000 alone, at the reasonable pace of 200 words/min, without interruptions for food or sleep, it would take 80 years.” (Michel et al., 2011)
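A quick back-of-the-envelope calculation shows what that quote implies about the sheer size of just the year-2000 English portion of the corpus:

```python
# Back-of-the-envelope check of the 80-year figure quoted above.
words_per_minute = 200
minutes_per_year = 60 * 24 * 365              # reading non-stop, no food or sleep

words_read_in_80_years = words_per_minute * minutes_per_year * 80
print(f"{words_read_in_80_years:,} words")    # roughly 8.4 billion words
```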

What is interesting about Ngram is that it builds on hundreds of years of social knowledge stored in university libraries, transforming millions of physical books into a single digital file. Through a combinatorial process that joins physical materials with software such as OCR, a search engine and databases, Ngram makes possible tools that are more than a remediation (Manovich, 2013) of old books and libraries, since a searchable file allows many comparisons and uses that were not possible before. The fact that it is owned by Google and was built on a project of Harvard scholars (Michel et al., 2011) shows that societal conditions and previous knowledge, while not determinant, are fundamental in shaping who will have the chance to reproduce power.

Regarding the limitations of Ngram, an article in Wired (Zhang, Sarah, 2015, The pitfalls of using Google Ngram to study language) shows that the more one uncovers about how it functions, the more caution is advisable. One cannot disregard the fact that optical character recognition (OCR) technologies are not perfect and can introduce errors into the results when the pixels generated while scanning a book are not accurate. Zhang (2015) explains that the typefaces used in some publications can cause confusion between letters (e.g. the archaic long “s”, which resembles “f”), which generates mistakes. Metadata can also contain errors, implying that some information comes from a specific year and place when, in fact, it does not.
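A tiny, hypothetical simulation of that long-s confusion illustrates how such scanning errors distort the counts:

```python
from collections import Counter

# Hypothetical OCR confusion: the archaic long s ("ſ"), common before ~1800,
# closely resembles "f", so the scanner may output "f" instead of "s".
def noisy_ocr(word):
    return word.replace("ſ", "f")

printed_words = ["beſt", "beſt", "caſe", "best", "case"]
recognized = [noisy_ocr(w) for w in printed_words]

print(Counter(recognized))
# Counter({'beft': 2, 'cafe': 1, 'best': 1, 'case': 1})
# "beft" and "cafe" now pollute the 1-gram counts, deflating "best" and "case".
```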

From the point of view of web architecture, Google’s servers are the sole source of the content shown on the Ngram platform. Although the physical books sit in many different places, a user can read a given book on the Google Books platform and then run the search on Ngram, accessing both websites from a computer anywhere. Because it is a proprietary platform, users so far cannot access the raw data, or even a report explaining the number of books, 1-grams and 2-grams searched per year or decade. The more transparency the platform offers, the more uses one can make of such a rich application. At the end of the day, the reality created by Ngram is based on no more than 4% of all books ever published, according to the researchers who pioneered it. We should keep this in mind.

Finally, the centralization of knowledge in one big player has consequences for users’ privacy, which is compromised when their searches are identifiable and added to their profiles to improve ad targeting. I don’t know to what extent this is currently done, but there is no reason to believe it is not the case.