Culturomics – Think Outside the Box

by Eric Cruet

In 2011, a group of scientists — mostly in mathematics and evolutionary psychology — published an article in Science titled “Quantitative Analysis of Culture Using Millions of Digitized Books”.  The authors’ technique, called “culturomics,” would “extend the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.”  The authors employed a “corpus” of more than 5 million books — 500 billion words — that had been scanned by Google as part of the Google Books project.  These books, the authors assert, represent about 4 percent of all the books ever published, and allow the kind of statistically significant analysis common to many sciences.

Their main method of analysis is to count the number of times a particular word or phrase (referred to as an n-gram) occurs over time in the corpus (try your own hand at n-grams with Google’s Ngram Viewer).  A ‘one-gram’ plots the frequency of a single word such as “chided” over time; a ‘two-gram’ shows the frequency of a contiguous phrase, such as “touch base” (see ‘Think outside the box’).
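The counting itself is simple in principle. Here is a minimal sketch of the idea in Python — the toy corpus and the helper name `ngram_frequency` are my own illustrations, not the paper’s actual pipeline, which ran over the full Google Books corpus:

```python
from collections import Counter

# Hypothetical toy corpus: year -> text. The real corpus is 500 billion
# words of scanned books; this stands in for it purely for illustration.
corpus = {
    1900: "the board chided the clerk and told him to touch base later",
    1950: "she chided him gently before they agreed to touch base next week",
    2000: "nobody chided anyone but everyone wanted to touch base again",
}

def ngram_frequency(text, phrase):
    """Relative frequency of a contiguous phrase (an n-gram) in a text."""
    words = text.lower().split()
    n = len(phrase.split())
    target = tuple(phrase.lower().split())
    # Slide a window of n words across the text and count matches.
    windows = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not windows:
        return 0.0
    return Counter(windows)[target] / len(windows)

# A 'one-gram' trajectory for "chided" and a 'two-gram' for "touch base":
for year, text in sorted(corpus.items()):
    print(year, ngram_frequency(text, "chided"), ngram_frequency(text, "touch base"))
```

Plotting those per-year frequencies over a real corpus is essentially what the Ngram Viewer does.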

Their full data set includes over 2 billion such “culturomic trajectories”.  One of the examples the authors give is to trace the usage of the year “1865”.  They note that “1865” was not discussed much before the actual year 1865, that it appeared a lot in 1865, and that its usage dropped off after 1865.  They call this evidence of collective memory.
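The paper quantifies this forgetting by fitting the post-peak decline of such trajectories. A minimal sketch of that idea — the frequency values below are made up for illustration, not the paper’s data — fits an exponential decay and reads off a half-life:

```python
import math

# Hypothetical post-peak trajectory for the 1-gram "1865":
# relative frequency by year, starting at the 1865 peak (made-up numbers).
years = [1865, 1875, 1885, 1895, 1905]
freq  = [1.00, 0.55, 0.30, 0.17, 0.09]

# Fit f(t) = f0 * exp(-k t) by ordinary least squares on log(f).
t = [y - years[0] for y in years]
logf = [math.log(f) for f in freq]
n = len(t)
slope = (n * sum(a * b for a, b in zip(t, logf)) - sum(t) * sum(logf)) \
        / (n * sum(a * a for a in t) - sum(t) ** 2)
k = -slope                      # decay rate per year
half_life = math.log(2) / k     # years for the frequency to halve

print(f"decay rate k = {k:.4f} per year, half-life = {half_life:.1f} years")
```

The steeper the fitted decay, the faster a year fades from the written record — the paper’s measure of how quickly collective memory moves on.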

Google unveiled the tool on 16 December 2010.  One of the first notable discoveries was made by two Harvard postdocs, Erez Lieberman Aiden and Jean-Baptiste Michel, also members of the team that published the original paper in Science.  When comparing German and English texts from the first half of the twentieth century, they discovered that the Nazi regime suppressed mention of the Jewish artist Marc Chagall, and that the n-grams tool could be used to identify artists, writers, or activists whose suppression had hitherto been unknown.  They called their approach culturomics, a reference to the genomics-like scale of the literary corpus.  The term has come to denote a new scientific discipline within the digital humanities — the use of computer algorithms to search for meaning in large databases of text and media.

In the first 24 hours after its launch, the n-grams viewer received more than one million hits.  Dan Cohen, director of the Roy Rosenzweig Center for History and New Media at George Mason University in Fairfax, Virginia, calls the tool a “gateway drug” for the digital humanities, a field that has been gaining pace and funding in the past few years (see ‘A discipline goes digital’).  The name is an umbrella term for approaches that include not just the assembly of large-scale databases of media and other cultural data, but also the willingness of humanities scholars to develop the algorithms to engage with them.

However, some scholars have deep reservations about the digital humanities movement as a whole — especially if it comes at the expense of traditional approaches.  Humanities researchers from traditional camps also complain that their field can never be encapsulated by the frequency charts of words and phrases an n-grams tool produces.  Comparing the contribution books make to the cultural encyclopedia with the corresponding DNA strands of human experience is a dangerous proposition… or just a cultural posthumanist one?

Culturomics 2.0 at TEDx


Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., … & Aiden, E. L. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176-182.
Lin, Y., Michel, J. B., Aiden, E. L., Orwant, J., Brockman, W., & Petrov, S. (2012, July). Syntactic annotations for the Google Books Ngram corpus. In Proceedings of the ACL 2012 System Demonstrations (pp. 169-174). Association for Computational Linguistics.
Aiden, L. (2011). Google Books, Wikipedia, and the future of culturomics.