Can You Read This? Thank a Data Scientist!

Daniel Keys Moran, an American computer programmer and science fiction writer, once said, “You can have data without information, but you cannot have information without data.” This seems like a fairly straightforward way of distinguishing between data and information, right? Data is everywhere; artificially intelligent machine learning software is embedded in nearly all of our technological devices, monitoring and recording our every digital move, to the point where almost every aspect of our daily activity (even sometimes when we’re offline!) is quantified and turned into data. In the digital realm, information derives from the context that gives meaning to this ever-increasing stockpile of data.

As all of this digital information continues to grow and become more diverse and complex, the need arises to better classify and categorize all that data. Dr. Irvine, a professor of Communication, Culture & Technology at Georgetown University, distinguishes among the different kinds of data by splitting them into subgroups. He begins by writing, “Any form of data representation is (must be) computable; anything computable must be represented as a type of data. This is the essential precondition for anything to be ‘data’ in a computing and digital information context” (Irvine, 2019, p. 2). According to Irvine (2019), “data” can be seen as:

  • Classified, named, or categorized knowledge representations (tables, charts, graphs, directories, schedules, etc., with or without a software and computational representation)
  • Information structures (represented in units in bits/bytes, such as internet packets)
  • Types of computable structures (text characters & strings, types of numbers, emojis, etc., with standard byte code representations)
  • Structured vs. unstructured data
    • Structured – data categorized and labeled in a database
    • Unstructured – data transmitted in email, texts, social media, etc. that is stored in data services (like “the cloud”)
  • Representable logical and conceptual structures: an ‘object’ with a class/category and various attributes or properties assigned and understood
  • ‘Objects’ in databases, as units of knowledge representation (such as all items in an Amazon category, or the full list of different movies directed by Quentin Tarantino in IMDb)
  • Decomposed into values and distributions in ML nodal algorithms, such as data points in a graph
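
To make a couple of these categories concrete, here is a minimal Python sketch of my own (it is not taken from Irvine). The `Movie` class, the tiny `catalog` list, and the `titles_by` helper are all hypothetical names invented for illustration. The sketch shows an ‘object’ with a class and assigned attributes, and contrasts structured records with a piece of unstructured text.

```python
from dataclasses import dataclass


@dataclass
class Movie:
    """A hypothetical 'object': one class with named, typed attributes."""
    title: str
    director: str
    year: int


# Structured data: every record is categorized and labeled the same way.
catalog = [
    Movie("Pulp Fiction", "Quentin Tarantino", 1994),
    Movie("Kill Bill: Volume 1", "Quentin Tarantino", 2003),
    Movie("Jaws", "Steven Spielberg", 1975),
]


def titles_by(director, movies):
    """Return the titles of all movies with the given director attribute."""
    return [m.title for m in movies if m.director == director]


# Unstructured data: free text with no labeled fields to query directly.
unstructured = "just rewatched Pulp Fiction, still Tarantino's best imo"

print(titles_by("Quentin Tarantino", catalog))
print(len(unstructured), "characters of unlabeled text")
```

Because the structured records carry labels, a question like "which movies did Tarantino direct?" is a one-line filter. Answering the same question from the free-text message would first require interpreting the text.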

One subgroup of data I found particularly interesting was the types of computable structures (data types, that is, rather than typing on a keyboard), which include the characters defined by the Unicode Standard. As Irvine (2019) writes, “Unicode is the data ‘glue’ for representing the written characters (data type: ‘string’) of any language by specifying a code range for a language family and standard bytecode definitions for each character in the language” (p. 3). This includes all the letters, numbers, symbols, and accents of different languages, as well as special characters from math and science and even emojis! Each of these small units of language and expression is assigned its own code point, which is encoded as a specific sequence of bytes that software must decode and render before we can make meaning out of it.
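
Here is a small Python sketch of that idea. The `describe` helper is a hypothetical name of my own, and I am assuming UTF-8 as the encoding (Unicode defines others as well). It prints the code point of a few characters alongside the bytes that represent them.

```python
def describe(ch):
    """Return (code point, UTF-8 bytes in hex) for a single character."""
    code_point = f"U+{ord(ch):04X}"
    utf8_hex = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    return code_point, utf8_hex


# A Latin letter, an accented letter, a math symbol, and an emoji.
samples = ["A", "\u00e9", "\u03c0", "\U0001F44D"]

for ch in samples:
    code_point, utf8_hex = describe(ch)
    print(ch, code_point, utf8_hex)
```

Under UTF-8, the plain letter A fits in a single byte, while the thumbs-up emoji needs four.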

According to the Unicode website, there are over 1,700 different emojis for a modern digital keyboard (when taking into account the different skin tone variations of each), and each is represented slightly differently across platforms like Google, Facebook, and Twitter. That’s a LOT of data, packaged as information, and stuffed into our sleek and organized emoji libraries before projecting “character shapes to pixel patterns on the specific screens of devices” (Irvine, 2019, p. 4).
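
The skin tone variations are a nice illustration of how this works under the hood: a toned emoji is stored as a base emoji followed by a separate modifier character. The Python sketch below is my own illustration (the variable names are hypothetical) and uses only the standard `unicodedata` module to list the code points behind one such variation.

```python
import unicodedata

thumbs_up = "\U0001F44D"    # base emoji
medium_tone = "\U0001F3FD"  # skin tone modifier (Fitzpatrick type 4)

# A toned emoji is the base character followed by the modifier.
toned = thumbs_up + medium_tone

# On screen it looks like one symbol, but it is two code points.
print(toned, "is made of", len(toned), "code points")

for ch in toned:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Whether the pair is drawn as a single toned emoji or as two separate symbols depends on the platform and font doing the rendering.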

As you can see, we rely on databases to store, categorize, retrieve, and render information for us in a myriad of ways on a daily basis. From sending emails, to shopping on Amazon, to choosing a show on Netflix, to checking the statistics of your favorite athlete or team, to simply using a smartphone app, databases are always at work, collecting, distributing, and computing information at a clip that’s extremely hard to fathom for the average person.

However, while these databases may seem “artificially intelligent” and autonomous (and many are equipped with machine learning AI algorithms to expedite their processes), they still must be designed, created, coded, managed, and maintained by human computer scientists. In their book Data Science, Kelleher and Tierney (2018) confirm that the total autonomy of these complex databases is a popular myth, saying, “In reality, data science requires skilled human oversight throughout the different stages of the process. Human analysts are needed to frame the problem, to design and prepare the data, to select which ML algorithms are most appropriate, to critically interpret the results of the analysis, and to plan the appropriate action to take based on the insight(s) the analysis has revealed” (p. 34).

So take a moment to appreciate the impressive work that data scientists do, even if most of it is behind the scenes (or behind the screens, if you will). We owe a lot of our digital luxuries to their difficult, meticulous jobs. And for that, I say 👍👏😁.

 

References
Irvine, M. (2019). Distinguishing Kinds and Uses of “Data” in Computing and AI Applications. Retrieved from https://drive.google.com/open?id=1C0zQ9md4WG5VswVdBOCkyw28L39HGZXv
Kelleher, J. D., & Tierney, B. (2018). Data science. Cambridge, Massachusetts: The MIT Press.
Moran, D. K. (n.d.). BrainyQuote. Retrieved from https://www.brainyquote.com/quotes/daniel_keys_moran_230911?src=t_data
The Unicode Consortium. (n.d.). Retrieved from https://unicode.org/