Unicode: An innovation creating commonalities between the black boxes of data encoding

Before diving into the more complex question of exactly how you are able to read the letters, words, and phrases I am typing out on my computer right now – aka these “data types” – we can briefly describe what we commonly understand by “data,” at least in the context of computing and coding. Professor Irvine defines it this way: “‘Data’, in our contexts of computing and information, is always something with humanly imposed structure, that is, an interpretable unit of some kind understood as an instance of a general type. […] (T)o be interpretable, [data] is something that can be named, classified, sorted, and be given logical predicates or labels (attributes, qualities, properties, and relations)” (Irvine, 2021, 1). As we briefly touched upon in our last class, a “token” can stand for something else, something that can be represented as something, something that is immediately related and connected to something else. Data can be a token or tokens. “Data is inseparable from the concept of representation” (Irvine, 2021, 2). Data alone would not stand for anything if it didn’t actually represent something. In this context of computing and information, representation means a “computable structure” made of “defined byte sequences capable of being assigned to digital memory and interpreted by whatever software layer or process corresponds to the type of representation — text character(s), types of numbers, binary arrays, etc.” Simply put, this is also why “data” is held in ‘higher esteem’ than “information.” I imagine information as the biggest mass of general, undefined, ‘unsupervised’ facts, clues, concepts, etc.; whatever it is, it can just exist, it can ‘tell’ us something, it can let us know of something, but it doesn’t have the purposefully structured nature and meaningful representational role that “data” has.

Part of this data are the data types we all know: texts, images, sounds, etc. If I send the words “Hello! How are you?” from my iPhone to someone with a Samsung, they will receive, letter by letter and symbol by symbol, the same thing. If I copy and paste the same phrase from WordPress to my notes, to a personal message on Facebook, to someone on WhatsApp, to a chat room on Twitch, etc., the exact same message will appear once again. The reason for this is Unicode. Unicode is the international standard, an “information technology standard” (Wikipedia, 2021), that has been created and formatted so that all computing devices and software applications interpret the same representation throughout the world. “Unicode is thus the universal internal language of computing” (Irvine, 2021, 5). It is the data standard for representing written characters, aka strings, “of any language by specifying a code range for a language family and a standard bytecode definition for each character in the language” (Irvine, 2021, 3). The reason we are able to read text on any device (emails, messages, etc.) is Unicode. “Unicode is what is inside the black boxes of our devices with pixel-based screens and software layers designed for text characters and emoji” (Irvine, 2021, 5). Joe Becker, in his August 1988 draft proposal for this character encoding system, explains that even the name fits, as it is “intended to suggest a unique, unified, universal encoding” (Wikipedia, 2021).
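As a small demonstration (a minimal Python sketch, not anything specific to iMessage, WordPress, or WhatsApp), the phrase travels as a sequence of Unicode code point numbers, and those numbers are what every device agrees on:

```python
# The phrase is stored and transmitted as Unicode code points,
# not as pixels or a particular font.
message = "Hello! How are you?"

for ch in message:
    # ord() gives the numeric code point; U+XXXX is the conventional notation.
    print(f"{ch!r} -> U+{ord(ch):04X}")

# 'H' -> U+0048, 'e' -> U+0065, 'l' -> U+006C, ... identical on any platform.
```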

Some fun/interesting facts about Unicode (Wikipedia 2021; Wisdom, 2021):

  • total of 143,859 characters 
  • 143,696 graphic characters 
  • 163 format characters 
  • 154 modern and historic scripts, as well as symbols and emojis 
  • current version: 13.0  
  • Unicode cannot run out of space. If it were linear, we would run out in 2140 AD! (a quick sanity check on the numbers follows this list) 
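To put the “cannot run out of space” point in perspective, here is a quick back-of-the-envelope check in Python (the percentage is my own arithmetic from the numbers above, not an official figure):

```python
# Unicode's code space runs from U+0000 to U+10FFFF.
code_space = 0x10FFFF + 1      # 1,114,112 possible code points
assigned = 143_859             # characters assigned as of Unicode 13.0 (see list above)

print(code_space)                            # 1114112
print(f"{assigned / code_space:.1%} in use") # ~12.9% of the code space is assigned
```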

The reason Unicode works as a standard for everyone is that these data standards cannot be tied to any specific system, software, or platform, but need to be “independent of any software context designed to use them.” They can work with any software or device because they reference bytecode units, which are independent data. “What we see on our screens is a software ‘projection’ of the bytecode interpreted in whatever selected font style for ‘rendering’ the pixel pattern on screens” (Irvine, 2021, 4). How it all comes together is with the aid of The Unicode Standard, which uses code charts for visual representation, encoding methods, standard character encodings, reference data files, character properties and rules, etc., and “provides a unique number for every character, no matter what platform, device, application or language” (Unicode Technical Site, 2021). If you think about it, it is pretty cool that we were all able to agree on something (of course, without getting into the complications, issues, biases, etc. that come with adopting Unicode); in a cliché way, technology did bring us (almost) all together! For text processing, Unicode assigns a unique code point, a number, to each character, so it represents the character in a general computing format; the visual presentation, i.e. font, shape, size, etc., is taken care of by other software. Unicode provides the meaning, the “what it is” (Unicode Technical Site, 2021).
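As a small illustration of that division of labor (a minimal Python sketch: ord() reports the code point number and the standard library’s unicodedata module reports the standard character name; nothing here is specific to any one platform):

```python
import unicodedata

# Unicode supplies the identity of a character (a code point and a name);
# the font and rendering software decide how that identity is drawn on screen.
for ch in ["A", "é", "Ω", "€"]:
    print(f"{ch}  U+{ord(ch):04X}  {unicodedata.name(ch)}")

# A  U+0041  LATIN CAPITAL LETTER A
# é  U+00E9  LATIN SMALL LETTER E WITH ACUTE
# Ω  U+03A9  GREEK CAPITAL LETTER OMEGA
# €  U+20AC  EURO SIGN
```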

Unicode uses different types of character encodings, the Unicode Transformation Formats (UTF): “an algorithm mapping from every Unicode code point […] to a unique byte sequence (the ISO/IEC 10646 standard uses the term ‘UCS transformation format’)” (Unicode Technical Site, 2021). Most commonly used are UTF-8, UTF-16, and UTF-32. UTF-8 is the byte-oriented encoding form and the dominant encoding on the World Wide Web; its first 128 characters match ASCII (American Standard Code for Information Interchange), which means plain ASCII text is already valid UTF-8 (Unicode Technical Site, 2021; Wikipedia, 2021). 
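To make the difference between these encoding forms concrete, here is a minimal Python sketch (the sample string is my own; the byte counts are simply what Python’s standard encoders produce):

```python
# One string, three Unicode encoding forms.
text = "Hi ✓"   # '✓' is U+2713 CHECK MARK

for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    encoded = text.encode(encoding)
    print(f"{encoding:9}  {len(encoded):2} bytes  {encoded.hex()}")

# utf-8       6 bytes: 'H', 'i', ' ' stay one ASCII byte each; '✓' takes three bytes
# utf-16-be   8 bytes: every character here takes two bytes
# utf-32-be  16 bytes: every character always takes four bytes

# ASCII compatibility: pure-ASCII text is already valid UTF-8, byte for byte.
assert "Hi".encode("ascii") == "Hi".encode("utf-8")
```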

 

[Embedded video: Crash Course Computer Science]

Since emojis have to be interpreted by a software context, all emojis must have Unicode bytecode definitions in order to work across devices, software, and graphic rendering. Still, updates with new emojis are not always consistent from one device or system to the next. For example, the typical red heart emoji ❤️ (note the irony that this heart emoji looks different on my iOS system than on WordPress) would sometimes show up as a smaller dark black heart, or a box with a “?” would appear in the emoji’s place if you hadn’t updated your version. Is this due to non-updated bytecode definitions? Or is it because the software/system didn’t use/follow the ISO/IEC standards? Is this why each company/program/software has its own “look”/style for each emoji, because that is how it translates the Unicode CLDR data? Does the same apply for unreadable fonts, as mentioned in the readings as a problem that arises with Unicode? 
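For reference, that red heart is itself an interesting case: it is actually two code points, an older text character plus a variation selector requesting the emoji presentation (a quick check in Python; how the resulting glyph looks is still up to each platform’s own emoji font):

```python
import unicodedata

# The red heart emoji as typically typed: an old text character plus
# VARIATION SELECTOR-16, which asks for the colorful emoji presentation.
heart = "\u2764\uFE0F"

for ch in heart:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# U+2764  HEAVY BLACK HEART
# U+FE0F  VARIATION SELECTOR-16
```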

I’d like to further look into the connection between Unicode and software stack design. How do they connect to each other, and how does one symbol go through the “journey” from Unicode to the process of adopting whatever font, size, and color it is given?

 

References 

Irvine, Martin. (2021). “Introduction to Data Concepts and Database Systems.”

Crash Course Computer Science (video series).

Kelleher, John D., and Brendan Tierney. (2018). Data Science. Cambridge, MA: The MIT Press.

Wikipedia. (2021). “Unicode.”

Unicode Technical Site. (2021). The Unicode Standard.

Unicode Technical Site. (2021). “Emoji.”

Wisdom. (2021). “Unicode is Awesome.”