It has always fascinated me to watch my Japanese colleagues write their monthly reports in their own language and then use the same computer to send us emails in English. In fact, the first question I had when we started deblackboxing the computing system in this course was: how do I use the same device, or the same system, to send English texts to my English-speaking friends and Arabic texts to my Arabic-speaking friends? In the third week, we were introduced to the binary model used by the computing system “in its logic circuits and data.” It was easy to understand how decimal numbers are represented in a binary system, but not letters or whole texts. The readings for this week deblackbox another layer of the computing system and differentiate between two methods of digital encoding: digital text data and digital image data. For this week’s assignment, I will try to reflect my understanding of how digital text data (for Natural Language Processing) is encoded so that it can be interpreted by any software.
Before diving into digital text data encoding, I will start by defining data. Professor Irvine’s reading for this week defines data “as something with humanly imposed structure, that is, an interpretable unit of some kind understood as an instance of a general type.” By interpretable, we mean that data is something that can be named, classified, sorted, and given logical predicates or labels. It is also important to mention that without representation (computable structures representing types) there is no data. In this context, computable structures mean “byte sequences capable of being assigned to digital memory and interpreted by whatever software layer or process corresponds to the type of representation” — text characters in our case.
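A small Python sketch (my own illustration, not from the readings) makes this point concrete: the very same byte in memory is “data” only once a type is imposed on it.

```python
# A single byte has no meaning until a type is imposed on it.
raw = b"\x41"  # one byte in memory: bit pattern 0100 0001

# Interpreted as an unsigned integer, it is the number 65.
as_number = raw[0]

# Interpreted as ASCII/UTF-8 text, the same byte is the letter 'A'.
as_text = raw.decode("ascii")

print(as_number, as_text)  # 65 A
```

The byte pattern never changes; only the interpreting layer — the “type of representation” — decides whether we see a number or a character.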
The story of digital text data encoding starts with the ASCII (American Standard Code for Information Interchange) table, shown in table A. Bob Bemer developed the ASCII coding model to standardise the way computing systems represent letters, numbers, punctuation marks, and some control codes. In table A below, you can see that every modern English letter (lowercase and capital), punctuation mark, and control code has its equivalent in the binary system. The seven-bit binary system used by ASCII represented only 128 code points (0–127) of English letters, symbols, and control codes. While the bit patterns of the printable ASCII characters are sufficient to exchange information in modern English, most other languages need additional symbols that are not covered by ASCII.
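The character-to-number mapping in table A can be demonstrated in a few lines of Python (a minimal sketch of my own, using Python’s built-in `ord`, `chr`, and `format`):

```python
# Each ASCII character maps to a 7-bit number (code point 0-127).
for ch in ("A", "a", "!"):
    code = ord(ch)              # character -> ASCII code point
    bits = format(code, "07b")  # the 7-bit binary pattern from the table
    print(f"{ch!r} -> {code} -> {bits}")

# The mapping is reversible: the number 65 decodes back to 'A'.
print(chr(65))  # A
```

This is exactly the lookup the ASCII table standardises: a character on one side, a fixed 7-bit pattern on the other.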
Extended ASCII sought to remedy this problem by utilizing the eighth bit of an 8-bit byte, opening positions for another 128 printable characters. Early encodings had been limited to 7 bits because of restrictions in some data transmission protocols, and partly for historical reasons. At this stage, extended ASCII was able to represent 256 characters (code points 0–255), as you can see in table B. However, as we read in Yajing Hu’s final project essay, Han characters, other Asian language families, and many more international characters were needed than could fit in a single 8-bit character encoding. Unicode was created to solve this problem.
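We can see both the gain and the limit of the eighth bit in Python. Here I use Latin-1 (ISO 8859-1) as a representative 8-bit extended-ASCII encoding — one of several such extensions, chosen only for illustration:

```python
# 'é' fits in the 8-bit Latin-1 extension of ASCII: one byte.
print("é".encode("latin-1"))  # b'\xe9'

# ...but it does not fit in 7-bit ASCII:
try:
    "é".encode("ascii")
except UnicodeEncodeError as e:
    print("not representable in ASCII:", e.reason)

# And a Han character fits in neither single-byte encoding:
try:
    "汉".encode("latin-1")
except UnicodeEncodeError as e:
    print("not representable in Latin-1:", e.reason)
```

The extra 128 positions help European alphabets, but no single byte can hold the tens of thousands of Han characters — which is precisely the problem Unicode addresses.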
Unicode is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. It is intended to address the need for a workable, reliable world text encoding. Unicode was originally described as “wide-body ASCII” stretched to 16 bits; its code space has since grown to encompass the characters of all the world’s living languages. Depending on the encoding form we choose (UTF-8, UTF-16, or UTF-32), each character is represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit. UTF-8 is most common on the web; UTF-16 is used by Java and Windows; UTF-8 and UTF-32 are used by Linux and various Unix systems. The conversions between all the UTFs are algorithmic, fast, and lossless. This makes it easy to support data input or output in multiple formats while using a particular UTF for internal storage or processing.
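The three encoding forms, and the lossless conversions between them, can be observed directly in Python (my own sketch; I use the big-endian variants `utf-16-be` and `utf-32-be` so no byte-order mark is added):

```python
text = "Aé汉"  # one ASCII, one Latin, and one Han character

utf8  = text.encode("utf-8")      # variable width: 1-4 bytes per character
utf16 = text.encode("utf-16-be")  # one or two 16-bit code units per character
utf32 = text.encode("utf-32-be")  # fixed width: one 32-bit unit per character

print(len(utf8), len(utf16), len(utf32))  # 6 6 12

# Converting between the UTFs is a lossless round trip:
print(utf8.decode("utf-8") == utf16.decode("utf-16-be") == text)  # True
```

The same three characters take different numbers of bytes in each form, yet every form decodes back to the identical text — which is why software can pick whichever UTF suits its internal storage.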
The important question now is: how can software interpret and format text characters of any language? To answer this question, I’ll go back to Professor Irvine’s definition of data “as something with humanly imposed structure, that is, an interpretable unit of some kind understood as an instance of a general type.” My takeaway here is that the only way for software to process text data is to represent characters in bytecode definitions. These bytecode definitions work independently of any software that is designed for them. With that said, and in conclusion, Unicode uses the binary system (bytecode character definitions) designed to be interpreted as a data type for creating instances of characters as inputs and outputs through any software.
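This brings me back to my opening question about emailing in English and Arabic from the same machine. A short Python sketch (my own example) shows why it works: any software that knows the declared encoding recovers exactly the same characters from the byte sequence, whatever the language.

```python
# The sender's software encodes the characters to a byte sequence:
arabic = "مرحبا"  # "hello" in Arabic
payload = arabic.encode("utf-8")

# Any receiving software that knows the declared encoding recovers
# exactly the same characters, independent of platform or program:
received = payload.decode("utf-8")
print(received == arabic)  # True

# English text travels through the very same machinery:
print("hello".encode("utf-8").decode("utf-8"))  # hello
```

The bytecode definitions stand between the two programs: neither needs to know anything about the other, only about Unicode.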
- Peter J. Denning and Craig H. Martell, Great Principles of Computing (Cambridge, MA: The MIT Press, 2015), p. 35.
- Prof. Irvine, “Introduction to Data Concepts and Database Systems.”
- ASCII Table and Description.
- Han Ideographs in the Unicode Standard (CCT).
- ISO/IEC 8859.
Kelleher reading: The constraints in data projects relate to which attributes to gather and which attributes are most relevant to the problem we are solving. Who decides which data attributes to choose? Can we apply the principle of levels to data attributes?