Category Archives: Week 5

Light to Digital to Data

Perhaps the concept that has been easiest for me to grasp thus far is how light is transformed into machine code, then into RGB triplets, and subsequently stored as data. The process from camera to digitization begins with electricity (as with most things concerning computation). Certain materials, in conjunction with chemicals, react when interacting with light, producing electrical charges. (White p.68, 2007) These charges are processed through a semiconductor and an ADC (analog-to-digital converter), and then passed to the microprocessor. (White p.69, 2007) Once the digitized signals reach the microprocessor, they are transformed into RGB (Red, Green, Blue) triplets.

Each of the red, green, and blue channels has 256 different shades, represented by bits. Across the entire RGB spectrum there are 256^3 (16,777,216) possible triplet combinations. (White p.101, 2007) These RGB triplets are formed into an array and stored as data, but before this occurs, an algorithm is run to fill in missing color values based on surrounding pixels. Once the array is produced, the data can be stored as RAW, uncompressed, or lossy-compressed files. (White p.112, 2007) One of the most common ways of storing images is as JPEG (or "jpg") files, which will be discussed in the following paragraph. (White p.112, 2007)
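
To make the arithmetic concrete, here is a short Python sketch (my own illustration, not from White) of one pixel's RGB triplet and of how many colors 8 bits per channel allows:

    bits_per_channel = 8
    levels = 2 ** bits_per_channel          # 256 shades per channel
    total_colors = levels ** 3              # 16,777,216 possible triplets

    pixel = (200, 34, 150)                  # hypothetical (red, green, blue) triplet
    print(levels, total_colors)             # 256 16777216
    print(bytes(pixel))                     # the 3 bytes that would be stored for this pixel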

When encoding digital images, standard formats are important for multiple reasons. One reason a standard format like JPEG is so practical is its ability to compress digital files. File compression is essential for the efficient transfer of data over the internet. JPEG takes a digital image and uses an algorithm that finds pixel colors recurring many times within the image; these become "reference pixels." (White p.113, 2007) Once the reference pixels are determined, they are used to limit the file size by collapsing imperceptibly different RGB triplets into the standard reference pixel across the entire image. Furthermore, it is possible to control the level of compression in a JPG file to suit the needs of the user. (White p.113, 2007) The standardization of JPEG was critical for future innovation in the file encoding format. Because JPEG files are standardized across operating systems and programming languages, it is easier to incorporate the technology into future updates and technological progress.
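
As a rough illustration of controlling the compression level, here is a hedged Python sketch using the Pillow library (an assumption on my part, not something from the reading; the file name "photo.png" and the quality values are made up):

    import os
    from PIL import Image

    img = Image.open("photo.png").convert("RGB")
    img.save("photo_q90.jpg", "JPEG", quality=90)   # light compression, larger file
    img.save("photo_q30.jpg", "JPEG", quality=30)   # heavy compression, smaller file

    # Compare the resulting file sizes on disk.
    print(os.path.getsize("photo_q90.jpg"), os.path.getsize("photo_q30.jpg"))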

Questions:

What is RGBA, and how does it interact with regular RGB? I understand it has something to do with transparency. Why RGB instead of RYB? Red, yellow, and blue are the primary colors and green is a secondary color, so it is confusing to me why green became one of the colors recognized in RGB triplets. What are some examples of non-standard data formats? Can standard formats sometimes have a limiting effect on innovation, since they might reduce the incentives to innovate? What is the difference between hexadecimal and binary representations of RGB triplets?

References

Digital images—Computerphile—Youtube. (n.d.). Retrieved February 22, 2021, from https://www.youtube.com/watch?v=06OHflWNCOE

Images, pixels and rgb. (n.d.). Retrieved February 22, 2021, from https://www.youtube.com/watch?v=15aqFQQVBWU

White, R. (2007). How digital photography works (2nd ed). Que.

How Data Works over the Level of E-Information

In digital computing discourse, the term "data" also differs from its traditional sense, just as "information" does. According to Professor Irvine, there is "no data without representations." Whatever its type, data must be interpretable in a software layer, processable by the computing system, and storable as files in memory. That is, "any form of data representation must be computable; anything computable must be represented as a type of data." This concept is also connected to information theory. As we learned last week, information, in the digital computing context, is a physical concept: it is encoded into electronic signals that are communicable through a transmission channel but not directly observable. In this sense, "data" is closer to meta-information (that is, information in the generic sense).

On that basis, we can explain why "data" sits a level above "information." E-information, at its level, structures "the code for data at next level up and code for operations, interpretations, and transformation of, or over, the representations." Information can be understood as playing a functional role at the physical computer-system level, with its strings of binary code, so that data, at the next level up, can be interpretable by humans (as representations) and computable by the computer.

All formats, including text (like TXT) and images (like JPEG), work the same way: they are "long lists of numbers, stored as binary on a storage device" and encoded as "digital data." In text formats, words are coded by Unicode through different character encodings so that words (or visual symbols) in different languages can be represented on the computer. Take emoji as an example. In the emoji system, each emoji has a unique code point, and it can be combined with another code point to form a new emoji. For example, the code of "👶" (baby) is "1F476" (which is translated into binary so that the computer can process it). It can be combined with a skin-tone code like "1F3FB"; then we get a baby with a light skin tone, "👶🏻" ("1F476 + 1F3FB"). Other formats, like images, work in a similar way. Images are formed from pixels, which are combinations of three colors: red, green, and blue. "An image format starts with metadata (key values for image), such as image width, image height, and image color." The color of each pixel is divided into three parts (red, green, and blue), each part taking a maximum of 8 bits, or 1 byte. For example, (0, 0, 0) is black, meaning zero intensity of red, green, and blue (the largest value for each color is 255). With a value for each pixel, we get an image made of a certain number of pixels. In this process, the code for each pixel is also translated into binary code for the computer to interpret.
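
A tiny Python sketch of the emoji example above (the code points come from the text; the rest is just Python's built-in string handling):

    baby = "\U0001F476"          # 👶
    light_skin = "\U0001F3FB"    # skin-tone modifier

    print(hex(ord(baby)))                       # 0x1f476
    print(baby + light_skin)                    # 👶🏻  (rendered as one emoji by most systems)
    print((baby + light_skin).encode("utf-8"))  # the bytes the computer actually stores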

Reference

Irvine, M. Introduction to Computer System Design, 2020.

Kelleher, J. D., and B. Tierney. Data Science. MIT Press, 2018. 

Irvine, M. Introduction to Data Concepts and Database Systems, 2021.

“Unicode.” In Wikipedia, February 21, 2021. https://en.wikipedia.org/w/index.php?title=Unicode&oldid=1008164095.

CrashCourse. 2017c. Files & File Systems: Crash Course Computer Science #20. https://www.youtube.com/watch?v=KN8YgJnShPM&list=PL8dPuuaLjXtNlUrzyH5r6jN9ulIgZBpdo&index=21.

 

Question:

Can data be interpreted as a term similar to meta-information in the digital computing sense?

 

 

Color, conversion, and data.

Data is an interesting concept because I work with data all day. Unlike information, data is much more meaningful because it coalesces many pieces of information and gives them relationships to each other.

The two things I took away from all this are SQL databases and the true phenomenon of color pictures.

SQL databases sound pretty boring from the outside, but a lot of data is stored this way, especially institutional data. I never understood what they were or which organizations use them; from the outside it seems silly to store data in separate tables and to write code to retrieve it each time. But after learning about memory constraints and the way memory is written, I am starting to understand it. SQL is a great way to store data when there are constraints and when data can grow indefinitely in length and variety. If you have several hundred million rows storing multiple pieces of information, it would be unwise to keep it all in one place: retrieving it would become a nightmare for wait times and leave you with a wall of information you most likely don't want or need. It has always been an interest of mine to learn SQL, since it seems like a very fundamental data language, and with this understanding it seems all the more important to learn.
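
To illustrate, here is a minimal Python sketch using the built-in sqlite3 module with made-up data; the point is that a query pulls back only the rows you ask for instead of the whole wall of information:

    import sqlite3

    conn = sqlite3.connect(":memory:")          # throwaway in-memory database
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "Ana", 19.99), (2, "Ben", 5.50), (3, "Ana", 42.00)],
    )

    # Retrieve only Ana's orders rather than the whole table.
    for row in conn.execute("SELECT id, total FROM orders WHERE customer = ?", ("Ana",)):
        print(row)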

I wanted to explain pictures because they fascinate me. Pictures are broken down into smaller and smaller pieces until each piece is the size of one pixel; that pixel then contains three values which ultimately act like levers to create all the colors we can see on a screen. This ranges from white (all the colors turned on all the way) to black (all the colors turned off). These values take the form of numbers. The numbers can change depending on the format you use to read the information in and out, since some programs have more range than others. This applies to file types as well: some file types are richer than others, creating a need to convert these higher-end files into smaller, more readable files. Whenever this translation happens you lose something, be it color or resolution (the density of pixels).

Since the information is stored within three different colors (red, green, and blue), it requires three times the space to store one pixel. These pixels ultimately take up a lot of room, as identifiable images might require anywhere from 256 to 10,000 pixels even on the low end. The more pixels, the more we are able to discern small details in the background. It was only in the last decade that digital photography became the standard for professional photographers, as film had always been able to capture better, more vibrant images than digital cameras.
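
A quick back-of-the-envelope sketch in Python (the image dimensions are hypothetical) of how fast three bytes per pixel adds up:

    width, height = 3000, 2000
    bytes_per_pixel = 3                       # one byte each for red, green, blue

    raw_size = width * height * bytes_per_pixel
    print(raw_size, "bytes before any compression")
    print(raw_size / 1_000_000, "MB")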

The data is stored as binary, with metadata informing the structure, size, and type of the file being handled. There are many different ways to store this color information, such as hexadecimal or RGB color values, but what matters is the conversion of that information into the format you intend to use. JPG, which is widely used for documents, loses some information about color and resolution to make files smaller, whereas PNG files favor rich data output over shrinking the file size. It's important to note that each time you download and convert the file, you lose something, so it's important to try to work from as close to the source as possible.

Questions:

How do video games work as a sequence of colors and text, given that they are a moving, changing format?

Is the text in movies stored as text or as color?

At which point will we no longer need to compress information because storage capacity and transfer speeds will be good enough?

Technical Document for Data Science, Coding, and Use in Database, Computing and AI – Heba Khashogji

The many contexts and uses of the terms “information” and “data” make these terms perplexing and confusing outside an understood context. Using the method of thinking in levels and our contexts for defining data concepts, outline for yourself the concept of “data” and its meaning in two of the data systems we review this week. One “system” is the encoding of text data in Unicode for all applications in which text “data” is used; others are database management systems.

What is Data Science?

Data science incorporates a set of principles, problem identification, algorithms, and processes for extracting non-obvious and useful patterns from large data sets. Many elements of data science have been developed in related fields, such as machine learning and data mining. In fact, the terms data science, machine learning, and data mining are often used interchangeably. The commonality across these disciplines is a focus on improving decision making through the analysis of data. However, although data science borrows from these other fields, it is broader in scope. Machine learning (ML) focuses on the design and evaluation of algorithms for extracting patterns from data. Data mining typically deals with the analysis of structured data and often implies a focus on commercial applications.

A Brief History of Data Science.   

The term data science can be traced back to the 1990s. Nevertheless, the fields it draws on have a much longer history. One thread in this longer history is the history of data collection; another is the history of data analysis. In this section, we review the main developments in these threads and describe how and why they converged into the field of data science. Of necessity, this review introduces new terminology as we define and name the important technical innovations as they arose. For each new term, we provide a brief explanation of its meaning; many of these terms are returned to later in the book with a more detailed description. We begin with a history of data collection, then provide a history of data analysis, and, finally, cover the development of data science. (Kelleher and Tierney, 2018)

Document and Evidence

The word information commonly refers to bits, bytes, books, and other signifying objects, and it is convenient to refer to this class of objects as documents, using a broad sense of that word. Documents are essential because they are considered evidence. 

The Rise of Data Sets.

Academic research projects typically generate data sets, but in practice, it is generally impractical for anyone else to attempt to make further use of these data, even though significant research funders now mandate that researchers have a data management plan to preserve generated data sets and make them accessible.

Naming

Finding operations depend heavily on the names assigned in document descriptions and on the named categories to which documents are assigned. Naming is a language activity and so inherently a cultural activity. For that reason, we introduce a brief overview of the issues, tensions, and compromises involved in describing collected documents. The notation can be codes or ordinary words. Linguistic expressions are necessarily culturally grounded and so unstable; for that reason, they are in tension with the need for stable, unambiguous marks if systems are to perform efficiently.

The First Purpose of Metadata: Description

The primary and original use of metadata is to describe documents. There are various types of descriptive metadata: technical (describing the format, encoding standards, etc.) and administrative. These descriptions help in understanding a document's character and in deciding whether to make use of it. Description can be instrumental even if nonstandard terminology is used.

The Second Use of Metadata: Search

Thinking of metadata as describing individual documents reflects only one of its two roles. The second use of metadata is different: it emerges when you start with a query or a description rather than the document (with the metadata rather than the data), as when searching in an index. This second use of metadata is for finding, search, and discovery. (Buckland, 2017)

Both “information” and “data” are used in general and undifferentiated ways in ordinary and popular discourse. Still, to advance in our learning for AI and all the data science topics that we will study, we all need to be clear on these terms and concepts’ specific meanings. The term “data” in ordinary language is a vague, ambiguous term. We must also untangle and differentiate the uses and contexts for “data,” a key term in everything computational, AI, and ML.

No Data without Representation.

In whatever context and application, “data” is inseparable from the concept of representation. A good slogan should be “no data without representation” (which can be said of computation in general). By “representation”, we mean a computable structure, usually of “tokens” (instances of something representable) corresponding to “types” (categories or classes of representation, roughly corresponding to a symbolic class like text character, text string, number type, matrix/array of number values, etc.). (Irvine, 2021).

Knowledge of database technology increases in importance every day. Databases are used everywhere: they are fundamental components of e-commerce and other Web-based applications, and they lie at the core of an organization's operational and decision-support applications. Databases are also used by thousands of workgroups and millions of individuals. It is estimated that there are more than 10 million active databases in the world today.

This book aims to teach the essential relational database concepts, technology, and techniques that you need to start a career as a database developer. The book cannot teach everything that matters in relational database technology, but it will give you enough scope to create your own databases and to participate as a team member in developing larger, more complex databases. (Kroenke et al., 2017)

The data type of an attribute (numeric, ordinal, nominal) affects the methods we can use to analyze and understand the data: both the methods we use to describe the distribution of values the attribute takes and the more complex algorithms we use to identify patterns of relationships between attributes. At the most basic level of analysis, numeric attributes allow arithmetic operations. The typical statistical analysis applied to a numeric attribute is to measure its central tendency (using the mean value of the attribute) and the dispersion of its values (using the variance or standard deviation).
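
A small Python sketch of those basic statistics, using the standard library and made-up values:

    import statistics

    ages = [23, 31, 27, 45, 39, 27, 52]       # hypothetical numeric attribute

    print(statistics.mean(ages))              # central tendency
    print(statistics.stdev(ages))             # dispersion (sample standard deviation)
    print(statistics.variance(ages))          # dispersion (sample variance)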

Machine Learning 101

The primary tasks of a data scientist are defining the problem, designing the data set, preparing the data, deciding on the type of data analysis to apply, and evaluating and interpreting the results of that analysis. What the computer brings to this partnership is the ability to process data and search for patterns in it. Machine learning is the field of study that develops the algorithms computers follow to identify and extract patterns from data. ML algorithms and techniques are applied primarily during the modelling stage of CRISP-DM. ML involves a two-step process.

First, an ML algorithm is applied to a data set to identify useful patterns in the data. Second, once a model has been created, it is used for analysis. (Kelleher and Tierney, 2018). 
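
As a toy illustration of that two-step process (my own sketch, not an example from Kelleher and Tierney; the data and the nearest-centroid rule are made up):

    # Hypothetical training data: (number of links, number of capitalized words) per email.
    training = {"spam": [(9, 8), (8, 7)], "not_spam": [(1, 2), (2, 1)]}

    # Step 1: "learn" a pattern -- here, just the average point (centroid) of each class.
    centroids = {
        label: tuple(sum(vals) / len(vals) for vals in zip(*points))
        for label, points in training.items()
    }

    # Step 2: use the model -- assign a new email to the closest centroid.
    def predict(x):
        return min(centroids, key=lambda lbl: sum((a - b) ** 2 for a, b in zip(x, centroids[lbl])))

    print(centroids)
    print(predict((7, 9)))   # expected: 'spam'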

References:

  1. Kelleher, J & Tierney, B (2018). Data Science. The MIT Press: London.
  2. Buckland, M. (2017). Information and Society. The MIT Press: London.
  3. Irvine, M. (2021). Universes of Data: Distinguishing Kinds and Uses of "Data" in Computing and AI Applications.
  4. Kroenke, D., Auer, D., Vandenberg, S. L., & Yoder, R. C. (2017). Database Concepts. Pearson: NY.

Unicode and DBMS in levels — Fudong Chen

The Unicode system is a coding table that links all the written characters of any language to binary codes, one by one. But these codes cannot be stored and displayed as fonts or characters if there is no encoding method to bridge the computer and Unicode. The problem of black squares and gibberish in text files is very common when it comes to Chinese or Japanese words. For example, when I download a Japanese video game, it usually has an introduction and description of the game written in a txt file. But in most cases, the txt file, which should be in Japanese, shows black squares and meaningless gibberish. That's because I don't have the specific encoding for Japanese, or my default encoding method does not fit the txt file.

The Unicode system can be managed in levels:

First, the application level. The data at this level is the character shapes rendered as pixel patterns on digital screens. The data we input with peripherals like keyboards is translated by the Unicode system and finally shown on the screen as strings.

Second, the logic and language level. The data is the characters and their corresponding codes. Unicode is a bridge linking characters and binary codes: every character has its own code, one by one. The text input by computer programs can be translated into its codes and then translated again into the computer's internal representations.

Third, the physical level. The data is bit units stored on disk or in RAM. The Unicode codes are translated into specific bit patterns by an encoding method like UTF-8 so that they can be stored in the computer. Actually, characters and bit units are not in one-to-one correspondence: the same character can be stored as different bit patterns under different encoding methods.

In short: characters (users) ⇌ Unicode ⇌ encoding method (UTF-8) ⇌ bytes ⇌ disk, network, RAM
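
A short Python sketch of that chain for a single (arbitrary) character:

    ch = "字"

    codepoint = ord(ch)                 # Unicode assigns the character a number
    utf8_bytes = ch.encode("utf-8")     # the encoding method turns it into storable bytes

    print(hex(codepoint))               # e.g. 0x5b57
    print(utf8_bytes)                   # e.g. b'\xe5\xad\x97' (3 bytes in UTF-8)
    print(utf8_bytes.decode("utf-8"))   # decoding recovers the same character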

 

As for the DBMS, I do not have experience with it, but I can try to explain the system in levels:

First, the application level. The data is what we see and input through applications, which are designed for users and ask them to enter data in a specific format through a data entry form. This data is transmitted to the DBMS and then translated by the DBMS into data packets sent to the database. The packets returned from the database to the DBMS are likewise translated into information people can understand directly as a result and then shown in the application. The result might be a data form or a specific error message, such as a wrong-input warning.

Second, the system level. The data is SQL statements and the computer's code structures. The parser and grant checking verify the SQL statements from the application. Then the data is passed to semantic analysis and query processing for interpretation and classification. Access management, concurrency control, and the recovery mechanism work according to the type of data and send instructions to the database.

Third, the physical level. The data is bit units. The system distinguishes different bit units in its representations, identifies the access type, and finds the matching storage location in the database according to the data types and the system's file organization.
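
As a hedged sketch of these levels, Python's built-in sqlite3 module can show the application level (a normal query) and a peek at the system level (SQLite's query planner deciding how to reach the stored data); the table and data are made up:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'Fudong')")

    # Application level: a normal query.
    print(conn.execute("SELECT name FROM users WHERE id = 1").fetchall())

    # System level: ask the DBMS how it plans to execute that same query.
    for row in conn.execute("EXPLAIN QUERY PLAN SELECT name FROM users WHERE id = 1"):
        print(row)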

Question:

First, metadata is considered data about data. I am confused by this statement. What is the difference between metadata and the data describing the rules for how to use data in the system? Does the latter belong to the former?

Second, in the reading I learned how the Unicode system shows characters on the screen. But I found an example in the introduction to data and databases that uses six characters to constitute one character on the screen. So can I conclude that the characters of a language do not correspond one-to-one with Unicode code points?

Reference

Buckland, M. K. (2017). Information and society. The MIT Press.

Irvine, M. (n.d.). CCTP-607 Universes of Data: Distinguishing Kinds and Uses of “Data” in Computing and AI Applications. 9.

Kroenke, D. M., Auer, D. J., Vandenberg, S. L., & Yoder, R. C. (2017). Database concepts (Eighth edition). Pearson.

The Basics of Human Communication – Digitally

Best put, the term "data" is inseparable from the concept of representation. In the contexts of computing and information, data is always a humanly imposed structure that must be represented through an interpretable unit of some kind.

All text characters work with reference to an international standard for representing text characters in standard bytecode definitions. In effect, Unicode defines bytecode characters designed to be interpreted as a data type for creating instances of characters; interpretation then happens in the software stack, which projects character shapes onto pixel patterns on a device's screen as output. Unlike some of the other types of communication we have discussed, such as images, GIFs, or MP3 files, Unicode provides a set of software-interpretable numbers for representing the form of each representable character, whereas a binary media file (like those previously mentioned) has no predefined form or size (for memory). I find a personal example of using Unicode funny: in a web design computer science course I took, I was taught to declare UTF-8 when setting up a website. Not until reading the Wikipedia page did I realize that line names one of the most commonly used encodings.

The second way in which we define a concept of data is through database management systems. This relies on a client/server relationship: the client side consists of the software interfaces for creating, managing, and querying the database on a user's or manager's local PC, while the server side is the database management system installed and running on a computer in a data center or a Cloud array of servers. The relational database model is typically worked with through SQL, which uses "Structured Query Language to create a database instance, input data, manage updates, and output data-query results to a client interface," with which the client can "'query' (ask questions or search) data in the system." As an aside to this definition of DBMS as a concept of data, I think something that has helped me deblackbox this idea is the database course I am currently taking. We have not even started to learn SQL; instead, we are given a problem and hand-draw the given data and its relation to the other data. This signifies to me just how much of a human-centered process database design is. It is not just magical Oracle taking care of everything; "a well-designed database is a partial map of human logic and pattern recognition for a defined domain." I think the understanding I have gained can be summed up in Kelleher's Data Science: "One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. Human analysts are needed to frame the problem, to design and prepare the data, to select which ML algorithms are most appropriate, to critically interpret the results of the analysis, and to plan the appropriate action to take based on the insight(s) the analysis has revealed."

Interestingly enough, digital images seem to be defined by data in a way similar to a combination of how we use DBMS and Unicode (for text). "Digital cameras store photographs in digital memory, much like you save a Word document or a database to your computer's hard drive. Storage begins in your camera and continues to your personal computer." To get into the nitty-gritty, an image is stored as a huge array of numbers, and in digital photography each of the three colors (red, green, and blue) can have any of 256 shades, with black as 0 and the purest rendition of that color as 255. The colors on your computer monitor or the little LCD screen on the back of your camera are grouped in threes to form tiny full-color pixels, millions of them. When you change the values of the adjoining colors in the pixels, you suddenly have about 17 million colors at your disposal for each pixel. Essentially, we are expressing colors as numbers, and these pixelated colors form an image. This is similar to Unicode's expression of text as a number, not a glyph, for each character we attempt to encode.
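
A minimal Python sketch of that "image as numbers" idea (the tiny 2 x 2 image is made up):

    image = [
        [(255, 0, 0), (0, 255, 0)],      # a red pixel and a green pixel
        [(0, 0, 255), (255, 255, 255)],  # a blue pixel and a white pixel
    ]

    # "Editing" the image is just changing numbers: dim every channel by half.
    darker = [[(r // 2, g // 2, b // 2) for (r, g, b) in row] for row in image]
    print(darker[1][1])   # (127, 127, 127) -- the white pixel becomes gray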

Kelleher, J. D., and B. Tierney. Data Science. MIT Press, 2018. 

 

“Unicode.” In Wikipedia, February 21, 2021. https://en.wikipedia.org/w/index.php?title=Unicode&oldid=1008164095.

White, R., and T. E. Downs. How Digital Photography Works. Que, 2005. 
 
 
 
Question:
 
If we say all data must be representable in order to be considered data, why is there a separate definition of "data" as representable logical structures and as conceptual structures?

Unicode as a master key

Data, as a layer above information, is something given. "Data," to be interpretable, is something that can be named, classified, sorted, and given logical predicates or labels (attributes, qualities, properties, and relations) (Irvine, 2021). And since data needs to be representable and designed in a form that applies across many platforms and devices, transforming strings of pure data into extensions of a comprehensive human symbolic system, everything makes sense. The problem occurred decades ago when scientists first wanted to share data with each other. In simple words, the architects pre-scripted the language while building the computer, so the device could "work on its own" to a large degree to process the data; the scientists simply needed to decode the result based on their codebook. However, because each computer was designed with a different codebook, sharing and exchanging data became a problem. It required technologists to master many different data formats and also blocked the connections between devices (Instructions & Programs: Crash Course Computer Science #8, 2017, 03:15-05:21).

Unicode was designed to solve these problems. As Professor Irvine mentions: "Unicode is the data 'glue' for representing the written characters (data type: 'string') of any language by specifying a code range for a language family and standard bytecode definitions for each character in the language" (Irvine, 2021). Based on my understanding, the text we see is not actually stored word by word; it is still pictures combined with pictures. Take a Chinese character as an example. When we see the character:

尛 = 小 + 小 + 小

each part of the square is built up from a sequence of binary code, combined with the selected font. After processing by hardware and software, we see the final combination of all three images generating the complete character, pixel by pixel. Unicode provides interpretable data that is accessible to electronic devices, and that's also the reason why we need to use Unicode.
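
A small Python sketch of this, using built-ins only; the rendered shape itself still comes from a font:

    ch = "尛"

    print(hex(ord(ch)))             # its Unicode code point, in hexadecimal
    print(ch.encode("utf-8"))       # the bytes UTF-8 stores for it
    print(len(ch.encode("utf-8")))  # 3 bytes in UTF-8 for this CJK character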

Many devices had the problem that they could only process one language system, usually Latin-derived characters rather than multi-language systems, simply because those characters share more common basic data. To solve this problem, developers use strings from Unicode to allow devices to decode and understand languages on their own and so communicate with each other. In this way, using Unicode saves storage and makes it easier for users to process text in software locally.

References 

Irvine, Martin. (2021). “Introduction to Data Concepts and Database Systems.”

Instructions & Programs: Crash Course Computer Science #8. (2017, April 12). [Video]. YouTube. https://www.youtube.com/watch?v=zltgXvg6r3k&list=PL8dPuuaLjXtNlUrzyH5r6jN9ulIgZBpdo&index=9

Unicode: An innovation creating commonalities between the black boxes of data encoding

Before diving into the more complex attributes of exactly how it is that you are able to read these letters, words, and phrases that I am typing on my computer right now (aka these "data types"), we can briefly describe what we commonly understand by "data," at least in the context of computing and coding. Professor Irvine defines it thus: "'Data', in our contexts of computing and information, is always something with humanly imposed structure, that is, an interpretable unit of some kind understood as an instance of a general type. […] (T)o be interpretable, [data] is something that can be named, classified, sorted, and be given logical predicates or labels (attributes, qualities, properties, and relations)" (Irvine, 2021, 1). As we briefly touched upon in our last class as well, a "token" can stand for something else, something that can be represented as something, something that is immediately related and connected to something else. Data can be a token or tokens. "Data is inseparable from the concept of representation" (Irvine, 2021, 2). Data alone would not stand for anything if it didn't actually represent something. Focusing on this context of computing and information, representation means a "computable structure" and is "defined byte sequences capable of being assigned to digital memory and interpreted by whatever software layer or process corresponds to the type of representation — text character(s), types of numbers, binary arrays, etc." And simply put, we can also say that this is why "data" is considered to be of a 'higher esteem' than "information." I imagine information as the biggest mass of general, undefined, 'unsupervised' facts, clues, concepts, etc.; no matter what it is, it can just exist, it can 'tell' us something, it can let us know of something, but it doesn't have the purposefully structured nature and meaningful representational role that "data" has.

A part of this data are the data types we all know as texts, images, sounds, etc. If I send the words "Hello! How are you?" from my iPhone to someone with a Samsung, they will receive, letter by letter and symbol by symbol, the same thing. If I copy and paste the same phrase from WordPress to my notes, to a personal message on Facebook, to someone else on WhatsApp, to a chat room on Twitch, etc., the same exact message will appear once again. The reason for this is Unicode. Unicode is the international standard, an "information technology standard" (Wikipedia, 2021), created and formatted so that all computing devices and software applications interpret the same representation throughout the world. "Unicode is thus the universal internal language of computing" (Irvine 2021, 5). It is the data for representing written characters, aka strings, "of any language by specifying a code range for a language family and a standard bytecode definition for each character in the language" (Irvine, 2021, 3). The reason we are able to read text on any device, in emails, messages, etc., is Unicode. "Unicode is what is inside the black boxes of our devices with pixel-based screens and software layers designed for text characters and emoji" (Irvine, 2021, 5). Joe Becker, in his August 1988 draft proposal for this character encoding system, explains why even the name matches, as it is "intended to suggest a unique, unified, universal encoding" (Wikipedia, 2021).

Some fun/interesting facts about Unicode (Wikipedia 2021; Wisdom, 2021):

  • total of 143,859 characters 
  • 143,696 graphic characters 
  • 163 format characters 
  • 154 modern and historic scripts, symbols, emojis 
  • current version: 13.0  
  • Unicode cannot run out of space. If it were linear, we would run out in 2140 AD! 

The reason Unicode works as a standard for everyone is that these data standards cannot be tied to a specific system, software, or platform, but need to be "independent of any software context designed to use them." They can work with any software or device because they reference bytecode units, which are independent data. "What we see on our screens is a software 'projection' of the bytecode interpreted in whatever selected font style for 'rendering' the pixel pattern on screens" (Irvine, 2021, 4). How it all comes together is with the aid of the Unicode Standard, which uses code charts for visual representation, encoding methods, standard character encodings, reference data files, character properties and rules, etc., and "provides a unique number for every character, no matter what platform, device, application or language" (Unicode Technical Site, 2021). If you think about it, it is pretty cool that we were all able to agree on something (of course, without getting into the complications, issues, biases, etc. that come with adopting Unicode); in a cliche way, technology did bring us (almost) all together! For text processing, Unicode assigns a unique code point, a number, to each character, representing the character in a general computing format; the visual representation of it, i.e. font, shape, size, etc., is taken care of by different software. Unicode provides the meaning, the what-it-is (Unicode Technical Site, 2021).

Unicode uses different types of character encodings, the Unicode Transformation Formats (UTF), "an algorithm mapping from every Unicode code point […] to a unique byte sequence" (the ISO/IEC 10646 standard uses the term "UCS transformation format"). The most commonly used are UTF-8, UTF-16, and UTF-32. UTF-8 is the byte-oriented encoding form and the dominant encoding used on the World Wide Web; its first 128 characters are the ASCII (American Standard Code for Information Interchange) characters, which means ASCII text is also valid UTF-8 (Unicode Technical Site, 2021; Wikipedia, 2021).
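
A short Python sketch comparing the three encoding forms on a made-up string; note how plain ASCII letters stay one byte each in UTF-8, while the emoji needs more:

    text = "Hi 🙂"

    for encoding in ("utf-8", "utf-16", "utf-32"):
        encoded = text.encode(encoding)
        print(encoding, len(encoded), encoded)

    print("A".encode("utf-8") == "A".encode("ascii"))   # True: UTF-8 is ASCII-compatible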

 


Since emojis have to have bytecode definitions to be interpreted in a software context, all emojis must have Unicode byte definitions in order to work across all devices, software, and graphic rendering. Updates with new emojis will not be consistent from one device or system to the next; for example, sometimes the typical red emoji heart ❤️ (note the irony that this heart emoji looks different on my iOS system than on WordPress) shows up as a smaller dark black heart, or a box with a "?" appears in the emoji's place if you haven't updated the version. Is this due to non-updated bytecode definitions? Or is it because the software/system didn't follow the ISO/IEC standards? Is this the reason each company/program/software has its own "look"/style for each emoji, because that is how it translates the Unicode CLDR data? Does the same apply to unreadable fonts, as mentioned in the readings, with the problem that arises with Unicode?

I’d like to further look into the connection between Unicode and software stack design. How do they connect to each other, and how does one symbol go through the "journey" from Unicode to adopting whatever font, size, and color it is given?

 

References 

Irvine, Martin. (2021). “Introduction to Data Concepts and Database Systems.”

Crash Course Computer Science 

John D. Kelleher and Brendan Tierney, Data Science (Cambridge, Massachusetts: The MIT Press, 2018). 

Unicode Wikipedia 

The Unicode Standard 

Emoji – Unicode 

Unicode is Awesome – Wisdom 

 

Data – It's All Starting to Make Sense!!!

I think something just clicked! I am slowly starting to make sense of binary and how it is converted into the symbols we see on our screens. First, I need to define data, which is always something with humanly imposed structure, that is, an interpretable unit of some kind understood as an instance of a general type. Data is inseparable from the concept of representation. This representation must be universal so that devices can communicate with other devices; enter Unicode. Unicode is literally just that, a universal code that gives a string of binary digits for each symbol, number, letter, etc., and whose current standard has room for 1,114,112 code points. The first 128 symbols come from ASCII, which uses a 7-bit structure and could only represent 128 symbols. There is an extended ASCII scale that uses an 8-bit structure, but there are so many more symbols that need associated binary digits, so Unicode developed a world system that most commonly uses UTF-8 or UTF-16 but can go up to UTF-32. So, to understand this better, here is an example:

It is easy to use binary digits to represent numbers with the whole 64, 32, 16, 8, 4, 2, 1 (7-bit/ASCII) sequence, in which 82 is represented as 1010010. But to represent letters or symbols, there needs to be a universal agreement on which binary digits represent what. So each letter or symbol is given a number, such that "A" means 65, which means 1000001.
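
A quick Python sketch of that example, using built-in functions:

    print(bin(82))        # 0b1010010  -- the number 82 in binary
    print(ord("A"))       # 65         -- the code agreed on for capital A
    print(bin(ord("A")))  # 0b1000001  -- the bits actually stored for "A"
    print(chr(65))        # A          -- and back again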

Now if every computer uses this same method of symbology (a.k.a. bytecode definitions), then they can communicate with each other, which is why a universal standard is so important. The next question is how you get a computer to create the symbol "A." I understand that the binary digits 1000001 = A, but how does A pop up on my screen? How is it converted from binary digits into the letter, i.e. rendered? Professor Irvine mentioned it in his intro: it seems like software interprets it and then displays the text on a screen, so maybe it's next week's lesson?

This is just for text, though. Understanding photos is not that different, which is wild! After reading How Digital Photography Works, I no longer need to hire a professional photographer to take photos for me; I know how to alter pictures! Joking, but the basics are there and I've de-black-boxed it! In simplest terms, colors are composed of 256 shades each of red, blue, and green. So, to alter a picture's colors, one just needs to change the numbers associated with those colors, black being 0 for all three (the absence of color) and white being 255 for all three. To get from an image that I see into something digital, though, it goes through some cool science that, if I did not know better, would be a form of magic. The down and dirty: after light passes through a camera's lens, diaphragm, and open shutter, it hits millions of tiny micro lenses that capture the light and direct it properly. The light then goes through a hot mirror that lets visible light pass and reflects invisible infrared light that would distort images. Then it goes through a layer that measures the colors captured; the usual design is the Bayer array, which separates green, red, and blue filters so that no two of the same color touch, with double the number of greens. Finally, it strikes the photodiodes, which measure the intensity of the light: the light first hits the silicon at the "P-layer," which transforms the light's energy into electrons, creating a negative charge. This charge is drawn into the diode's depletion area because of the electric field the negative charge creates with the "N-layer's" positive charge. Each photodiode collects photons of light as long as the shutter is open; the brighter a part of the photo is, the more photons have hit that section. Once the shutter closes, the pixels hold electrical charges proportional to the amount of light received. Then it can go through one of two different processes, either CCD (charge-coupled device) or CMOS (complementary metal-oxide semiconductor). In either process, the pixels go through an amplifier that converts this faint static electricity into a voltage in proportion to the size of each charge. A digital camera literally converts light into electricity! MAGIC; joking, it's science once you understand it! My question is then: how does the computer recognize the binary digits associated with the electric current? More precisely, where in this process does the electric current become a recognizable number on the 256-value red, green, blue scale?
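
On that last question, here is a hedged Python sketch of the general idea of an analog-to-digital converter quantizing a voltage onto the 0-255 scale; the voltages and the full-scale value are made up, and real sensors are more complicated:

    def adc_8bit(voltage, full_scale=1.0):
        """Quantize a voltage between 0 and full_scale into an integer 0-255."""
        voltage = max(0.0, min(voltage, full_scale))
        return round(voltage / full_scale * 255)

    for v in (0.0, 0.25, 0.5, 1.0):
        print(v, "->", adc_8bit(v))   # roughly 0, 64, 128, 255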

Now that I understand the different types of data, how do we access and store it? The Crash Course videos, after watching them a couple of times, provided the answers! Data is structured to make it more accessible. It is stored on hard disk drives and solid-state drives, the result of years of research on storing data that originated with paper punch cards (wild). From my understanding, hard disk drives sit in a memory hierarchy alongside RAM, which has the lowest "seek time" (the time it takes to find the data), and the hierarchy is what manages where data is kept. Solid-state drives (SSDs) are nonvolatile integrated circuits that contain no moving parts, though they are still not as fast as RAM. I am not sure if this is correct, though, because I am more familiar with hearing the term hard drive rather than hard disk drive, and I associate disk technology with old computers. So my question is: what type of memory storage do most modern computers use, or do they use both? Anyway, after understanding where data is located, the next step is understanding how it is organized in that storage system. Data is stored in file formats like JPEG, TXT, WAV, BMP, etc., which are stored back-to-back in a file system. A directory file, or root file, is kept at the front of storage (location 0) and lists the names of all the other files to help identify them. Modern file systems store files in blocks with slack space so that a user can add more data to a file; if a file exceeds its slack space, the system allocates another block. The resulting fragmentation of data is handled by a defragmentation process that reorders data to make it easier to access and retrieve.
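
As a toy sketch (not how any real file system is implemented) of the directory-and-blocks idea, with made-up file names and contents:

    BLOCK_SIZE = 4           # bytes per block, tiny on purpose

    storage = bytearray(40)  # pretend disk
    directory = {            # the "directory file" kept at a known location
        "hi.txt":  {"start_block": 0, "length": 5},   # spills into a 2nd block, leaving slack
        "num.txt": {"start_block": 2, "length": 3},
    }

    def read_file(name):
        entry = directory[name]
        start = entry["start_block"] * BLOCK_SIZE
        return bytes(storage[start:start + entry["length"]])

    storage[0:5] = b"hello"
    storage[8:11] = b"123"
    print(read_file("hi.txt"), read_file("num.txt"))   # b'hello' b'123'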

 

References:

CrashCourse. 2017a. Data Structures: Crash Course Computer Science #14. https://www.youtube.com/watch?v=DuDz6B4cqVc&list=PL8dPuuaLjXtNlUrzyH5r6jN9ulIgZBpdo&index=15.

———. 2017b. Memory & Storage: Crash Course Computer Science #19. https://www.youtube.com/watch?v=TQCr9RV7twk&list=PL8dPuuaLjXtNlUrzyH5r6jN9ulIgZBpdo&index=20.

———. 2017c. Files & File Systems: Crash Course Computer Science #20. https://www.youtube.com/watch?v=KN8YgJnShPM&list=PL8dPuuaLjXtNlUrzyH5r6jN9ulIgZBpdo&index=21.

“FAQ – UTF-8, UTF-16, UTF-32 & BOM.” n.d. Accessed February 21, 2021. https://unicode.org/faq/utf_bom.html.

Martin Irvine. 2020. Irvine 505 Keywords Computation. https://www.youtube.com/watch?v=AAK0Bb13LdU&feature=youtu.be.

The Tech Train. 2017. Understanding ASCII and Unicode (GCSE). https://www.youtube.com/watch?v=5aJKKgSEUnY.

“White-Downs-How Digital Photography Works-2nd-Ed-2007-Excerpts-2.Pdf.” n.d. Google Docs. Accessed February 21, 2021. https://drive.google.com/file/d/1Bt5r1pILikG8eohwF1ZnQuv5eNL9j8Tv/view?usp=sharing&usp=embed_facebook.

Digital Data: Encoding Digital Text Data- Chirin Dirani

It has always fascinated me to watch my Japanese colleagues write their monthly reports in their language and use the same computer to send us emails in English. In fact, the first question I had when we started deblackboxing the computing system in this course was how I use the same device, and the same system, to send English texts to my English-speaking friends and Arabic texts to my Arabic-speaking friends. In the third week, we were introduced to the binary model used by the computing system "in its logic circuits and data." It was easy to understand the representation of decimal numbers in a binary system, but not of letters or whole texts. The readings for this week deblackbox another layer of the computing system and differentiate between two methods of digital encoding: digital text data and digital image data. For this week's assignment, I will try to reflect my understanding of how digital text data (for natural language processing) is encoded so it can be interpreted by any software.

Before diving into digital text data encoding, I will start by defining data. Professor Irvine's reading for this week defines "data as something with humanly imposed structure, that is, an interpretable unit of some kind understood as an instance of a general type." By interpretable, we mean that data is something that can be named, classified, sorted, and given logical predicates or labels. It is also important to mention that without representation (computable structures representing types) there is no data. In this context, computable structures mean "byte sequences capable of being assigned to digital memory and interpreted by whatever software layer or process corresponds to the type of representation," text characters in our case.

The story of digital text data encoding starts with ASCII (American Standard Code for Information Interchange), shown in Table A. Bob Bemer developed the ASCII coding model to standardize the way computing systems represent letters, numbers, punctuation marks, and some control codes. In Table A below, you can see that every modern English letter (lowercase and capital), punctuation mark, and control code has its equivalent in the binary system. The seven-bit binary system used by ASCII represented only 128 English letters and symbols. While the bit patterns of the printable ASCII characters are sufficient to exchange information in modern English, most other languages need additional symbols that are not covered by ASCII.

 Table A

Extended ASCII sought to remedy this problem by utilizing the eighth bit in an 8-bit byte to allow positions for another 128 printable characters. Early encodings were limited to 7 bits because of restrictions in some data transmission protocols, and partially for historical reasons. At this stage, extended ASCII was able to represent 256 characters, as you can see in Table B. However, as we read in Yajing Hu's final project essay, Han characters, other Asian language families, and many more international characters were needed than could fit in a single 8-bit character encoding. For that reason, Unicode was created to solve this problem.

Table B


Unicode is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. It is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. Depending on the encoding form we choose (UTF-8, UTF-16, or UTF-32), each character is then represented either as a sequence of one to four 8-bit bytes, as one or two 16-bit code units, or as a single 32-bit code unit. UTF-8 is most common on the web; UTF-16 is used by Java and Windows; UTF-8 and UTF-32 are used by Linux and various Unix systems. The conversions between all UTFs are algorithmically based, fast, and lossless. This makes it easy to support data input or output in multiple formats while using a particular UTF for internal storage or processing.
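
A brief Python sketch of that lossless-conversion point, using a made-up bilingual string; re-encoding between UTF forms always recovers the same characters:

    text = "سلام Hello"

    utf16_bytes = text.encode("utf-16")
    round_trip = utf16_bytes.decode("utf-16").encode("utf-8").decode("utf-8")

    print(round_trip == text)            # True: nothing was lost in conversion
    print(len(text.encode("utf-8")), len(text.encode("utf-16")), len(text.encode("utf-32")))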

The important question now is how software can interpret and format text characters of any language. To answer this question, I'll go back to Professor Irvine's definition of data "as something with humanly imposed structure, that is, an interpretable unit of some kind understood as an instance of a general type." My takeaway here is that the only way for software to process text data is to represent characters as bytecode definitions. These bytecode definitions work independently of any software designed to use them. With that said, and in conclusion, Unicode uses a binary system (bytecode characters) designed to be interpreted as a data type for creating instances of characters as inputs and outputs through any software.

References

  1. Peter J. Denning and Craig H. Martell, Great Principles of Computing (Cambridge: The MIT Press, 2015), p. 35.
  2. Prof. Irvine, “Introduction to Data Concepts and Database Systems.”
  3. ASCII Table and Description.
  4. ASCII.
  5. Han Ideographs in the Unicode Standard, (CCT).
  6. ISO/IEC 8859. 

Questions

Kelleher reading: The constraints in data projects relate to which attributes to gather and which attributes are most relevant to the problem we are solving. Who decides which data attributes to choose? Can we apply the principle of levels to data attributes?