Category Archives: Week 6

Unicode and Emoji, Bitmoji Systems

Annaliese Blank

For this week, I wanted to take more time to unpack these readings and really understand “data” in all of its capacity. Apologies on the late post. I needed more time to digest this.

To start off, I began with the Irvine reading. My main goal was to address the question: what is data? We are so lucky to call him our professor because he specifically laid out what it means to define data and how it’s constructed in various terms.

A question that came to mind was whether big data only stores unstructured data, or a combination of both structured and unstructured data. I thought of this while considering cloud software, which is a very organized space that functions on unstructured data, since it can store pretty much anything. But inside the cloud, it’s easy to find whatever you need, since everything is stored properly based on what it’s made of and how it’s labeled. I’m curious to know whether the cloud operates this way, using both structured and unstructured data in order to store it for you virtually. This idea came to me when I was thinking about the representation aspect of this topic.

A great way to think about data and define it would be through Unicode. Ever since items like the emoji or bitmoji were released I’ve always wondered how they operate. These are forms of communication that we all use in our day to day messages without really realizing what we’re sending. He says, “what we see on our screens is a software ‘projection’ of the bytecode, interpreted in whatever selected font style for ‘rendering’ the pixel patterns on screens” (Irvine, pg. 4). When I also think about these relational databases that he mentions later, I think of Excel spreadsheets, where different entries can be labeled and organized into specific tables. I was still a bit confused on the difference between this and NoSQL in terms of “container structures”.

Another way that I can think about this data and its meaning would be the section where Unicode emoji is mentioned. It says, “Emoji’s are pictographs, pictorial symbols, that are typically presented in a colorful form and used inline in text.” (Unicode Emoji, pg.1). I then looked at the Unicode Emoji Data Files. These contain several documents that explain and display the data and codes that produce and send the emoji we see on our phones.

After reading through these, my own synopsis is that data is based on structured or unstructured inputs, categorical or numerical, that are designed to be sent for collection or presentation. This happens through bits, bytes, and forms of Unicode that let us see something easily without knowing how it actually got there or what its entity truly is in code form.

Another interesting thing I wanted to look at was Amazon’s RDS. According to Amazon, it stores data through fast cloud database engines such as Aurora, MariaDB, Oracle, and SQL Server. It focuses on management, security, fixes, and global access to other databases instantaneously. Its high-speed functionality allows it to store and improve consumer, company, and product data faster on its own, leaving little work for customers to worry about. This to me is a bit extreme, considering how we all use Amazon but don’t know the background on what we’re really buying and how that data is stored and used by the company through RDS. I think this type of data management should be more exposed to the public eye so that better data protection can be enforced.

An outside source to help me make further connections was this YouTube tutorial that addresses ASCII and Unicode. It reviews the binary codes and how letters turn into binary numbers. This process helps us understand how to code or decode a letter or phrase on our screen. This is similar to the emoji. Each emoji has its own code and this video does a great job of explaining how the process works that makes it all happen and visible on our screens!
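The letter-to-binary step described above can be sketched in a few lines of Python. This is only an illustration; the helper name `to_binary` is made up for this example, and UTF-8 is assumed as the encoding (for plain English letters its bytes match ASCII).

```python
def to_binary(text):
    """Return the UTF-8 bytes of `text` as 8-bit binary strings."""
    return [format(byte, "08b") for byte in text.encode("utf-8")]

# A plain letter fits in one byte, exactly as in ASCII.
letter_bits = to_binary("A")

# An emoji has its own code point too, but needs several bytes.
emoji_code_point = hex(ord("😀"))
emoji_bits = to_binary("😀")

print(letter_bits)
print(emoji_code_point, emoji_bits)
```

The same function works for any character, which is the point of the comparison with emoji: a letter and an emoji are both just numbers that get written out as bits.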

I look forward to next week to further this and my ideas more.

 

Irvine, “Introduction to Data Concepts and Database Systems.”

Unicode Emoji. (n.d.). Retrieved February 20, 2019, from https://www.unicode.org/emoji/

https://aws.amazon.com/rds/?nc2=h_m1 (video)

http://mentalfloss.com/article/66338/how-are-emoji-made (article)

https://www.youtube.com/watch?v=5aJKKgSEUnY (extra video)

Internet Design and Google Search Bar

Today we use the word “data” in many different ways, but in the context of computing and information it is tied to structure, as an instance of a general type. Data is something that can be named, classified, sorted, and analyzed. However we use the term, we have to keep in mind that it is inseparable from the concept of representation. As Irvine suggests, by “representation” we understand a computable structure, usually of “tokens” (instances of something representable) corresponding to “types” (categories or classes of representation, roughly corresponding to a symbolic class like text character, text string, number type, matrix/array of number values, etc.). Representation in information and computing contexts means defined byte sequences capable of being assigned to digital memory and interpreted by whatever software layer or process corresponds to the type of representation: text character(s), types of numbers, binary arrays, etc. Any form of data representation, then, is (must be) computable; anything computable must be represented as a type of data. This is the essential precondition for anything to be “data” in a computing and digital information context.

Computer systems, software, algorithms, Internet and Web protocols, and all forms of data structures are intentionally and necessarily designed in levels that must “communicate” in an overall system design (termed the “architecture” of the system). The Internet and Web are not only network designs for transmitting digital data in standard formats from one address point (node) to another. They also form a networked computer system composed of many software, hardware, and networked computing processes on massively distributed servers that are precisely managed in levels or layers:

[Images by Dr. Martin Irvine and by Alan Simpson]

While reading about the web, I started thinking about the Google search bar: how much data is stored and retrieved when we search for something, and how the search bar returns its responses. In an article about Larry Page and Sergey Brin, the founders of Google, the company explains that they built a search engine that used links to determine the importance of individual pages on the World Wide Web. This engine was first called “Backrub.” Soon after, it was renamed Google. But how does the search bar work? Since we cannot physically see the process that happens behind this web page, we have to de-blackbox it and look at the “hidden” layers.

Google uses a special algorithm to generate search results. It also uses automated programs called spiders or crawlers (which scan Web pages and create indexes of keywords and links from each page to other sites), and it keeps a large index of keywords and where those words can be found. The most important part of the process is the ranking of the results, which determines the order in which Google displays them. Google uses a trademarked algorithm called PageRank, which assigns each page a score based on factors like the frequency and location of keywords within the Web page, how long the Web page has existed, and the number of other Web pages that link to the page in question. So, if you want your Web page to be higher in the search results, then you need to provide good content so that other people will link back to your page; the more links your page gets, the higher its PageRank score will be.
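To make the link-counting idea more concrete, here is a minimal sketch of the link-based part of a PageRank-style score. It is a simplified textbook version, not Google’s production algorithm: it ignores keywords and page age entirely, and the page names, the `pagerank` function, and the damping value of 0.85 are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# A made-up four-page web: three pages link to "home", nobody links to "orphan".
toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy web, the page that receives the most links ends up on top and the page nobody links to ends up last, which is the behavior the paragraph above describes.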

I found this interesting video from Code.org, presented by John, Google’s Chief of Search and AI, and Akshaya, from Microsoft Bing. They cover how search works, from the special programs called “spiders” that scan the Internet before you even type in your search terms to what determines which search results show up first.

References:

Berners-Lee, Tim. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. New York, NY: Harper Business, 2000. Excerpts.

Michael Buckland, Information and Society (Cambridge, MA: MIT Press, 2017)

“How We Started and Where We Are Today.” Google. Google, n.d. Web. 29 Nov. 2017. <https://www.google.com/intl/en/about/our-story/>.

Irvine, “Using the Model of Levels to Understand ‘Information,’ ‘Data,’ and ‘Meaning’”

The Internet: How search works, found at https://www.youtube.com/watch?v=LVV_93mBfSU

Can You Read This? Thank a Data Scientist!

Daniel Keys Moran, an American computer programmer and science fiction writer, once said, “You can have data without information, but you cannot have information without data.” This seems like a fairly straightforward way of distinguishing between data and information, right? Data is everywhere; artificially intelligent machine learning software is embedded in nearly all of our technological devices, monitoring and recording our every digital move, to the point where almost every aspect of our daily activity (even sometimes when we’re offline!) is quantified and turned into data. In the digital realm, information derives from the context that gives meaning to this ever-increasing stockpile of data.

As all of this digital information continues to grow and becomes more diverse and complex, the need arises to better classify and categorize all that data. Dr. Irvine, a professor of Communication, Culture & Technology at Georgetown University, explains the differences between different types of datasets by splitting them into subgroups. He begins by writing, “Any form of data representation is (must be) computable; anything computable must be represented as a type of data. This is the essential precondition for anything to be “data” in a computing and digital information context” (Irvine, 2019, p. 2). According to Irvine (2019), “data” can be seen as:

  • Classified, named, or categorized knowledge representations (tables, charts, graphs, directories, schedules, etc., with or without a software and computational representation)
  • Information structures (represented in units in bits/bytes, such as internet packets)
  • Types of computable structures (text characters & strings, types of numbers, emojis, etc., with standard byte code representations)
  • Structured vs. unstructured data
    • Structured – database categorized and labeled
    • Unstructured – data transmitted in email, texts, social media, etc. that is stored in data services (like “the cloud”)
  • Representable logical and conceptual structures: an ‘object’ with a class/category and various attributes or properties assigned and understood
  • ‘Objects’ in databases, as units of knowledge representation (such as all items in an Amazon category, or the full list of different movies directed by Quentin Tarantino in IMDb)
  • Decomposed into values and distributions in ML nodal algorithms, such as data points in a graph

One subgroup of data I found to be particularly interesting was the types of computable structures (data types such as text characters and strings), including the International Unicode Standard. As Irvine (2019) writes, “Unicode is the data ‘glue’ for representing the written characters (data type: “string”) of any language by specifying a code range for a language family and standard bytecode definitions for each character in the language” (p. 3). This includes all the letters, numbers, symbols, and accents of different languages, as well as special characters from math and science and even emojis! Each of these small representations of language and expression has its own specific set of bytes that must be rendered before we can make meaning out of it.
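A short Python sketch can show what “a code range for a language family” looks like in practice. The three letters below were picked for this example because they look nearly identical on screen; the variable names are illustrative only.

```python
import unicodedata

# Three look-alike capital letters from three different scripts.
letters = ["K", "\u039a", "\u041a"]  # Latin K, Greek Kappa, Cyrillic Ka

table = []
for ch in letters:
    table.append((
        f"U+{ord(ch):04X}",        # the character's code point
        unicodedata.name(ch),      # its official Unicode name
        ch.encode("utf-8").hex(),  # the bytes that are actually stored
    ))

for row in table:
    print(row)
```

Each script sits in its own range of code points, so three shapes that a reader cannot tell apart are still three different pieces of data to the computer.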

According to the Unicode website, there are over 1700 different emojis for a modern digital keyboard (when taking into account the different skin tone variations of each), and each is represented slightly differently across platforms like Google, Facebook, and Twitter. That’s a LOT of data, packaged as information, and stuffed into our sleek and organized emoji libraries before projecting “character shapes to pixel patterns on the specific screens of devices” (Irvine, 2019, p.4).
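The skin tone variations mentioned above are a nice example of how much data hides behind one small picture. As a rough sketch (variable names are my own), a toned emoji is stored as a base emoji followed by a separate modifier character:

```python
import unicodedata

thumbs_up = "\U0001F44D"    # 👍
medium_tone = "\U0001F3FD"  # a skin tone modifier character

toned = thumbs_up + medium_tone  # displayed as one emoji on most devices

code_points = [f"U+{ord(ch):04X}" for ch in toned]
modifier_name = unicodedata.name(medium_tone)
byte_count = len(toned.encode("utf-8"))

print(toned, code_points, modifier_name, byte_count)
```

So what looks like a single character on the keyboard is two code points and eight bytes once it is encoded as UTF-8.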

As you can see, we rely on databases to store, categorize, retrieve, and render information for us in a myriad of ways on a daily basis. From sending emails, to shopping on Amazon, to choosing a show on Netflix, to checking the statistics of your favorite athlete or team, to simply using a smartphone app, databases are always at work, collecting, distributing, and computing information at a clip that’s extremely hard to fathom for the average person.

However, while these databases may seem “artificially intelligent” and autonomous (and many are equipped with machine learning AI algorithms to expedite their processes), they still must be designed, created, coded, managed, and maintained by human computer scientists. In their book Data Science, Kelleher and Tierney (2018) confirm that the total autonomy of these complex databases is a popular myth, saying, “In reality, data science requires skilled human oversight throughout the different stages of the process. Human analysts are needed to frame the problem, to design and prepare the data, to select which ML algorithms are most appropriate, to critically interpret the results of the analysis, and to plan the appropriate action to take based on the insight(s) the analysis has revealed” (p. 34).

So take a moment to appreciate the impressive work that data scientists do, even if most of it is behind the scenes (or behind the screens, if you will). We owe a lot of our digital luxuries to their difficult, meticulous jobs. And for that, I say 👍👏😁.

 

References
Irvine, M. (2019). Distinguishing Kinds and Uses of “Data” in Computing and AI Applications. Retrieved from https://drive.google.com/open?id=1C0zQ9md4WG5VswVdBOCkyw28L39HGZXv
Kelleher, J. D., & Tierney, B. (2018). Data science. Cambridge, Massachusetts: The MIT Press.
Moran, D. K. (n.d.). BrainyQuote. Retrieved from https://www.brainyquote.com/quotes/daniel_keys_moran_230911?src=t_data
The Unicode Consortium. https://unicode.org/

Special Characters in DBMS

Sometimes in conversations with friends or family, words cannot express the message I am trying to convey. When this occurs, my first choice is to find an emoji, and my second choice is to use a gif. Lately, I have found that it is easier and quicker to find the gif I am looking for when I use an emoji as the search term in my gif app. Searching for a gif using an emoji is an entirely different process than image search or even facial recognition, because those two search methods rely on pattern recognition, whereas Unicode search does not. This process is more efficient for users because emojis express reactions or feelings more concisely. Often when searching for a gif reaction, users are looking to quickly find a culturally relevant match to their feelings during a conversation. When users need to painstakingly search for the right gif, the impact of the gif is lessened by the time it took to find it.

This process relies on the defined meanings of emojis, which are controlled by the Unicode Consortium. The Unicode Consortium organizes and approves the standard bytecode for emoji (Irvine, 2019). The designed definitions of emojis are what keep them relatively standard across operating systems and devices. Even though the shape or color of an emoji may be slightly different on Facebook versus the iPhone, the expression designed into the emoji is the same. This principle is similar to why there are thousands of fonts available on the internet, yet a change in font does not change the actual characters; the majority of computer fonts use Unicode mapping (Wikipedia: Unicode Font).

The DBMSs that host gifs tend to return hundreds of videos when a user enters one particular search term. Emojis may return more accurate results as search terms for gifs because there is more information designed into an emoji than into a word. Words are highly contextual, but emojis have fewer parameters in which they are used, making them a better option for a quick gif search. Using a particular “happy” emoji is more specific than just searching for a “happy” gif. The word search could return a smiling happy gif, a laughing gif, or a happy crying gif, any of which may be an accurate result for the search but may not be close to what the user is looking for. This question of using emojis as search terms in a DBMS raises other questions about how special characters can affect the use of a DBMS.
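One way to picture why an emoji can act as a precise search term is that every emoji carries a standardized name. The sketch below turns an emoji into a text query using those names. The `emoji_to_query` helper is hypothetical; I do not know how any real gif app does this, and a real service would presumably rely on its own tagging.

```python
import unicodedata

def emoji_to_query(emoji):
    """Build a lowercase search phrase from the Unicode name of each character."""
    names = [unicodedata.name(ch, "").lower() for ch in emoji]
    return " ".join(name for name in names if name)

# Three different "happy" emojis map to three different, specific phrases.
queries = {emoji: emoji_to_query(emoji) for emoji in ["😀", "😂", "😊"]}

for emoji, query in queries.items():
    print(emoji, "->", query)
```

The word “happy” appears in none of these names, yet each one describes a distinct kind of happy reaction, which matches the intuition that the emoji is the more specific search term.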

Irvine, “Introduction to Data Concepts and Database Systems.”

John D. Kelleher and Brendan Tierney, Data Science (Cambridge, Massachusetts: The MIT Press, 2018).

Unicode as an Illustration of the Meaninglessness of Raw Data

One of the best illustrations of what data means in terms of computation is the Unicode Consortium’s code. While one might think of a string of characters as data, in computation this conceptualization of data is already too abstract. The idea of the letter “K” has no inherent meaning, but it can be represented by data, a concept here meaning a string of bits and bytes that have no intrinsic meaning. The concept of a “K” can be encoded in many different ways. Before Unicode, if I were programming a computer, I could decide that any combination of bits and bytes would encode a representation of a “K.”

The problem with this Wild West approach to encoding is that, without a shared structure for how data should be encoded to represent certain text, computers interacting with each other might decode the same data to mean two separate things. For example, one computer could encode 01001010 as “K,” but another computer might have decided that the data 01001010 means “🦵🏼,” which could lead to some interesting mixups when the computers send data to each other to be interpreted. It’s a bit uncomfortable to think that the data and the concept it stores are different things, but that’s the beauty of a general purpose computer storing data.
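This kind of clash is easy to reproduce with two real pre-Unicode encodings. The sketch below uses the byte that ASCII actually assigns to “K” (01001011) and reads it a second time with an IBM EBCDIC code page (`cp037` in Python), standing in for the “other computer.” The choice of those two encodings is my own assumption for the sake of the example.

```python
raw = bytes([0b01001011])  # one byte, no meaning attached yet

as_ascii = raw.decode("ascii")   # first computer's convention
as_ebcdic = raw.decode("cp037")  # second computer's convention

# The second computer stores the letter K as a completely different byte.
k_in_ebcdic = "K".encode("cp037")

print(repr(as_ascii), repr(as_ebcdic), k_in_ebcdic.hex())
```

The bits never change; only the agreed-upon table for interpreting them does, which is exactly the gap a shared standard is meant to close.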

Enter Unicode. Instead of different programs and layers using different bit representations to encode different character values that might get jumbled or lost in translation, Unicode converged to assign consistent values and identities to fixed byte codes. Unicode includes different language symbols and emoji. All together, Unicode currently codes 137,439 different characters.

[Image via BuzzFeed News]

Thus, Unicode represents a microcosm of the challenges and solutions found in data storage. Concepts that are familiar to humans through semiotic knowledge, such as the number 4 or the letter “K,” can be encoded with different data combinations because data inherently has no set meaning. Such a situation becomes confusing when different encodings clash. Thankfully, we have the Unicode Consortium to convene and determine a universal code of character representation. Now, if only they could make the emoji identical across platforms, all issues of encoding and decoding meaning digitally could be solved. 🤪

References:

Irvine, “Introduction to Data Concepts and Database Systems.”

Tasker, P. (2018, July 17). How Unicode Works: What every developer needs to know about strings and 🦄. Retrieved February 20, 2019, from https://deliciousbrains.com/how-unicode-works/
Unicode and You – BetterExplained. (n.d.). Retrieved February 20, 2019, from https://betterexplained.com/articles/unicode/
Unicode Emoji. (n.d.). Retrieved February 20, 2019, from https://www.unicode.org/emoji/

Cloud Database

Big data is often defined in terms of the three Vs: the extreme volume of data, the variety of the data types, and the velocity at which the data must be processed (Kelleher & Tierney, 2018). Big data can be very valuable. If we use big data in the right way, it can reveal predictive patterns that help us make better decisions and strategies. The key to success is getting the right data and finding the right attributes (Kelleher & Tierney, 2018).

Because of these traits of big data, it is difficult for both individuals and organizations to keep and process all their data on in-house computer servers. Therefore, we need a stronger data management system to store and process data: the cloud database.

A cloud database is a collection of content, either structured or unstructured, that resides on a private, public, or hybrid cloud computing infrastructure platform. Examples of cloud databases include Amazon Relational Database Service (RDS) and Microsoft Azure SQL Database. Cloud computing is in fact commonplace in our everyday lives: most people use cloud computing applications such as Gmail, Google Drive, and even Facebook and Instagram without realizing it.

Cloud databases can be divided into two broad categories: relational and non-relational. A relational database, typically queried with Structured Query Language (SQL), is composed of a set of interrelated tables that are organized into rows and columns. Non-relational databases, sometimes called NoSQL, do not employ a table model. Instead, they store content, regardless of its structure, as a single document, an approach that is often used for social media.
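The difference between the two categories can be sketched with Python’s built-in tools. Here an in-memory SQLite database stands in for the relational side and a JSON string stands in for a stored document. The customer and order data are invented, and a real NoSQL product would of course offer far more than `json.dumps`.

```python
import json
import sqlite3

# Relational: the same customer is split across two interrelated tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), item TEXT)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, "keyboard"), (2, 1, "monitor")],
)
rows = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Document style: everything about the customer travels as one nested record.
document = {
    "id": 1,
    "name": "Ada",
    "orders": [{"item": "keyboard"}, {"item": "monitor"}],
}
stored = json.dumps(document)
items_from_doc = [order["item"] for order in json.loads(stored)["orders"]]

print(rows)
print(items_from_doc)
```

Both versions answer the same question (what did Ada order?), but the relational one needs a join across tables while the document one reads a single self-contained record.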

For example, I once helped a company manage their CRM database. It is a kind of relational cloud database. I can access customer information via cloud-based CRM software from my computer or while traveling, and can quickly share that information with other authorized parties anywhere and anytime.

The video below shows how one cloud relational database, Amazon RDS, works:

John D. Kelleher and Brendan Tierney, Data Science (Cambridge, Massachusetts: The MIT Press, 2018).

Michael Buckland, Information and Society (Cambridge, MA: MIT Press, 2017).

Data Science and Amazon Recommendation

Data science focuses on improving decision making through the analysis of data. After many kinds of data have been collected in large amounts, different types of patterns can be analyzed and extracted. These patterns help us identify groups of customers exhibiting similar behaviors and tastes, which supports customer segmentation in business. AI helps when we have a large number of data examples and when data patterns are too complex for humans to discover and extract manually (Kelleher & Tierney, 2018). One important use of data science is in sales and marketing, namely the recommendation system.

Judging by Amazon’s success in the market, artificial intelligence is playing an increasingly important role in Amazon’s competitive advantage. Two of its best applications of artificial intelligence are on-site and off-site product recommendations.

Amazon’s recommendation system is based on unsupervised learning, whose aim is to find the regularities and patterns in the input to see what normally happens, and therefore to find clusters or groups of inputs that reflect structure embedded in the data (Alpaydin, 2016). Normally, people do not buy things randomly. Instead, there are certain association rules behind their behavior, and their purchases depend on a number of factors, such as demographic information. Amazon’s recommendation algorithm takes these data as input and groups them with artificial intelligence to predict existing customers’ later purchases and attract potential customers at the same time. Besides, there are certain hidden factors, and if AI can estimate those hidden factors for a customer, it can make more accurate estimates. This is all about data and the invisible patterns between input and output.

For the data mining behind Amazon’s recommendation system, we first need to figure out its input data and data sources, which include purchased shopping carts, items added to carts but abandoned, wish lists, dwell time, referral sites, customers’ demographic information, the number of times an item was viewed before the final purchase, click paths in a session, online pricing experiments, etc. The volume of these data is huge, but artificial intelligence can find the hidden factors and invisible patterns in them and generate the “Recommended for You, XXX” section on the website, which leads each customer to a page full of products recommended just for them. That is to say, artificial intelligence can create a personalized shopping experience for every customer.

Amazon’s recommendation system can also generate a “Frequently Bought Together” section, which is found below every product listing and suggests a combination of complementary products. The focus here is on cross-selling products to increase order size. This section is quite important because people need complementary products rather than conflicting ones, so the products recommended here should be chosen carefully; it is better to recommend a group of products that customers can buy as a bundle.

Besides, Amazon has a section named “Customers Who Bought This Item Also Bought,” which is similar to “Frequently Bought Together.” In this section, Amazon displays items that have been purchased together in the past to increase average order values through cross-selling.
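The idea of recommending items that were purchased together in the past can be sketched with simple counting. This is not Amazon’s actual algorithm, which is not public in this form; the orders, item names, and the `also_bought` function are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical past orders, each one a set of items bought together.
orders = [
    {"camera", "sd card", "tripod"},
    {"camera", "sd card"},
    {"camera", "camera bag"},
    {"sd card", "card reader"},
]

# Count how often each pair of items appears in the same order.
pair_counts = Counter()
for basket in orders:
    for first, second in combinations(sorted(basket), 2):
        pair_counts[(first, second)] += 1

def also_bought(item, top_n=3):
    """Return the items most often bought in the same order as `item`."""
    scores = Counter()
    for (first, second), count in pair_counts.items():
        if first == item:
            scores[second] += count
        elif second == item:
            scores[first] += count
    return [other for other, _ in scores.most_common(top_n)]

camera_recs = also_bought("camera")
print(camera_recs)
```

Even with four orders, the item that most often shares a basket with the camera rises to the top of the list, which is the cross-selling signal described above.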

Amazon also looks at the products customers have been browsing and uses data to work out why they viewed an item but did not buy it. Besides, Amazon might infer a customer’s psychological factors and recommend very similar products in different shapes, sizes, and brands to help them find products they might be interested in. This process also works like visiting a physical store offline, where customers can compare the same product across different brands and make a wiser purchase decision right after comparing.

References:

Alpaydin, E. (2016). Machine learning: the new AI. MIT Press.

Boden, M. A. (2016). AI: Its nature and future. Oxford University Press.

Justin, Y. (2017). 5 lessons you can learn from Amazon’s recommendation engine. Retrieved from http://altitudelabs.com/blog/amazon-product-recommendation-engine/

Daily experience with two data systems

Unicode Emoji

I believe most of you have sent emojis when chatting with friends to better express your feelings, but sometimes your friends receive emojis that look totally different from what you think they are seeing. For example, if you send a neutral face, its appearance will vary from platform to platform. As shown in the screenshot, the presentation of a neutral face is significantly different in the Android, Microsoft, Apple, and Samsung emoji systems. In some extreme cases, your friends see nothing but a string of unreadable characters like “😔”.
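The unreadable characters in that example are not random. As a small sketch (assuming the receiving software wrongly reads the message as Windows-1252 text instead of UTF-8), the same garbling can be reproduced and undone in Python:

```python
emoji = "\U0001F614"  # 😔 pensive face

# The sender's device stores the emoji as four UTF-8 bytes.
utf8_bytes = emoji.encode("utf-8")

# The receiver misreads those bytes, one character per byte.
garbled = utf8_bytes.decode("cp1252")

# Reversing the mistake recovers the original emoji.
repaired = garbled.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex(), garbled, repaired)
```

Nothing was lost in transit; the bytes arrived intact and were simply interpreted with the wrong table.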

How do emojis get lost in translation? Behind the emojis you see on your screens is the Unicode standard. It is a way of representing the written characters of any language by specifying a code range for a language family and standard bytecode definitions for each character in the language (Irvine, 2019). Unicode sets the basic emoji symbols that are available, then Apple, Google, Microsoft, and Samsung draw their own interpretation. That’s why a neutral face looks different on an Android phone than it does on an iPhone.

Database management system

A database system has four components: users, the database application, the database management system (DBMS), and the database. As an important part of a database system, the DBMS is a computer program used to create, process, and administer the database (Kroenke, 2017).

In a DBMS environment, there are three types of users: application programmers, database administrators, and end users. Application programmers write programs in various programming languages to interact with databases. Database administrators are responsible for managing the entire DBMS. End users interact with the DBMS by performing operations on the database such as retrieving, updating, and deleting data.

The DBMS is highly applicable in our daily life. For example, universities use a DBMS to manage student information, course registrations, colleges, and grades. A university employs a database application to keep track of these things, so that a staff member can easily retrieve and update student information from their computer using software. Application programs read or modify database data by sending SQL statements to the DBMS. The DBMS receives requests and translates those requests into actions on the database (Kroenke, 2017). Then the database, which stores a collection of related tables (like Student, Courses, Department, and Deposits), carries out those actions and sends back the data that the staff member needs.
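The request flow in the university example can be sketched with Python’s built-in `sqlite3` module, where the Python script plays the database application and SQLite plays the DBMS. The table layout, names, and records are hypothetical and much simpler than a real university system.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT, department TEXT);
CREATE TABLE Course (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Registration (student_id INTEGER, course_id INTEGER, grade TEXT);
INSERT INTO Student VALUES (1, 'Jordan Lee', 'CCT');
INSERT INTO Course VALUES (10, 'Database Concepts');
INSERT INTO Registration VALUES (1, 10, NULL);
""")

# A staff member records a grade: the application sends an UPDATE statement.
db.execute(
    "UPDATE Registration SET grade = ? WHERE student_id = ? AND course_id = ?",
    ("A", 1, 10),
)

# The staff member looks the student up: the application sends a SELECT,
# and the DBMS joins the related tables and returns the matching row.
record = db.execute(
    """
    SELECT s.name, c.title, r.grade
    FROM Registration r
    JOIN Student s ON s.id = r.student_id
    JOIN Course c ON c.id = r.course_id
    WHERE s.id = ?
    """,
    (1,),
).fetchone()

print(record)
```

The staff member never touches the tables directly; every action is a SQL statement that the DBMS turns into work on the stored data.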

References

Irvine, “Introduction to Data Concepts and Database Systems.”

John D. Kelleher and Brendan Tierney, Data Science (Cambridge, Massachusetts: The MIT Press, 2018).

David M. Kroenke et al., Database Concepts, 8th ed. (New York: Pearson, 2017). Excerpt.

What is DBMS? https://www.guru99.com/what-is-dbms.html