Unicode is a coding table that assigns a unique code to each written character of every language. But these codes cannot be stored or displayed as characters unless an encoding method bridges the computer and Unicode. Black squares and gibberish in text files are a very common problem with Chinese or Japanese text. For example, when I download a Japanese video game, it usually comes with an introduction and a description of the game in a txt file. But in most cases, the txt file, which should be in Japanese, shows black squares and meaningless gibberish. That is because my system does not have the specific transformation (the decoder or font) for Japanese, or because my default encoding method does not match the one the txt file was saved in.
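The gibberish problem can be reproduced in a few lines. This is only a minimal sketch under my own assumptions: the sample text, the choice of Shift_JIS as the encoding the file was saved in, and Windows-1252 as the wrong default are all illustrative, not taken from a real game file.

```python
# Hypothetical sample text standing in for the game's description file.
text = "ゲームの説明"

# Assume the author saved the file in Shift_JIS, a common Japanese encoding.
raw_bytes = text.encode("shift_jis")

# Reading the same bytes with a mismatched default encoding yields gibberish.
# errors="replace" substitutes a placeholder for bytes the codec cannot map.
wrong = raw_bytes.decode("cp1252", errors="replace")

# Reading the bytes with the encoding they were written in restores the text.
right = raw_bytes.decode("shift_jis")

print(wrong)
print(right)
```

The bytes on disk never change; only the rule used to interpret them does, which is why choosing the right encoding in a text editor usually fixes the file.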
The Unicode system can be described in levels:
First, the application level. The data at this level is character shapes rendered as pixel patterns on the digital screen. The data we input with peripherals like keyboards is translated through the Unicode system and finally shown on the screen as strings.
Second, the logic and language level. The data is the characters and their corresponding codes. Unicode is a bridge that links characters and binary codes: each character has its own code. Text entered into a computer program is translated into its codes, and those codes are then translated again into the computer's internal representation.
Third, the physical level. The data is bit units stored on disk or in RAM. The Unicode codes are translated into specific bit units by an encoding method such as UTF-8 so that they can be stored in the computer. In fact, characters and bit units are not in one-to-one correspondence: the same character can be stored as different bit units under different encoding methods.
In short: characters (users) ⇌ Unicode ⇌ encoding method (UTF-8) ⇌ bytes ⇌ disk, network, RAM
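The chain above can be traced for a single character. The following is a small sketch using only Python built-ins; the character 語 is my own choice of example, and UTF-16 is added only to show that the same code can become different bytes.

```python
ch = "語"  # example character chosen for illustration

# Logic level: the character's Unicode code (code point).
code_point = ord(ch)
print(f"U+{code_point:04X}")

# Physical level: the same code becomes different bytes per encoding method.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")
print(utf8_bytes.hex(), len(utf8_bytes))
print(utf16_bytes.hex(), len(utf16_bytes))

# Going back up the chain: bytes -> code -> character.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == ch)
```

Here one character maps to one Unicode code, but to three bytes in UTF-8 and two bytes in UTF-16, which is the sense in which characters and bit units are not in one-to-one correspondence.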
As for the DBMS, I do not have experience with it, but I can try to explain the system in levels:
First, the application level. Data is what we see and enter through applications that are designed for users and that ask them to enter data in a specific format through a data entry form. This data is transmitted to the DBMS, which translates it into data packets for the database. The packets returning from the database to the DBMS are likewise translated into information people can understand directly and then shown in the application. The result might be a data form or a specific error message, such as one for wrong input.
Second, the system level. Data is SQL statements and the computer's code structures. The parser and grant checking examine the SQL statements coming from the application. The data then passes to semantic analysis and query treatment for understanding and classification. Access management, concurrency control, and the recovery mechanism operate according to the type of data and send instructions to the database.
Third, the physical level. Data is bit units. The system distinguishes the representations of different bit units, identifies the access types, and finds the matching storage location in the database according to the data types and the system's file organization.
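The three levels can be loosely illustrated with SQLite, which ships with Python. This is only a sketch under my own assumptions: the `student` table and the form input are hypothetical, and SQLite is an embedded engine, so it stands in for a full DBMS rather than showing every component named above.

```python
import sqlite3

# An in-memory database stands in for the DBMS and its storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Application level: data as a user would type it into a data entry form.
form_input = {"name": "Ann"}  # hypothetical form data

# System level: the application hands the DBMS an SQL statement to parse.
conn.execute("INSERT INTO student (name) VALUES (?)", (form_input["name"],))
conn.commit()
rows = conn.execute("SELECT id, name FROM student").fetchall()
print(rows)

# Physical level: hex() exposes the bytes SQLite stores for the text value.
stored_hex = conn.execute("SELECT hex(name) FROM student").fetchone()[0]
print(stored_hex)

# A malformed statement is rejected by the parser and reported as an error.
try:
    conn.execute("SELEC name FROM student")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)
print(error_message)
```

The same value appears as a form field, as a parameter of an SQL statement, and as stored bytes, and the failed statement shows how an error travels back up to the application.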
First, metadata is considered to be data about data. I am confused by this statement. What is the difference between metadata and the data that describes the rules for how to use data in the system? Does the latter belong to the former?
Second, in the reading, I learned how the Unicode system shows characters on the screen. But I found that the example in the introduction to data and databases uses 6 characters to constitute one character on the screen. So can I conclude that the characters of a language do not correspond one-to-one to Unicode codes?
Buckland, M. K. (2017). Information and society. The MIT Press.
Irvine, M. (n.d.). CCTP-607 Universes of Data: Distinguishing Kinds and Uses of “Data” in Computing and AI Applications. 9.
Kroenke, D. M., Auer, D. J., Vandenberg, S. L., & Yoder, R. C. (2017). Database concepts (Eighth edition). Pearson.