Abstract
The Internet has become multilingual, so a single standard encoding system that enables the exchange of electronic text is necessary. The Unicode Standard is the basis of software that can function all around the world, and it provides the underpinning for the World Wide Web and today’s global business environment. Chinese characters, which belong to the Han ideographs, were handled by other encoding systems before the Unicode Standard. However, those systems have several disadvantages and are not suitable for today’s multilingual world. The Unicode Standard not only solves these problems but also helps non-English languages to be transmitted online in the globalized environment.
Keywords: Unicode, multilingual Internet, Han Ideographs, globalization, Chinese characters
1.Introduction
The Unicode Standard is the universal encoding and computing industry standard for written characters and text. Unicode solves the discontinuity of the multilingual Internet. It defines a consistent way of encoding multilingual text that enables the exchange of text data in the multilingual environment and creates the foundation for global software.
Unicode is the basis of software that can be used all around the world; it is required by the new Internet protocols and implemented in all modern operating systems. As a universal standard, Unicode aims to replace many hundreds of conflicting ways of encoding characters with a single one.
Whereas ASCII (the American Standard Code for Information Interchange) uses a single 7-bit code, Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16) and an 8-bit form (UTF-8). The Unicode Standard is code-for-code identical with International Standard ISO/IEC 10646.
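The three encoding forms can be compared with a minimal Python sketch. The sample characters and the helper name `encoded_lengths` are illustrative choices, not part of the Standard; the sketch only counts how many bytes each form needs for the same code point.

```python
# Sample characters: an ASCII letter, a common Han character,
# and a character from CJK Unified Ideographs Extension B.
samples = ["a", "中", "\U00020000"]

def encoded_lengths(ch):
    """Return the number of bytes ch occupies in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # explicit byte order, so no BOM is added
        "utf-32": len(ch.encode("utf-32-be")),
    }

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UTF-32 always uses four bytes, while UTF-8 and UTF-16 use a variable number of 8-bit or 16-bit units depending on the code point.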
The Unicode Standard has many advantages. With the Unicode Standard, the information technology industry has replaced proliferating character sets with data stability, global interoperability and data interchange, simplified software and reduced costs. The Unicode character encoding treats alphabetic characters, ideographic characters and symbols equivalently, which means they can be used in any mixture and with equal facility. Its universality is also reflected in its coverage: it is sufficient not only for modern communication in the world’s languages but also for representing the classical forms of many languages. In addition, the Unicode Standard is more efficient and flexible than previous encoding systems; it satisfies the needs of technical and multilingual computing and encodes a broad range of characters for all purposes, including worldwide publication.
However, the Unicode Standard also has disadvantages. As the Internet was emerging as a global phenomenon, commentators often noted that it appeared to be a primarily English-language domain. It is often argued that while minority languages are given an online voice by Unicode, the context is still one of Western power. Besides, the Unicode Standard does not encode idiosyncratic, personal, novel or private-use characters, nor does it encode logos or graphics. Consequently, the Unicode Standard continues to respond to new and changing encoding requirements and to scholarly needs. To preserve world cultural heritage, important archaic scripts are encoded as consensus about their encoding develops.
2.The History of Chinese Encoding System
2.1 ASCII and Its Disadvantages for Chinese Characters
When computers store letters, they encode them as numbers in binary form. When another computer wants to put these letters on the screen, it converts the numbers back into letters by consulting a map, which tells it, for example, that the code number 97 represents the letter ‘a’. Originally based on the English alphabet, ASCII is a 7-bit code that encodes 128 specified characters as seven-bit integers, as shown by the ASCII chart above. Ninety-five of the encoded characters are printable: these include the digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, and punctuation symbols.
ASCII is sufficient for writing text in English. However, it poses a problem for languages with extra letters, symbols or accents. Therefore, different countries began developing their own encoding systems.
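The character-to-number map described here can be sketched in a few lines of Python. The helper `is_ascii` is a hypothetical name introduced only for illustration; the count of printable characters includes the space character.

```python
code = ord("a")      # character -> code number
letter = chr(97)     # code number -> character

# ASCII covers the 128 code numbers 0..127; count the printable ones.
printable = [chr(i) for i in range(128) if chr(i).isprintable()]

def is_ascii(text):
    """Return True if every character fits in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)

print(code, letter, len(printable))            # 97 a 95
print(is_ascii("hello"), is_ascii("中文"))      # True False
```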
2.2 GB2312-80, GBK and GB18030
Chinese, which is not written with the Latin alphabet, has long posed an encoding problem. Before the Unicode Standard existed, three encoding standards were used in different Chinese-speaking regions. The Chinese national standard ‘GB2312-80’ is mainly used in mainland China and encodes 6,763 simplified Chinese characters. The ‘Big5’ encoding system is used in Taiwan and encodes about 13,000 traditional Chinese characters. The ‘HKSCS’ encoding system is used in Hong Kong and also covers traditional Chinese characters. However, ‘Big5’ and ‘HKSCS’ are two different encoding systems.
These three encoding systems all extend ASCII. In these systems, one Chinese character, whether simplified or traditional, is represented by two bytes, so the systems remain compatible with ASCII. However, the three encoding systems are not compatible with one another, so it is almost impossible to display GB and Big5 text in the same system. As a result, simplified and traditional Chinese characters could not be shown on the same screen.
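The incompatibility can be illustrated with Python's built-in `gb2312` and `big5` codecs. This is a minimal sketch; the sample character 中 is an arbitrary choice, and the variable names are illustrative.

```python
ch = "中"

gb_bytes = ch.encode("gb2312")    # mainland China standard
big5_bytes = ch.encode("big5")    # Taiwan standard

# ASCII text is left unchanged by both encodings (ASCII compatibility).
ascii_unchanged = "A".encode("gb2312") == "A".encode("big5") == b"A"

# Reading GB2312 bytes as if they were Big5 yields a different character
# (or a replacement mark), which is why the two could not share one screen.
misread = gb_bytes.decode("big5", errors="replace")

print(gb_bytes.hex(), big5_bytes.hex(), ascii_unchanged, misread == ch)
```

Both encodings use two bytes for the character, but not the same two bytes, so a program has to know in advance which encoding a file uses.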
Another problem with GB2312-80 is that it contains so few characters that the characters of China’s ethnic minorities are not included. Moreover, the bigger problem is that Chinese characters do not have a code space of their own. Most computers already use ASCII to store English characters, and some software uses the extended ASCII range to draw symbols. When such software is run on a Chinese system, some of these symbols are mistaken for Chinese characters, which can cause trouble. Also, when a sentence combines Chinese and English characters, the system can be confused about whether a given byte belongs to ASCII or to GB2312-80.
The GBK character set was defined in 1993 as an extension of GB2312-80; it also includes the characters of GB13000.1-93 by using the code points left unused in GB2312. GBK is supported by operating systems such as Windows and Linux. GB18030 is a superset of GBK and adds further characters, including thousands of characters used by China’s ethnic minorities. However, at present no operating system uses GB18030 directly.
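The superset relationship between GB2312, GBK and GB18030 can be checked with Python's built-in codecs. This is a sketch under stated assumptions: the helper `can_encode` and the three sample characters are illustrative choices, and the results reflect Python's codec tables rather than any particular operating system.

```python
def can_encode(ch, codec):
    """Return True if the codec has a byte sequence for ch."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

simplified = "龙"          # simplified form, present in GB2312
traditional = "龍"         # traditional form, added by GBK
rare = "\U00020000"        # Extension B ideograph, covered only by GB18030

for ch in (simplified, traditional, rare):
    coverage = {codec: can_encode(ch, codec) for codec in ("gb2312", "gbk", "gb18030")}
    print(f"U+{ord(ch):04X}", coverage)

# A character shared by all three keeps the same bytes in each encoding.
same_bytes = simplified.encode("gb2312") == simplified.encode("gbk") == simplified.encode("gb18030")
print(same_bytes, len(rare.encode("gb18030")))
```

Each later standard keeps the byte sequences of the earlier one and only adds new ones, which is what makes the extension backward compatible.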
2.3 The Usage of The Unicode Standard
The consensual solution to the problem of encoding has been provided by the Unicode Consortium, whose website declares: ‘Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.’ In other words, the Unicode Standard provides one huge, universal code sheet that includes all scripts and alphabets, instead of requiring each to have its own code sheet. The Unicode Standard offers a standardized way of encoding documents in all languages and provides a unified representation for every single character. That is to say, the Unicode Standard solves the problems of GB2312-80 described above and provides a universal encoding system for all Chinese characters, whether simplified or traditional, and whether used by the Han people or by China’s ethnic minorities.
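As a minimal sketch of the "unique number for every character" idea, the following Python snippet mixes English text with simplified and traditional Chinese characters in one string. The sample text and the helper `code_points` are illustrative assumptions.

```python
# English, simplified and traditional characters side by side in one string.
text = "Unicode 说 説 台 臺"

def code_points(s):
    """List the code point of every non-space character in s."""
    return [f"U+{ord(ch):04X}" for ch in s if not ch.isspace()]

print(code_points(text))

# The whole mixed string survives a round trip through a single encoding form.
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)
```

No switching between code sheets is needed: every character in the string has its own code point, regardless of script or region.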
3.The History of Chinese Characters
Chinese characters, unlike alphabetic writing, are not formed from letters or combinations of letters that represent the sounds of the Chinese language. Rather, they are symbols constructed and used to convey meanings as well as sounds that indicate meaning (Yin, 2006). According to Chinese legend, Cangjie, the historian of the Yellow Emperor, created the original Chinese characters in 2650 BCE, modelling them on the shapes of the sun, the moon, the footprints of animals and so on.
The history of Chinese characters can be divided into two major periods: ancient writing and modern writing. There are six major writing styles associated with these two periods.
Initially, in the Shang Dynasty (1711-1066 BC), oracle bone script was the form of Chinese characters inscribed on tortoise shells and animal bones. The oracle bone script of the late Shang appears pictographic, as does its contemporary, the Shang writing on bronzes. Later, in the Zhou Dynasty (1066-256 BC), characters were cast or inscribed on bronze bells and vessels, a form called bronze inscription. Compared with the bronze script, oracle bone script is clearly greatly simplified, and rounded forms are often converted to rectilinear ones; this is thought to be due to the difficulty of engraving the hard, bony surfaces, compared with the ease of writing in the wet clay of the molds the bronzes were cast from.
Towards the end of the Zhou Dynasty, the Qin State began to use bamboo strips and pieces of silk as writing media and created a new script called ‘Seal Script’. After the Qin State conquered the other six states, unified China and established the Qin Dynasty, the Seal Script was decreed the official writing standard for the whole country. At this time, all the characters were roughly square in shape, and the positioning of characters and the complexity of their forms became consistent. Small Seal Script has also been proposed for inclusion in Unicode.
However, the seal script was time-consuming and cumbersome to write, so a more concise and easier script was needed. Therefore, in the Han Dynasty (206 BC – 220 AD), the ‘Clerical Script’ became the officially approved formal way of writing. The largest change from seal script to clerical script was that clerical script dropped the pictorial appearance of Chinese characters almost completely and established the structural foundation of modern Chinese characters.
Since the clerical script, the structure of Chinese characters has not changed. However, the strokes have undergone two major changes: regularization and normalization. From the late Han Dynasty to 1955, the strokes of Chinese characters were smoother and straighter than those of the clerical script. The regularized clerical script is clearer and easier to read and write, and it became widespread. It came to be used for everyday communication and has been the standard of Chinese writing for more than 1,800 years. In the second half of the 20th century, a special government organization, first called the Committee for Chinese Language Reform and later the National Language Commission, began to normalize Chinese characters to make them systematic, simplified and standardized. In 1955, to systematize Chinese characters, the ‘List of First Group of Standardized Form of Variant Characters’ was officially published and 1,027 character variants were eliminated. The number of strokes in 2,235 of the characters was systematically reduced. The forms of characters for printing type and the stroke order were standardized and normalized.
From oracle bone script to normalized clerical script, Chinese characters have moved from visualization to symbolization. The graphics and meanings of Chinese characters correspond to the signifier and the signified in the terms of Ferdinand de Saussure. For each Chinese character, its graphic form could convey its specific meaning, and that is how oracle bone script initially developed. There are three forms of relationship between the signifier and the signified: symbol/symbolic, icon/iconic and index/indexical. Initially, the relationship between the graphics and meanings of Chinese characters was iconic. However, as time went on, Chinese characters were modified and normalized so much that their graphics came to resemble their meanings less and less. Building on earlier characters, the standardized and normalized Chinese characters also include abstract cultural notions and embody more symbolic relationships between graphics and meanings.
As for their form, Chinese characters are monospaced, and each character takes the same vertical and horizontal space, regardless of how simple or complex its particular form is. This is related to the history of Chinese printing and typographical practice. The earliest form of Chinese printing is woodblock printing, which was already in use before 220 AD and became widespread in the Tang Dynasty. Woodblock printing accelerated the transmission of words and knowledge; however, all the words on a page had to be carved on a single woodblock, so one small mistake could cause serious trouble. In response, movable type was invented by Bi Sheng in the Song Dynasty, with each character placed in a square cell. For alphabetic scripts, movable-type page setting was quicker than woodblock printing. The metal type pieces were more durable and the lettering was more uniform, leading to typography and fonts. The variety of glyphs used to depict characters in the Han ideographic repertoire of the Unicode Standard gives users the ability to select the font that is most appropriate for a given locale.
4.The Introduction of Han Ideographs
4.1 What Is Han Ideographs And The Necessity of Han Unification
The Unicode Standard contains a set of unified Han ideographic characters used in the written CJK languages. The term ‘CJK’, which stands for Chinese, Japanese and Korean, is used to describe the languages that currently use Han ideographic characters. The term Han, derived from the Chinese Han Dynasty, refers generally to traditional Chinese culture. Traditionally, the script was written vertically from right to left. However, in modern usage, the Han script is written horizontally from left to right. Han ideographs are logographic characters, which means that each character represents a word, not just a sound. The Han characters developed from pictographic and ideographic principles, and they can also be used phonetically.
The full set of CJK characters is very large: the number of distinct ideographs may approach or exceed 100,000. Apart from the forms of Chinese characters that have changed through use in other countries such as Japan and Korea, there are currently two main varieties of written Chinese: ‘simplified Chinese’, which is used in mainland China and Singapore, and ‘traditional Chinese’, which is used predominantly in Hong Kong, Macau, Taiwan and other overseas Chinese communities. Converting between simplified and traditional Chinese is a complex process because a single simplified character may correspond to multiple traditional characters. For example, the simplified character U+53F0 台 corresponds to U+6AAF 檯, U+81FA 臺 and U+98B1 颱.
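The one-to-many relationship in the 台 example can be sketched as a lookup table in Python. The table `SIMPLIFIED_TO_TRADITIONAL` and the function `traditional_candidates` are hypothetical names holding only the single example from the text; a real converter would need a far larger table plus word-level context.

```python
# Hypothetical one-entry table built from the example in the text.
SIMPLIFIED_TO_TRADITIONAL = {
    "台": ["檯", "臺", "颱"],
}

def traditional_candidates(ch):
    """Return every traditional character a simplified character may stand for."""
    # Characters without an entry are assumed to be shared by both varieties.
    return SIMPLIFIED_TO_TRADITIONAL.get(ch, [ch])

for cand in traditional_candidates("台"):
    print(f"U+{ord(cand):04X}", cand)

# With more than one candidate, a character-by-character conversion
# cannot decide which one is right without looking at the context.
ambiguous = len(traditional_candidates("台")) > 1
print(ambiguous)
```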
Moreover, vocabulary differences have arisen between Mandarin as spoken in Mainland China and in Taiwan. For example, both 旅游 (lǚ yóu) in Mainland China and 观光 (guān guāng) in Taiwan mean ‘tourism’ in English. Consequently, merely converting the character content of a text from simplified Chinese to the appropriate traditional Chinese, or vice versa, is insufficient. The relationship between traditional and simplified characters is not one-to-one. However, the vast majority of Chinese characters are the same in both simplified and traditional Chinese.
Simplified and traditional Chinese cover the same character repertoire, and the official Chinese encoding standards give each form its own coding. There are two national standards in mainland China, GB2312-80 and GB12345-90. The former is used to represent simplified Chinese, while the latter is used to represent traditional Chinese. Similarly, the Unicode Standard contains a number of distinct simplifications for characters, such as U+8AAC 説 (shuō) and U+8BF4 说 (shuō). Where the simplified and traditional forms exist as different encoded characters in the Unicode Standard, each should be used as appropriate.
Chinese is a language with different spoken forms that share a single written form. The spoken forms other than Mandarin are called dialects, although some of them are mutually unintelligible and are in effect distinct languages. For example, Cantonese, which is used in Hong Kong and Macau, is a different spoken form from Mandarin, although the two share the same written form. Apart from the dialects, the standard form of written Chinese that was derived from classical Chinese is called literary Chinese. Although it is not used in everyday speech, it can still be seen in print and online. Despite the complexity of Chinese characters, the ideographic repertoire of the Unicode Standard is sufficient for all but the most specialized texts of modern Chinese, literary Chinese and classical Chinese. For the dialects, the current ideographic repertoire of the Unicode Standard should be adequate for many, but not all, written texts.
4.2 The Unicode Standard Defined How Characters Are Interpreted Based On Context
The difference between identifying a character and rendering it on screen or paper is crucial for understanding the Unicode Standard’s role in text processing. The character identified by a Unicode code point is an abstract entity. Here it is important to distinguish the notions of ‘character’, ‘glyph’ and ‘grapheme’.
A character is the smallest component of written language that has semantic value. It is an abstract concept rather than a particular way of drawing the thing. Letters are therefore characters, and so are numbers, punctuation marks and many symbols. The mark made on the screen or paper is called a glyph. A glyph is the visual representation of a character. Generally most or all glyphs are mapped to characters via a table in the font. A grapheme is the smallest abstract unit of meaning in a writing system; it is anything that functions as a character in a specific language’s written tradition.
The Unicode Standard does not define glyph images. That is to say, the Unicode Standard defines how characters are interpreted rather than how glyphs are rendered. The software or hardware rendering engine of the computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the precise shape, size or orientation of on-screen characters. Consequently, the successful encoding, processing and interpretation of text requires an appropriate definition of the useful elements of text and of the basic rules for interpreting text.
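The abstract nature of a character can be seen in Python's standard `unicodedata` module, which exposes a character's code point and name but nothing about its shape. This is a minimal sketch; the helper `describe` is a hypothetical name.

```python
import unicodedata

def describe(ch):
    """Return the code point and the Unicode character name of ch."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe("A"))    # U+0041 LATIN CAPITAL LETTER A
print(describe("中"))   # U+4E2D CJK UNIFIED IDEOGRAPH-4E2D
```

Nothing in this data says how the character should be drawn; the glyph actually displayed is chosen by whatever font the rendering engine uses.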
For many centuries, written Chinese was accepted as the written standard throughout East Asia. The influence of Chinese characters on other modern East Asian languages is similar to the influence of Latin on Western languages. However, as time went on, the evolution of character shapes and semantic drift over the centuries resulted in changes to the original forms and meanings. For example, the Chinese character ‘汤’ (tāng) originally meant ‘hot water’. It now means ‘soup’ in Chinese. However, ‘hot water’ remains the primary meaning in Japanese and Korean, whereas ‘soup’ appears in more recent borrowings from Chinese, such as ‘soup noodles’. Still, the identical appearance and similarities in meaning are dramatic and more than justify the concept of a unified Han script that transcends language.
There is some concern that different meanings of the same character used in different countries will lead to confusion. However, computationally, Han characters are often combined to ‘spell’ words, and their interpretation depends on the context. It is neither practical nor productive to encode each meaning of a character separately. There are two reasons for this.
First, the meaning of a word may not be evident from its constituent characters; the characters need to be combined to express the word. For example, the character ‘矛’ (máo) means ‘spear’ and the character ‘盾’ (dùn) means ‘shield’. However, the compound ‘矛盾’ (máo dùn) means ‘contradiction’ in Chinese (see Figure 4-1).
Figure 4-1. Han Spelling
Second, the computer requires context to distinguish the meanings of the words represented by coded characters. One word may have different meanings in different contexts. For example, the word ‘杜鹃’ (dù juān) may refer to the rhododendron, which is a kind of plant, or to the cuckoo, which is a kind of bird, depending on its context (see Figure 4-2).
Figure 4-2. Semantic Context for Han Characters
4.3 The Rationales of Han Unification
Han unification is an effort to map the multiple character sets of the CJK languages into a single set of unified characters. The same Han root character may have different visual representations in Traditional Chinese, Simplified Chinese, Japanese and Korean. For example, the first stroke of ‘户’ (hù) has three different visual representations. These three forms can be unified under the same code since they share the same root character.
One important motivation for Han unification is therefore the desire to limit the size of the Unicode character set. Before the Unicode Standard, different countries used different encoding systems that were not compatible with each other, so characters that evolved from the same root character could not be matched with one another. The Unicode Standard sets out to solve this problem.
According to the Unicode Consortium, one rationale of Han unification is the Source Separation Rule: if two ideographs are distinct in a primary source standard, then they are not unified. In addition, Unicode assigns separate codes whenever the abstract meaning changes. That is to say, in Han unification characters are not unified by their appearance but by their definition or meaning. In general, if two ideographs are unrelated in historical derivation, then they are not unified. For example, ‘日’ and ‘曰’ have two different codes because they are historically unrelated, although they look similar.
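The ‘日’ and ‘曰’ example can be verified directly. In this short Python sketch the variable names are illustrative, and the NFKC check is an added illustration (normalization is not discussed in the text): it shows that even Unicode's own compatibility folding keeps the two characters apart.

```python
import unicodedata

sun, say = "日", "曰"   # similar shapes, unrelated origins

def code_point(ch):
    """Format the code point of ch in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

print(code_point(sun), code_point(say))   # U+65E5 U+66F0

# Compatibility normalization does not merge the two characters.
same_after_nfkc = unicodedata.normalize("NFKC", sun) == unicodedata.normalize("NFKC", say)
print(same_after_nfkc)                    # False
```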
To deal with the use of different graphemes for the same unified Han sememe, Unicode relies on several mechanisms. The first is to treat the difference simply as a font issue, so that different fonts are used to render Chinese, Japanese or Korean, with the user’s environment settings determining which glyph to use. However, this can cause confusion in multilingual text. The second mechanism is the variation selector, which is treated as a combining character with no associated diacritic or mark. By combining with a base character, a variation selector signals that the two-character sequence selects a grapheme variation, or a variation of the base abstract character. Such a two-character sequence can easily be mapped to a separate single glyph. Since the Unicode Standard has assigned 256 separate variation selectors, it can assign 256 variations to any Han ideograph; this is sufficient for variations that are specific to one language or another, and it enables plain text to encode such grapheme variations.
Han unification has caused considerable controversy, particularly among the Japanese public, who, with the nation’s literati, have a long history of protesting the culling of historically and culturally significant variants. This is because small differences in graphical representation are problematic when they affect legibility or belong to the wrong cultural tradition, and the widespread use of Unicode would make it difficult to preserve such small distinctions. Much of the controversy surrounding Han unification is based on the distinction between glyphs, as defined in Unicode, and the related but distinct idea of graphemes. Unicode assigns codes to abstract characters (graphemes), as opposed to glyphs, which are particular visual representations of a character in a specific typeface.
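The variation-selector mechanism can be sketched in Python as follows. The two code point ranges are not given in the text; they are the ranges Unicode assigns to VARIATION SELECTOR-1 through VARIATION SELECTOR-256. The base character 中 and the helper names are arbitrary choices for illustration, and whether a font actually provides a distinct glyph for this particular sequence is not checked here.

```python
import unicodedata

# Variation selectors occupy two ranges: VS1-VS16 and VS17-VS256.
VS_RANGES = [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]
total_selectors = sum(hi - lo + 1 for lo, hi in VS_RANGES)   # 16 + 240

def is_variation_selector(ch):
    """Return True if ch is one of the 256 variation selectors."""
    return any(lo <= ord(ch) <= hi for lo, hi in VS_RANGES)

def strip_variation_selectors(text):
    """Drop the selectors, leaving only the base abstract characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

base = "中"
sequence = base + chr(0xE0100)   # base character followed by VARIATION SELECTOR-17

print(total_selectors, len(sequence))
print(unicodedata.name(sequence[1]))
print(strip_variation_selectors(sequence) == base)
```

Because the selector is an ordinary code point in the text stream, the variation survives in plain text, and software that does not care about the variant can simply ignore it.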
4.4 CJK Unified Ideographs Blocks
The Han script includes 87,882 unified ideographic characters defined by national, international and industry standards of China, Japan, Korea, Vietnam and Singapore. Because of the large size of the Han ideographic character repertoire, and because of the particular problems that the characters pose for standardizing their coding, this character block description is more extended than that for other scripts and is divided into several subsections. The blocks are the result of Han unification.
Table 4-1. Blocks Containing Han Ideographs
Block | Range | Comment |
CJK Unified Ideographs | 4E00-9FFF | Common |
CJK Unified Ideographs Extension A | 3400-4DBF | Rare |
CJK Unified Ideographs Extension B | 20000-2A6DF | Rare, historic |
CJK Unified Ideographs Extension C | 2A700-2B73F | Rare, historic |
CJK Unified Ideographs Extension D | 2B740-2B81F | Uncommon, some in current use |
CJK Unified Ideographs Extension E | 2B820-2CEAF | Rare, historic |
CJK Unified Ideographs Extension F | 2CEB0-2EBEF | Rare, historic |
CJK Compatibility Ideographs | F900-FAFF | Duplicates, unifiable variants, corporate characters |
CJK Compatibility Ideographs Supplement | 2F800-2FA1F | Unifiable variants |
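The block layout in Table 4-1 lends itself to a simple range lookup. The following Python sketch is an illustration only: the names `HAN_BLOCKS` and `han_block` are hypothetical, and each range is read as a full block, so Extension F is taken to end at 2EBEF.

```python
# (first code point, last code point, block name)
HAN_BLOCKS = [
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2A700, 0x2B73F, "CJK Unified Ideographs Extension C"),
    (0x2B740, 0x2B81F, "CJK Unified Ideographs Extension D"),
    (0x2B820, 0x2CEAF, "CJK Unified Ideographs Extension E"),
    (0x2CEB0, 0x2EBEF, "CJK Unified Ideographs Extension F"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0x2F800, 0x2FA1F, "CJK Compatibility Ideographs Supplement"),
]

def han_block(ch):
    """Return the name of the Han block containing ch, or None if there is none."""
    cp = ord(ch)
    for first, last, name in HAN_BLOCKS:
        if first <= cp <= last:
            return name
    return None

print(han_block("中"))            # CJK Unified Ideographs
print(han_block("\U00020000"))    # CJK Unified Ideographs Extension B
print(han_block("a"))             # None
```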
Conclusion
The Unicode Standard has many advantages over previous encoding systems, and it plays an important role in the globalized environment. The Unicode Standard takes the history of Chinese characters into consideration and contains a set of unified Han ideographic characters used in written Chinese, Japanese and Korean. Because of the large size of the Han ideographic character repertoire, the Han ideographs are divided into several blocks according to the rules of Han unification. Consequently, the Unicode Standard and the Han ideographs greatly help the communication of Chinese culture in the multilingual and globalized environment.
Bibliography
- Allen, J. D., Anderson, D., Becker, J., Cook, R., Davis, M., Edberg, P., … & Jenkins, J. H. (2012). The Unicode Standard, Version 6.
- Bates, E. (2014). The emergence of symbols: Cognition and communication in infancy. Academic Press.
- Cheng, C. C. (1973). A synchronic phonology of Mandarin Chinese (Vol. 4). Walter de Gruyter.
- Culler, J. D. (1986). Ferdinand de Saussure. Cornell University Press.
- Gillam, R. (2002). Unicode demystified: A practical programmer’s guide to the encoding standard. Addison-Wesley Longman Publishing Co., Inc.
- Hardie, A. (2007). From legacy encodings to Unicode: The graphical and logical principles in the scripts of South Asia. Language Resources and Evaluation, 41(1), 1-25. doi:10.1007/s10579-006-9003-7
- John, N. A. (2013). The construction of the multilingual internet: Unicode, Hebrew, and globalization. Journal of Computer‐Mediated Communication, 18(3), 321-338.
- Unicode Consortium. (1997). The Unicode Standard, Version 2.0. Addison-Wesley Longman Publishing Co., Inc.
- Unicode Consortium. (1991). The Unicode Standard: Worldwide Character Encoding. Addison-Wesley Longman Publishing Co., Inc.
- Yin, J. J. (2006). Fundamentals of Chinese characters: Han zi ji chu. New Haven: Yale University Press.