The Unicode Character Set

The Unicode character set maps characters to integer code points. For instance, the Latin letter A is assigned the code point 65. The Greek letter Σ is assigned the code point 931. The musical symbol 𝄢 (the F clef) is assigned the code point 119,074. Unicode has room for over one million characters, which is enough to hold every character from all the world's scripts. The current version of Unicode (4.1) defines 97,655 different characters from many languages, including English, Russian, Arabic, Hebrew, Greek, Korean, Chinese, Japanese, and Sanskrit.
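The mapping between characters and code points can be inspected directly from Java. The following is a minimal sketch, assuming Java 5 or later; the class name CodePointDemo and its helper methods are hypothetical names chosen for illustration. Note that a code point above 65,535 occupies two Java chars (a surrogate pair), which is why the code uses codePointAt() rather than charAt().

```java
class CodePointDemo {

    // Returns the code point of the first character in s.
    // codePointAt() combines a surrogate pair into a single code point.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    // Builds a String containing the single character with the given code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(firstCodePoint("A"));       // 65
        System.out.println(firstCodePoint("\u03A3"));  // 931 (Greek capital sigma)

        // Code point 119,074 lies outside the 16-bit range of a Java char,
        // so the resulting String holds two chars but only one code point.
        String clef = fromCodePoint(119074);
        System.out.println(clef.length());                          // 2
        System.out.println(clef.codePointCount(0, clef.length()));  // 1
    }
}
```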

The first 128 Unicode characters (characters 0 through 127) are identical to the ASCII character set. The ASCII space is 32; therefore, 32 is the Unicode space. The ASCII exclamation point is 33, so 33 is the Unicode exclamation point, and so on. Table A-1 in Appendix A shows this character set. The next 128 Unicode characters (characters 128 through 255) have the same values as the equivalent characters in the Latin-1 character set defined by ISO standard 8859-1. Latin-1, a slight variation of which is used by Windows, adds the various accented characters, umlauts, cedillas, upside-down question marks, and other characters needed to write text in most Western European languages. Table A-2 shows these characters. The first 128 characters in Latin-1 are identical to the ASCII character set.
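One way to check this correspondence is to decode each possible byte value as Latin-1 and compare the result with the byte's numeric value. The sketch below assumes Java 7 or later for StandardCharsets; the class name Latin1Demo and its methods are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;

class Latin1Demo {

    // Decodes a single byte (given as a value from 0 to 255) as ISO 8859-1
    // and returns the Unicode code point of the resulting character.
    static int decodeLatin1(int byteValue) {
        byte[] bytes = { (byte) byteValue };
        String s = new String(bytes, StandardCharsets.ISO_8859_1);
        return s.codePointAt(0);
    }

    // Returns true if every Latin-1 byte value decodes to the
    // Unicode code point with the same number.
    static boolean latin1MatchesUnicode() {
        for (int b = 0; b <= 255; b++) {
            if (decodeLatin1(b) != b) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(decodeLatin1(32));        // 32, the space
        System.out.println(decodeLatin1(233));       // 233, e with acute accent
        System.out.println(latin1MatchesUnicode());  // true
    }
}
```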

Unicode is divided into blocks. For example, characters 0 through 127 are the Basic Latin block and contain ASCII. Characters 128 through 255 are the Latin-1 Supplement block and contain the upper 128 characters of the Latin-1 character set. Characters 9984 through 10,175 are the Dingbats block and contain the characters in the popular Zapf Dingbats font. Characters 19,968 through 40,959 are the unified Chinese-Japanese-Korean ideograph block.
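Java exposes block membership through the Character.UnicodeBlock class, whose of(int) method (available since Java 5) returns the block containing a given code point. The following sketch looks up a few code points; the class name BlockDemo and the helper blockOf are hypothetical names used only here.

```java
class BlockDemo {

    // Returns the Unicode block containing the given code point,
    // or null if the code point does not belong to any defined block.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // Each sample is the first code point of the range it illustrates,
        // except 931, which is the Greek capital sigma.
        int[] samples = { 65, 931, 9984, 19968 };
        for (int codePoint : samples) {
            System.out.println(codePoint + " is in block " + blockOf(codePoint));
        }
    }
}
```

Comparing the returned object against constants such as Character.UnicodeBlock.DINGBATS is usually more robust than comparing block names as strings.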

For complete lists of all the Unicode characters and associated glyphs, the canonical reference is The Unicode Standard, Version 4.0 by the Unicode Consortium (ISBN 0-321-18578-1). Online versions of the character tables can be found at http://unicode.org/charts/.

Although internally Java can handle full Unicode data (code points are just numbers, after all), not all Java environments can display all Unicode characters. The biggest problem is the lack of fonts. Few computers have fonts for all the scripts Java supports. Even computers that have access to the necessary fonts often can't install many of them because of their size. A normal, 8-bit outline font ranges from about 30 to 60 KB. A Unicode font that omits the Han ideographs will be about 10 times that size. A Unicode font that includes the full range of Han ideographs will occupy between 5 and 7 MB. Furthermore, text display algorithms based on English often break down when faced with right-to-left languages like Hebrew and Arabic, vertically written languages like the traditional Chinese used in Taiwan, or context-sensitive languages like Arabic.
