Character Sets and Unicode

2017-11-03 09:05:01

We live on a planet on which many languages are spoken. I can walk out my front door in Brooklyn and hear people conversing in English, French, Creole, Hebrew, Arabic, Spanish, and languages I don't even recognize. The Internet is even more diverse than Brooklyn. A local doctor's office that sets up a storefront on the Web to sell vitamins may soon find itself shipping to customers whose native languages are Chinese, Gujarati, Turkish, German, Portuguese, or something else. There's no such thing as a local business on the Internet.

However, the first computers and the first programming languages were mostly designed by English-speaking programmers in countries where English was the native language. These programmers designed character sets that worked well for English text, though not much else. The preeminent such set is ASCII. Since ASCII is a 7-bit character set, each ASCII character can be represented as a single byte, signed or unsigned. Thus, it's natural for ASCII-based programming languages, such as C, to equate the character data type with the byte data type. In these languages, the same operations that read and write bytes also read and write characters.

Unfortunately, ASCII is inadequate for almost all non-English languages. It contains no cedillas, umlauts, betas, thorns, or any of the other thousands of non-English characters used around the world. Fairly shortly after the development of ASCII there was an explosion of extended character sets, each of which encoded the basic ASCII characters plus the additional characters needed for another language, such as Greek, Turkish, Arabic, Chinese, Japanese, or Russian. Many of these character sets are still used today, and much existing data is encoded in them.
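To make ASCII's limits concrete, here is a minimal Java sketch. The class name `AsciiLimits` and its two helper methods are invented for illustration. It assumes only the `US_ASCII` constant from `java.nio.charset.StandardCharsets`, which every Java platform is required to support. It checks whether a string fits in ASCII and shows what happens to a c with cedilla when it is forced through an ASCII encoding.

```java
import java.nio.charset.StandardCharsets;

public class AsciiLimits {
    // Returns true if every character of s can be encoded in US-ASCII.
    static boolean fitsInAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    // Encodes s as ASCII bytes and decodes it again.
    // String.getBytes(Charset) substitutes the charset's replacement
    // byte for unmappable characters; for US-ASCII that byte is '?'.
    static String throughAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String english = "facade";
        String french = "fa\u00E7ade"; // \u00E7 is c with cedilla

        System.out.println(fitsInAscii(english));  // true
        System.out.println(fitsInAscii(french));   // false
        System.out.println(throughAscii(french));  // fa?ade
    }
}
```

The cedilla is silently lost in the round trip, which is the kind of damage that pushed people toward the extended character sets.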

However, these character sets are still inadequate for many needs. For one thing, most assume that you want to encode only English plus one other language. This makes it difficult for a Russian classicist to write a commentary on an ancient Greek text, for example. Furthermore, documents are limited by their character sets. Email sent from Morocco may become illegible in India if the sender is using an Arabic character set but the recipient is using a Devanagari one.
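The illegible-email problem can be reproduced in a few lines. The sketch below is an illustration, not the scenario from the text: instead of Arabic and Devanagari character sets it uses UTF-8 and ISO-8859-1, because those two are guaranteed to be available on every Java platform. The class `CharsetMismatch` and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    // The sender encodes with UTF-8, but the recipient wrongly
    // decodes the same bytes as ISO-8859-1 (Latin-1).
    static String sendAndMisread(String text) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Sender and recipient agree on the encoding.
    static String sendAndRead(String text) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // \u00E9 is e with acute accent

        // Same bytes, same length on the wire, different interpretation.
        System.out.println(sendAndRead(original).equals(original));    // true
        System.out.println(sendAndMisread(original).equals(original)); // false
        // UTF-8 encodes \u00E9 as two bytes (0xC3 0xA9), so the
        // misread string has one extra character.
        System.out.println(sendAndMisread(original).length());         // 5
    }
}
```

The bytes are never corrupted in transit. The text becomes garbage only because the two sides disagree about which character set those bytes belong to.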

The Unicode character set is the end result of an ongoing international effort to create a single character set that everyone can use. Unicode supports the characters needed for English, Arabic, Cyrillic, Greek, Devanagari, and many other languages. Unicode isn't perfect (there are some omissions and redundancies), but it is the most comprehensive character set yet devised for all the languages of planet Earth.

Java adopts Unicode as its native character set. Java chars and strings are Unicode (more specifically, the UTF-16 encoding of the Unicode character set). However, since there's also a lot of non-Unicode legacy text in the world, in a dizzying array of encodings, Java provides classes to read and write text in those encodings as well.
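As a rough illustration of both points, the sketch below first shows that a Java `char` is a UTF-16 code unit rather than a whole character, and then decodes legacy bytes through an `InputStreamReader` given an explicit charset. The class `JavaTextDemo` and its `decode` helper are invented for this example, and ISO-8859-1 stands in for whatever legacy encoding the data actually uses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class JavaTextDemo {
    // Counts Unicode code points, as opposed to UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Decodes raw bytes into a Java (UTF-16) string using the given charset.
    static String decode(byte[] data, Charset charset) {
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range,
        // so UTF-16 stores it as a surrogate pair of two chars.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());     // 2
        System.out.println(codePoints(clef));  // 1

        // "caf" followed by byte 0xE9, which is e-acute in ISO-8859-1.
        byte[] legacy = { 'c', 'a', 'f', (byte) 0xE9 };
        String text = decode(legacy, StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("caf\u00E9")); // true
    }
}
```

Naming the charset explicitly, rather than relying on the platform default, keeps the result the same on every machine.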
