UTF-8

The American Standard Code for Information Interchange (ASCII) has long been the standard way most computers represent text in their memories and on their disks. Each letter, digit, and symbol is represented by a different seven-bit binary code. Rather than write the codes in binary, they are usually written in numerical form, as numbers from 0 to 127. For example, the code for "A" is 65.

The problem is that while 128 values are enough to represent English uppercase letters, lowercase letters, digits, and punctuation, they're not enough to represent all the accented characters used in European languages, and certainly not enough to represent all the characters used in Japanese, Hebrew, Indian languages, and so on. A newer standard for representing text, called the Universal Character Set (UCS), or Unicode, solves this by providing literally millions of possible codes. The problem is that to work directly with Unicode data, software that manipulates text has to be rewritten, which takes a long time. Also, Unicode is less efficient than ASCII for English text, making it slow to gain popularity among English speakers, who see little benefit in all those extra characters if they're not using them. The word "Hello" in Unicode can take double or even four times the space to store as the same word stored using ASCII.

UCS Transformation Format 8 (UTF-8) solves this problem in a simple, elegant way. A single eight-bit byte in computer memory can hold 256 possible values, from 0 to 255, but ASCII requires only 128 values, leaving the other 128 unused. The first ingenious step is that UTF-8 uses the values 0-127 to represent exactly the same characters as ASCII, so the word "Hello" stored using UTF-8 and the word "Hello" stored using ASCII are exactly the same in memory. UTF-8 is a compatible superset of ASCII.
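These properties are easy to verify. The sketch below (in Python, used here purely for illustration; the discussion itself is language-neutral) shows that a pure-ASCII string encodes to exactly the same bytes in UTF-8 as in ASCII, and that fixed-width Unicode encodings cost two to four times as much for the same English word:

```python
# ASCII and UTF-8 produce byte-for-byte identical output for ASCII text.
text = "Hello"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(list(ascii_bytes))              # [72, 101, 108, 108, 111]
print(ascii_bytes == utf8_bytes)      # True: UTF-8 is a superset of ASCII
print(ord("A"))                       # 65, the ASCII code for "A"

# Fixed-width Unicode encodings pay a size penalty for English text.
print(len(text.encode("utf-16-le")))  # 10 bytes: double the ASCII size
print(len(text.encode("utf-32-le")))  # 20 bytes: four times the ASCII size
```

The little-endian variants are used here only to keep the byte-order mark out of the size comparison.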
So we can declare, by fiat, as it were, that every single ASCII string stored in memory or on disk anywhere in the world is actually a UTF-8 string, and not a single line of software has had to change.

The second step is how UTF-8 represents all those additional non-roman characters. UTF-8 uses the byte values in the range 128-255, unused by ASCII, to represent those characters. Depending on the character, it may be represented in memory as a consecutive sequence of two, three, or more bytes in the 128-255 range. The beauty of this is that almost all software that works with ASCII text can work without modification with the new UTF-8 text. Of course, if you want to display UTF-8 text on the screen or print it on paper, you need software that knows how to properly decode UTF-8 and draw the right characters, but most software never needs to do this. DNS code is concerned with putting data into packets and reading it out, not with what those characters look like to humans. The little bit of user-interface code responsible for showing text on the screen has to draw UTF-8 characters correctly, but the rest of the DNS protocol code (the vast bulk of it) can simply pass the data around as raw data, unconcerned with how that data might eventually be presented to the human user.

UTF-8 is popular in the United States because it allows non-roman characters to be represented using multibyte sequences in otherwise standard ASCII files. In some places outside the USA, where most characters need to be represented using multibyte sequences, UTF-8 is less popular, and many people prefer to use 16-bit Unicode characters (UTF-16) directly. Multicast DNS adopts UTF-8 as the best way to maintain compatibility with existing ASCII names, while at the same time providing the capability to represent non-roman characters, too.
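The multibyte behavior described above can also be demonstrated concretely. The short sketch below (again in Python, chosen only for illustration) encodes a few non-roman characters and shows that each becomes a sequence of two or three bytes, every one of which falls in the 128-255 range that ASCII never uses:

```python
# Non-ASCII characters encode as multibyte sequences whose bytes
# all lie in the range 128-255, so they can never be confused with
# ASCII characters embedded in the same byte stream.
for ch in ["\u00e9", "\u20ac", "\u65e5"]:   # é, €, and the kanji 日
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), list(encoded))
    assert all(b >= 128 for b in encoded)   # no byte collides with ASCII

# é (U+00E9) -> 2 bytes: [195, 169]
# € (U+20AC) -> 3 bytes: [226, 130, 172]
# 日 (U+65E5) -> 3 bytes: [230, 151, 165]
```

This is exactly why software that merely copies bytes around, as most DNS code does, needs no changes: the multibyte sequences pass through untouched, and only display code ever has to decode them.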