UTF-8

2017-11-03 09:05:01

UTF-8 is the preferred encoding of Unicode for most scenarios that don't require fast random indexing into a string. It has a number of nice characteristics, including robustness and compactness compared to other Unicode encodings.

UTF-8 encodes the ASCII characters in a single byte, characters between 128 and 2,047 in two bytes, other characters in the Basic Multilingual Plane (BMP) in three bytes, and characters from outside the BMP in four bytes. Java .class files use a slight variation of UTF-8 to store string literals, identifiers, and other text data in compiled byte code.
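The byte counts above can be checked against the JDK's own standard UTF-8 encoder. The following is a minimal sketch; the class name Utf8Lengths and the helper utf8Length are made up for this illustration, and the four sample characters are arbitrary picks from each range.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Returns the number of bytes standard UTF-8 needs for the given string.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, ASCII: prints 1
        System.out.println(utf8Length("\u00E9"));       // U+00E9, between 128 and 2,047: prints 2
        System.out.println(utf8Length("\u20AC"));       // U+20AC, elsewhere in the BMP: prints 3
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, outside the BMP: prints 4
    }
}
```

The last string is written as a surrogate pair because a Java char holds only 16 bits; the encoder combines the pair into one 4-byte sequence.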

To better understand UTF-8, consider a typical Unicode character from the Basic Multilingual Plane as a sequence of 16 bits:

x15 x14 x13 x12 x11 x10 x9 x8 x7 x6 x5 x4 x3 x2 x1 x0

Each ASCII character (each character between 0 and 127) has its upper nine bits equal to 0:

0 0 0 0 0 0 0 0 0 x6 x5 x4 x3 x2 x1 x0

Therefore, it's easy to encode an ASCII character as a single byte. Just drop the high-order byte:

0 x6 x5 x4 x3 x2 x1 x0

Now consider characters between 128 and 2,047. These all have their top five bits equal to 0, as shown here:

0 0 0 0 0 x10 x9 x8 x7 x6 x5 x4 x3 x2 x1 x0

These characters are encoded into two bytes, but not in the most obvious fashion. The 11 significant bits of the character are broken up like this:

First byte:  1 1 0 x10 x9 x8 x7 x6
Second byte: 1 0 x5 x4 x3 x2 x1 x0

Neither of the bytes that make up this number begins with a 0 bit. Thus, you can distinguish between bytes that are part of a 2-byte character and bytes that represent 1-byte characters (which all begin with 0).

The remaining characters in the BMP have values between 2,048 and 65,535. Any or all of the bits in these characters may take on the value of either 0 or 1. Thus, they are encoded in three bytes, like this:

First byte:  1 1 1 0 x15 x14 x13 x12
Second byte: 1 0 x11 x10 x9 x8 x7 x6
Third byte:  1 0 x5 x4 x3 x2 x1 x0
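The three layouts described so far translate directly into shifts and masks. Here is a minimal sketch of a hand-written encoder for a single BMP character; the class name BmpUtf8Encoder is hypothetical, and the code assumes the input is not a surrogate (surrogates belong to characters outside the BMP and need the 4-byte form). It is meant to illustrate the bit patterns, not to replace the JDK's encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BmpUtf8Encoder {
    // Encodes one BMP character as standard UTF-8.
    // Assumption: c is not a surrogate (0xD800 through 0xDFFF).
    public static byte[] encode(char c) {
        if (c < 0x80) {
            // 1 byte: 0 x6..x0
            return new byte[] { (byte) c };
        } else if (c < 0x800) {
            // 2 bytes: 110 x10..x6, then 10 x5..x0
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // 3 bytes: 1110 x15..x12, then 10 x11..x6, then 10 x5..x0
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        char euro = '\u20AC';
        byte[] mine = encode(euro);
        byte[] jdk = String.valueOf(euro).getBytes(StandardCharsets.UTF_8);
        // Compare the hand-rolled result with the JDK's encoder.
        System.out.println(Arrays.equals(mine, jdk)); // prints true
    }
}
```

For the euro sign (0x20AC) the encoder produces the three bytes E2 82 AC.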

Within this scheme, any byte beginning with a 0 bit must be a 1-byte ASCII character between 0 and 127. Any byte beginning with the three bits 110 must be the first byte of a 2-byte character. Any byte beginning with the four bits 1110 must be the first byte of a 3-byte character. Any byte beginning with the five bits 11110 must be the first byte of a 4-byte character from outside the BMP. Finally, any byte beginning with the two bits 10 must be the second or a later byte of a multibyte character.
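Because the role of a byte is determined entirely by its leading bits, a decoder can classify any byte in isolation. The sketch below shows one way to do that; the class name Utf8ByteKind and the returned labels are invented for this example. It also covers the 11110 prefix that standard UTF-8 uses for the first byte of a 4-byte character from outside the BMP.

```java
public class Utf8ByteKind {
    // Classifies a single byte by its leading bits.
    public static String kind(byte b) {
        int v = b & 0xFF;                               // treat the byte as unsigned
        if ((v & 0x80) == 0x00) return "single";        // 0xxxxxxx
        if ((v & 0xC0) == 0x80) return "continuation";  // 10xxxxxx
        if ((v & 0xE0) == 0xC0) return "lead2";         // 110xxxxx
        if ((v & 0xF0) == 0xE0) return "lead3";         // 1110xxxx
        if ((v & 0xF8) == 0xF0) return "lead4";         // 11110xxx
        return "invalid";                               // 11111xxx never appears in UTF-8
    }

    public static void main(String[] args) {
        // The bytes of the euro sign: one lead byte and two continuation bytes.
        byte[] euro = { (byte) 0xE2, (byte) 0x82, (byte) 0xAC };
        for (byte b : euro) {
            System.out.println(kind(b));
        }
    }
}
```

This self-describing property is what lets a reader that lands in the middle of a stream skip continuation bytes until it finds the start of the next character.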

The DataOutputStream class provides a writeUTF( ) method that encodes a string in a slight variation of UTF-8. It first writes the number of encoded bytes in the string (as a 2-byte unsigned short), followed by the UTF-8-encoded form of the string:

public final void writeUTF(String s) throws IOException
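To see the length prefix, it helps to capture the output of writeUTF( ) in memory and look at the raw bytes. This is a minimal sketch; the class name WriteUtfDemo and its write helper are hypothetical, and the IOException is wrapped only to keep the helper easy to call.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WriteUtfDemo {
    // Returns exactly the bytes that writeUTF() emits for the given string.
    public static byte[] write(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // "h" takes one byte and U+00E9 takes two, so the prefix is 00 03.
        for (byte b : write("h\u00E9")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // the line printed is: 00 03 68 C3 A9
    }
}
```

Note that the prefix counts encoded bytes, not characters: the two-character string above is recorded with a length of 3.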

The DataInputStream class provides two corresponding readUTF( ) methods to read such a string from its underlying input stream:

public final String readUTF( ) throws IOException
public static final String readUTF(DataInput in) throws IOException

Each of these first reads a 2-byte unsigned short that tells it how many more bytes to read. These bytes are then read and decoded into a Java Unicode string. An EOFException is thrown if the stream ends before all the expected bytes have been read. If the bytes read cannot be interpreted as a valid UTF-8 string, a UTFDataFormatException is thrown.
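The behavior described above can be exercised with hand-built byte arrays. The following sketch uses a hypothetical ReadUtfDemo class whose read helper returns null when the data ends early; that return convention is a choice made for this example, not part of the DataInputStream API.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ReadUtfDemo {
    // Decodes one length-prefixed string; returns null if the stream ends early.
    public static String read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (EOFException e) {
            return null; // fewer bytes than the length prefix promised
        } catch (IOException e) {
            // Includes UTFDataFormatException for malformed byte sequences.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Length prefix 3, followed by 'h' and the two bytes of U+00E9.
        byte[] complete = { 0x00, 0x03, 0x68, (byte) 0xC3, (byte) 0xA9 };
        // Same prefix, but only one of the three promised bytes is present.
        byte[] truncated = { 0x00, 0x03, 0x68 };

        System.out.println(read(complete).length()); // prints 2
        System.out.println(read(truncated));         // prints null
    }
}
```

A lone continuation byte such as 0x80 after the prefix cannot start a character, so readUTF( ) rejects it with a UTFDataFormatException, which this helper rethrows wrapped in an UncheckedIOException.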

However, DataInputStream and DataOutputStream diverge from the official UTF-8 format in two respects. First, they encode the null character (0x00) in two bytes rather than one. This makes it slightly easier for C code that expects null-terminated strings to parse Java .class files. Second, they write each character from outside the BMP as a surrogate pair, with each surrogate encoded in three bytes, for a total of six bytes rather than four. Both differences make the data written by writeUTF( ) incompatible with most other libraries. The Reader and Writer classes discussed in the next chapter read and write true UTF-8 with 1-byte nulls, and these should be preferred for almost all use cases other than parsing Java byte code.
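The incompatibility is easy to observe by encoding the same string both ways. The sketch below compares writeUTF( ) output (with the 2-byte length prefix stripped) against standard UTF-8 from String.getBytes( ); the class name NullEncodingDemo and both helper names are hypothetical. It also encodes a character from outside the BMP, which the JDK's modified UTF-8 writes as two 3-byte surrogates instead of one 4-byte sequence.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NullEncodingDemo {
    // Bytes produced by writeUTF(), with the 2-byte length prefix removed.
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Bytes produced by the standard UTF-8 charset.
    public static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\0";
        System.out.println(modifiedUtf8(nul).length); // prints 2 (bytes C0 80)
        System.out.println(standardUtf8(nul).length); // prints 1 (byte 00)

        String emoji = "\uD83D\uDE00";                  // U+1F600, outside the BMP
        System.out.println(modifiedUtf8(emoji).length); // prints 6
        System.out.println(standardUtf8(emoji).length); // prints 4
    }
}
```

For strings that contain neither nulls nor characters outside the BMP, the two encodings produce identical bytes after the prefix.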
