UTF-16

2017-11-03 09:05:01

The integers to which Unicode maps characters can be encoded in a variety of ways. The simplest approach is to write each integer as a normal big-endian 4-byte int. This encoding scheme is called UCS-4. However, it's rather inefficient because the vast majority of characters seen in practice have code points no greater than 65,535, and in English text most are no greater than 127.
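As a rough illustration of the UCS-4 idea, the following sketch writes a single code point as a big-endian 4-byte int. The class name Ucs4Sketch and the method encodeUcs4 are made up for this example; only ByteBuffer comes from the JDK.

```java
import java.nio.ByteBuffer;

public class Ucs4Sketch {
    // Hypothetical helper: writes one code point as a big-endian 4-byte int.
    static byte[] encodeUcs4(int codePoint) {
        // A freshly allocated ByteBuffer uses big-endian byte order by default.
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    public static void main(String[] args) {
        // 'A' is code point 65 (0x41), so three of the four bytes are zero.
        for (byte b : encodeUcs4('A')) {
            System.out.printf("%02X ", b);
        }
        System.out.println(); // prints "00 00 00 41 "
    }
}
```

For plain English text, three out of every four bytes written this way are zero, which is the inefficiency described above.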

In practice, most Unicode text is encoded in either UTF-16 or UTF-8. UTF-16 uses two bytes for characters with code points less than or equal to 65,535 and four bytes for characters with code points greater than 65,535. It comes in both big-endian and little-endian formats. The endianness is normally indicated by an initial byte order mark. That is, the first character in the file is the zero-width no-break space, code point 65,279 (0xFEFF in hexadecimal). In big-endian UTF-16, this is the two bytes 0xFE 0xFF. In little-endian UTF-16, the two bytes are reversed: 0xFF 0xFE.
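The byte order mark can be observed directly from Java. The sketch below encodes the same one-character string with the three standard UTF-16 charsets and prints the bytes in hexadecimal. The class name BomSketch and the hex helper are invented for this example.

```java
import java.nio.charset.StandardCharsets;

public class BomSketch {
    // Hypothetical helper: formats bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "A"; // code point 65 (0x41)
        // Java's "UTF-16" charset encodes big-endian and prepends a byte order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // The charsets with explicit endianness write no byte order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // A little-endian stream that does carry the mark starts with FF FE.
        System.out.println(hex("\uFEFFA".getBytes(StandardCharsets.UTF_16LE))); // FF FE 41 00
    }
}
```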

UTF-16 encodes characters with code points from 0 to 65,535 (the Basic Multilingual Plane, or BMP for short) as 2-byte unsigned ints. Characters from beyond the BMP are encoded as surrogate pairs made up of four bytes: first a high surrogate, then a low surrogate. The Java char data type is really a UTF-16 code unit, not a Unicode character, though the difference is significant only for characters from outside the BMP.
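To make the difference between chars and characters concrete, this small sketch compares a BMP character with one from outside the BMP. U+1D11E (the musical G clef) is chosen only as a convenient example, and the class name CharCountSketch and the describe helper are made up here.

```java
public class CharCountSketch {
    // Hypothetical helper: reports how many chars and how many code points a string holds.
    static String describe(String s) {
        return s.length() + " chars, " + s.codePointCount(0, s.length()) + " code points";
    }

    public static void main(String[] args) {
        // 'A' is in the BMP, so one char holds the whole character.
        System.out.println(describe("A")); // 1 chars, 1 code points

        // U+1D11E lies outside the BMP and needs a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(describe(clef)); // 2 chars, 1 code points
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true
    }
}
```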

To see how this works, consider a character from outside the BMP in a typical UCS-4 (4-byte) big-endian representation. This is composed of four bytes of eight bits each. I will label the bits as x0 through x31:

Byte 1: x31 x30 x29 x28 x27 x26 x25 x24
Byte 2: x23 x22 x21 x20 x19 x18 x17 x16
Byte 3: x15 x14 x13 x12 x11 x10 x9 x8
Byte 4: x7 x6 x5 x4 x3 x2 x1 x0

In reality, the high-order byte is always 0. The first three bits of the second byte are also always 0, so these don't need to be encoded. Only bits x0 through x20 need to be encoded. These are encoded in four bytes, like this:

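High surrogate, byte 1: 1 1 0 1 1 0 w1 w2
High surrogate, byte 2: w3 w4 x15 x14 x13 x12 x11 x10
Low surrogate, byte 1: 1 1 0 1 1 1 x9 x8
Low surrogate, byte 2: x7 x6 x5 x4 x3 x2 x1 x0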

Here, w1w2w3w4 is the 4-bit number formed by subtracting 1 from the 5-bit number x20x19x18x17x16. There are simple, efficient algorithms for breaking up non-BMP characters into these surrogate pairs and recomposing them. Most of the time you'll let the Reader and Writer classes do this for you automatically. The main thing you need to remember is that a Java char is really a UTF-16 code unit, and while in the vast majority of cases this is the same as one Unicode character, there are cases where it takes two chars to make a single character.
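One way to sketch the splitting and recomposing with plain bit operations is shown below. Subtracting 0x10000 from the code point is the same as subtracting 1 from the 5-bit number x20x19x18x17x16, which leaves a 20-bit value that is divided between the two surrogates. The class name SurrogatePairSketch and the methods toSurrogates and fromSurrogates are hypothetical; the JDK's own Character.toChars is used only as a cross-check.

```java
import java.util.Arrays;

public class SurrogatePairSketch {
    // Hypothetical helper: splits a non-BMP code point into {high surrogate, low surrogate}.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a code point outside the BMP");
        }
        int offset = codePoint - 0x10000;               // 20 bits: w1..w4 followed by x15..x0
        char high = (char) (0xD800 | (offset >> 10));   // 110110, then w1..w4 and x15..x10
        char low = (char) (0xDC00 | (offset & 0x3FF));  // 110111, then x9..x0
        return new char[] {high, low};
    }

    // Hypothetical helper: recombines a surrogate pair into the original code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // musical G clef, used only as an example
        char[] pair = toSurrogates(clef);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E
        System.out.println(fromSurrogates(pair[0], pair[1]) == clef);   // true
        // The result agrees with the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(clef))); // true
    }
}
```

In ordinary code, Character.toChars and Character.toCodePoint are probably the better choice; the manual version is here only to show where each bit ends up.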
