UTF-16
The integers to which Unicode maps characters can be encoded in a variety of ways. The simplest approach is to write each integer as a normal big-endian 4-byte int. This encoding scheme is called UCS-4. However, it's rather inefficient because the vast majority of characters seen in practice have code points no greater than 65,535, and in English text most are no greater than 127.
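As a rough illustration of the 4-byte approach, the following sketch writes a single code point as a big-endian int. The class name Ucs4Sketch and the method encodeUcs4 are made up for this example; only ByteBuffer comes from the standard library.

```java
import java.nio.ByteBuffer;

final class Ucs4Sketch {
    // Hypothetical helper: writes one code point as four bytes,
    // most significant byte first.
    static byte[] encodeUcs4(int codePoint) {
        // A freshly allocated ByteBuffer uses big-endian byte order.
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    public static void main(String[] args) {
        // The letter A has code point 65 (0x41); three of its four bytes are zero.
        for (byte b : encodeUcs4('A')) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

For the letter A, three of the four bytes are zero, which is the inefficiency described above.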
In practice, most Unicode text is encoded in either UTF-16 or UTF-8. UTF-16 uses two bytes for characters with code points less than or equal to 65,535 and four bytes for characters with code points greater than 65,535. It comes in both big-endian and little-endian formats. The endianness is normally indicated by an initial byte order mark. That is, the first character in the file is the zero-width nonbreaking space, code point 65,279 (0xFEFF in hexadecimal). In big-endian UTF-16, this is written as the two bytes 0xFE 0xFF. In little-endian UTF-16, the bytes are reversed: 0xFF 0xFE.
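A minimal sketch of checking for the byte order mark might look like the following. The class BomSketch and the method detectUtf16ByteOrder are hypothetical names chosen for this example, and the check looks only at the first two bytes, so it assumes the data is already known to be UTF-16.

```java
import java.nio.charset.StandardCharsets;

final class BomSketch {
    // Hypothetical helper: returns "UTF-16BE", "UTF-16LE", or null
    // when the data does not start with a UTF-16 byte order mark.
    static String detectUtf16ByteOrder(byte[] data) {
        if (data.length < 2) {
            return null;
        }
        int b0 = data[0] & 0xFF; // mask to treat the byte as unsigned
        int b1 = data[1] & 0xFF;
        if (b0 == 0xFE && b1 == 0xFF) {
            return "UTF-16BE";
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        // Java's "UTF-16" charset writes a big-endian byte order mark when encoding.
        byte[] withMark = "A".getBytes(StandardCharsets.UTF_16);
        System.out.println(detectUtf16ByteOrder(withMark)); // UTF-16BE

        // The explicit UTF-16LE charset writes no mark at all.
        byte[] withoutMark = "A".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(detectUtf16ByteOrder(withoutMark)); // null
    }
}
```

When the mark is absent, the reader has to learn the byte order some other way, such as from metadata or from an explicitly named charset.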
UTF-16 encodes characters with code points from 0 to 65,535 (the Basic Multilingual Plane, or BMP for short) as 2-byte unsigned ints. Characters from beyond the BMP are encoded as surrogate pairs made up of four bytes: first a high surrogate, then a low surrogate. The Java char data type is really a 16-bit UTF-16 code unit, not a Unicode character, though the difference is significant only for characters from outside the BMP.
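To make the distinction between chars and characters concrete, here is a small sketch that counts both for a single character from outside the BMP. The class CharCountSketch and its helper methods are invented for this example; the String and Character methods are standard.

```java
final class CharCountSketch {
    // Hypothetical helper: number of Java chars needed to hold one code point.
    static int charCount(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    // Hypothetical helper: number of Unicode characters in the same string.
    static int characterCount(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // musical symbol G clef, outside the BMP
        System.out.println(charCount(clef));      // 2
        System.out.println(characterCount(clef)); // 1

        String s = new String(Character.toChars(clef));
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```

For any character inside the BMP, both counts are 1, which is why the difference usually goes unnoticed.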
To see how this works, consider a character from outside the BMP in a typical UCS-4 (4-byte) big-endian representation. This is composed of four bytes of eight bits each. I will label the bits as x0 through x31:
Byte 1: x31 x30 x29 x28 x27 x26 x25 x24
Byte 2: x23 x22 x21 x20 x19 x18 x17 x16
Byte 3: x15 x14 x13 x12 x11 x10 x9 x8
Byte 4: x7 x6 x5 x4 x3 x2 x1 x0
In reality, the high-order byte is always 0. The first three bits of the second byte are also always 0, so these don't need to be encoded. Only bits x0 through x20 need to be encoded. These are encoded in four bytes, like this:
High surrogate, byte 1: 1 1 0 1 1 0 w1 w2
High surrogate, byte 2: w3 w4 x15 x14 x13 x12 x11 x10
Low surrogate, byte 1: 1 1 0 1 1 1 x9 x8
Low surrogate, byte 2: x7 x6 x5 x4 x3 x2 x1 x0
Here, w1w2w3w4 is the 4-bit number formed by subtracting 1 from the 5-bit number x20x19x18x17x16. There are simple, efficient algorithms for breaking up non-BMP characters into these surrogate pairs and recomposing them. Most of the time you'll let the Reader and Writer classes do this for you automatically. The main thing you need to remember is that a Java char is really a UTF-16 code unit, and while 99% of the time this is the same as one Unicode character, there are cases where it takes two chars to make a single character.
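The bit layout described above can be sketched directly with shifts and masks. This is one possible rendering, not necessarily how the class library does it internally; the class SurrogateSketch and its methods toSurrogates and fromSurrogates are hypothetical names, and the result is compared against the standard Character.toChars method as a sanity check.

```java
final class SurrogateSketch {
    // Hypothetical helper: splits a code point above 0xFFFF into
    // {high surrogate, low surrogate}.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a character from outside the BMP");
        }
        int top5 = codePoint >>> 16;           // x20..x16
        int w = top5 - 1;                      // w1..w4, fits in four bits
        int mid6 = (codePoint >>> 10) & 0x3F;  // x15..x10
        int low10 = codePoint & 0x3FF;         // x9..x0
        char high = (char) (0xD800 | (w << 6) | mid6); // prefix 110110
        char low = (char) (0xDC00 | low10);            // prefix 110111
        return new char[] { high, low };
    }

    // Hypothetical helper: recombines a surrogate pair into a code point.
    // Assumes the arguments really are a high and a low surrogate.
    static int fromSurrogates(char high, char low) {
        int w = (high >>> 6) & 0xF;
        int mid6 = high & 0x3F;
        int low10 = low & 0x3FF;
        return ((w + 1) << 16) | (mid6 << 10) | low10;
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // musical symbol G clef
        char[] pair = toSurrogates(clef);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E

        // The standard library should produce the same pair.
        char[] expected = Character.toChars(clef);
        System.out.println(pair[0] == expected[0] && pair[1] == expected[1]); // true

        System.out.printf("%X%n", fromSurrogates(pair[0], pair[1])); // 1D11E
    }
}
```

In ordinary code, Character.toChars and Character.toCodePoint already perform these conversions, so a hand-written version like this is mainly useful for seeing where each bit ends up.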