Strings and chars

Because of the difficulties caused by different character sets, reading and writing text is one of the trickiest things you can do with streams. Most of the time, text should be handled with readers and writers, a subject we'll take up in Chapter 20. However, the DataInputStream and DataOutputStream classes do provide methods a Java program can use to read and write text that another Java program will understand. The text format used is a modified form of Unicode's UTF-8 encoding. It's unlikely that other, non-Java programs will understand this format.

This variant form of UTF-8 is intended for string literals embedded in compiled byte code and serialized Java objects and for communication between two Java programs. It is not intended for reading and writing arbitrary UTF-8 text. To read standard UTF-8, you should use an InputStreamReader; to write it, you should use an OutputStreamWriter.
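As a rough illustration of the reader/writer approach for standard UTF-8, the following sketch round-trips a string through an OutputStreamWriter and an InputStreamReader. The class and method names (StandardUtf8Demo, encode, decode) are made up for this example, and it assumes Java 7 or later for StandardCharsets and try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of java.io.
class StandardUtf8Demo {

    // Encodes text as standard UTF-8 by wrapping the byte stream in a writer.
    static byte[] encode(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write(text);
        }
        return bytes.toByteArray();
    }

    // Decodes standard UTF-8 by wrapping the byte stream in a reader.
    static String decode(byte[] data) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "na\u00EFve \u20AC";
        byte[] data = encode(text);
        // No length prefix is written; the bytes are just the encoded text.
        System.out.println(data.length + " bytes");            // 10 bytes
        System.out.println(text.equals(decode(data)));         // true
    }
}
```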

8.6.1. Writing Text

The DataOutputStream class has four methods that convert text into bytes and write them onto the underlying stream:

public final void writeChar(int c) throws IOException
public final void writeChars(String s) throws IOException
public final void writeBytes(String s) throws IOException
public final void writeUTF(String s) throws IOException

The writeChar( ) method writes a single Java char. This method does not use UTF-8. It simply writes the two bytes of the char (i.e., one UTF-16 code unit) in big-endian order. writeChars( ) writes each character in the String argument to the underlying output stream as a 2-byte char. The writeBytes( ) method writes only the low-order byte of each character in the String argument to the underlying output stream. Any information in the high-order byte is lost. In other words, it assumes the string contains only characters whose values are between 0 and 255.
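To make the difference between these three methods concrete, here is a small sketch that captures their output in a ByteArrayOutputStream and prints it as hex. The class WriteTextDemo and its helper methods are invented for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical demo class; only DataOutputStream's methods come from java.io.
class WriteTextDemo {

    // Bytes produced by writeChar(): two bytes, high-order byte first.
    static byte[] viaWriteChar(char c) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeChar(c);
        out.flush();
        return bytes.toByteArray();
    }

    // Bytes produced by writeChars(): two bytes per char, no length prefix.
    static byte[] viaWriteChars(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeChars(s);
        out.flush();
        return bytes.toByteArray();
    }

    // Bytes produced by writeBytes(): the low-order byte of each char only.
    static byte[] viaWriteBytes(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeBytes(s);
        out.flush();
        return bytes.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // U+20AC is the euro sign; its high-order byte is 0x20.
        System.out.println(hex(viaWriteChar('\u20AC')));    // 20 AC
        System.out.println(hex(viaWriteChars("A\u20AC")));  // 00 41 20 AC
        System.out.println(hex(viaWriteBytes("A\u20AC")));  // 41 AC
    }
}
```

In the last line, the euro sign's high-order byte (0x20) is silently dropped, which is why writeBytes( ) is safe only for characters in the range 0 to 255.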

The writeUTF( ) method, however, retains the information in the high-order byte as well as the length of the string. First it writes the number of bytes in the encoded string onto the underlying output stream as a 2-byte unsigned short between 0 and 65,535. Next it encodes the string in Java's modified UTF-8 and writes the bytes of the encoded string to the underlying output stream. This allows a data input stream reading those bytes to completely reconstruct the string. However, if you pass writeUTF( ) a string whose encoded form is longer than 65,535 bytes, it throws a java.io.UTFDataFormatException, which is a subclass of IOException, and doesn't write any of the data. Because non-ASCII characters take two or three bytes each, a string can exceed this limit with far fewer than 65,535 characters. For large blocks of text, you should use a Writer rather than a DataOutputStream. DataOutputStream is intended for files containing mixed binary and text data, not for those composed purely of text content, such as XML documents.
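The sketch below shows what writeUTF( ) actually puts on the stream and where the size limit bites. The class WriteUtfDemo and its helpers are hypothetical. Note that the 2-byte prefix in the output counts encoded bytes, and that the null character is written as the two bytes C0 80, one of the ways this format departs from standard UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

// Hypothetical demo class; only writeUTF() itself comes from java.io.
class WriteUtfDemo {

    // Returns exactly what writeUTF() puts on the stream for s.
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(s);
        out.flush();
        return bytes.toByteArray();
    }

    // Returns true if writeUTF() rejects s as too long once encoded.
    static boolean isTooLong(String s) throws IOException {
        try {
            encode(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        }
    }

    // Builds a string of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) throws IOException {
        // Prefix 00 03, then 'A' (41), then U+0000 as C0 80.
        System.out.println(Arrays.toString(encode("A\u0000"))); // [0, 3, 65, -64, -128]

        System.out.println(isTooLong(repeat('a', 65535)));       // false: 65,535 bytes
        System.out.println(isTooLong(repeat('a', 65536)));       // true: 65,536 bytes
        // Only 30,000 chars, but each euro sign takes 3 bytes (90,000 total).
        System.out.println(isTooLong(repeat('\u20AC', 30000)));  // true
    }
}
```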

8.6.2. Reading Text

The DataInputStream class has three methods to read text data:

public final char readChar( ) throws IOException
public final String readUTF( ) throws IOException
public static final String readUTF(DataInput in) throws IOException

The readChar( ) method reads two bytes from the underlying input stream and interprets them as a big-endian Java char. It throws an IOException if the underlying input stream's read( ) method throws an IOException. It throws an EOFException if there's only one byte left in the stream and therefore a complete char can't be read.
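A minimal sketch of this behavior, assuming an in-memory stream with an odd number of bytes: the loop reads complete chars until readChar( ) signals the end with an EOFException, and the stray final byte is discarded. ReadCharDemo and readAllChars are names invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical demo class; only readChar() itself comes from java.io.
class ReadCharDemo {

    // Reads chars until the stream can no longer supply a full two-byte pair.
    static String readAllChars(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        StringBuilder sb = new StringBuilder();
        try {
            while (true) {
                sb.append(in.readChar());
            }
        } catch (EOFException e) {
            // Reached either a clean end of stream or a dangling single byte.
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'A' (00 41), euro sign (20 AC), then one stray byte (42).
        byte[] data = {0x00, 0x41, 0x20, (byte) 0xAC, 0x42};
        System.out.println(readAllChars(data).length()); // 2
    }
}
```

Because an EOFException cannot distinguish a clean end from a truncated char, a real program that cares about the difference would need to track the byte count itself.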

The no-args readUTF( ) method reads the length of the string and then reads and returns a string that was written in Java's pseudo-UTF-8 encoding with a 2-byte, unsigned length prefix (in other words, a string written by writeUTF( ) in DataOutputStream). This method throws an EOFException if the stream runs out of data before providing the promised number of bytes. It throws a UTFDataFormatException if the bytes read are not valid UTF-8; for example, if 4 bytes in a row begin with the bit sequence 10. And, of course, it will propagate any IOException thrown by the underlying stream.

Finally, the static readUTF( ) method reads a UTF string from any DataInput object. It also expects Java's pseudo-UTF-8 format and is not suitable for general purpose text reading.
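The following sketch exercises both forms of readUTF( ) against data produced by writeUTF( ), and also feeds the method a deliberately malformed sequence. ReadUtfDemo and its helper methods are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

// Hypothetical demo class; readUTF() and writeUTF() come from java.io.
class ReadUtfDemo {

    // Writes each string with writeUTF() and returns the resulting bytes.
    static byte[] write(String... strings) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String s : strings) {
            out.writeUTF(s);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Reads two strings: the first with the instance method, the second
    // with the static method, which accepts any DataInput.
    static String[] readTwo(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        String first = in.readUTF();
        DataInput generic = in;
        String second = DataInputStream.readUTF(generic);
        return new String[] {first, second};
    }

    // Returns true if readUTF() rejects the bytes as malformed.
    static boolean isMalformed(byte[] data) throws IOException {
        try {
            new DataInputStream(new ByteArrayInputStream(data)).readUTF();
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        String[] result = readTwo(write("caf\u00E9", "\u20AC5"));
        System.out.println(result[0].equals("caf\u00E9")); // true
        System.out.println(result[1].equals("\u20AC5"));   // true

        // Length prefix of 1, followed by a lone byte starting with bits 10.
        System.out.println(isMalformed(new byte[] {0x00, 0x01, (byte) 0x80})); // true
    }
}
```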

8.6.3. The Deprecated readLine( ) Method

The DataInputStream class also has a commonly used but deprecated readLine( ) method:

public final String readLine( ) throws IOException

This method reads a single line of text from the underlying input stream and returns it as a string. A line of text is considered to be any number of characters, followed by a carriage return, a linefeed, or a carriage return/linefeed pair. The line terminator (possibly including both a carriage return and a linefeed) is read; however, it is not included in the string returned by readLine( ). The problem with readLine( ) is that it does not properly convert bytes to characters: it treats each byte as a Latin-1 character, so text in any other character set is garbled. Use BufferedReader's readLine( ) method instead. readLine( ) also has a nasty bug involving streams that end with carriage returns that can cause a program to hang indefinitely when reading data from a network connection.
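As a sketch of the recommended replacement, the code below reads lines through a BufferedReader chained to an InputStreamReader with an explicit character set. ReadLinesDemo and readLines are names invented here, and the example assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; BufferedReader.readLine() comes from java.io.
class ReadLinesDemo {

    // Reads every line from the stream, decoding bytes with the given charset.
    static List<String> readLines(InputStream in, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            // readLine() returns null at end of stream and strips the terminator.
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Mixed terminators: CR/LF, bare CR, and bare LF.
        byte[] data = "one\r\ntwo\rthree\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLines(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
        // [one, two, three]
    }
}
```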
