Checksums
Compressed files are especially susceptible to corruption. While changing a bit from 0 to 1 or vice versa in a text file generally affects only a single character, changing a single bit in a compressed file often makes the entire file unreadable. Therefore, it's customary to store a checksum with the compressed file so that the recipient can verify that the file is intact. The zip format does this automatically, but you may wish to use manual checksums in other circumstances as well.
There are many different checksum schemes. A particularly simple example adds a parity bit to the data, typically 1 if the number of 1 bits is odd, 0 if the number of 1 bits is even. This checksum can be calculated by summing up the number of 1 bits and taking the remainder when that sum is divided by two. However, this scheme isn't very robust. It can detect single-bit errors, but in the face of bursts of errors, as often occur in transmissions over modems and other noisy connections, there's a 50-50 chance that corrupt data will be reported as correct.
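As a rough illustration of the parity scheme just described, the following sketch computes a parity bit over a byte array. It shows a single flipped bit being caught while two flipped bits slip through. The class name ParityDemo and the sample bytes are invented for this illustration.

```java
final class ParityDemo {

    // Returns 1 if the data contains an odd number of 1 bits, 0 if even.
    static int parity(byte[] data) {
        int ones = 0;
        for (byte b : data) {
            ones += Integer.bitCount(b & 0xFF); // mask to avoid sign extension
        }
        return ones % 2;
    }

    public static void main(String[] args) {
        byte[] original = {0x48, 0x69}; // six 1 bits in total
        byte[] oneFlip  = {0x49, 0x69}; // lowest bit of the first byte flipped
        byte[] twoFlips = {0x4B, 0x69}; // two lowest bits of the first byte flipped

        System.out.println(parity(original)); // 0
        System.out.println(parity(oneFlip));  // 1: the single-bit error is detected
        System.out.println(parity(twoFlips)); // 0: the two-bit error goes unnoticed
    }
}
```

Any even number of flipped bits leaves the parity unchanged, which is why a burst of errors has roughly even odds of passing the check.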
Better checksum schemes use more bits. For example, a 16-bit checksum could sum up the number of 1 bits and take the remainder modulo 65,536. This means that in the face of completely random data, there's only a 1 in 65,536 chance of corrupt data being reported as correct. This chance drops exponentially as the number of bits in the checksum increases. More mathematically sophisticated schemes can reduce the likelihood of a false positive even further.
The java.util.zip.Checksum interface defines four methods for calculating a checksum for a sequence of bytes. Implementations of this interface provide specific checksum algorithms.
public abstract void update(int b)
public abstract void update(byte[] data, int offset, int length)
public abstract long getValue( )
public abstract void reset( )
The update( ) methods calculate the initial checksum and update the checksum as more bytes are added to the sequence. As bytes are added, the checksum changes. For example, using the parity checksum algorithm described earlier, if the byte 255 (binary 11111111) were added to the sequence, the checksum would not change because an even number of 1 bits had been added. If the byte 7 (binary 00000111) were added to the sequence, the checksum's value would flip (from 1 to 0 or 0 to 1) because an odd number of 1 bits had been added to the sequence.
The getValue( ) method returns the current value of the checksum. The reset( ) method returns the checksum to its initial value. Example 10-12 shows about the simplest checksum class imaginable: one that implements the parity algorithm described here.
Example 10-12. The parity checksum
import java.util.zip.*;

public class ParityChecksum implements Checksum {

  private long checksum = 0;

  // Count the 1 bits in the low-order eight bits of b
  // and fold that count into the running parity.
  public void update(int b) {
    int numOneBits = 0;
    for (int i = 1; i < 256; i *= 2) {
      if ((b & i) != 0) numOneBits++;
    }
    checksum = (checksum + numOneBits) % 2;
  }

  public void update(byte[] data, int offset, int length) {
    for (int i = offset; i < offset + length; i++) {
      this.update(data[i]);
    }
  }

  public long getValue( ) {
    return checksum;
  }

  public void reset( ) {
    checksum = 0;
  }
}
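To see the behavior described above in action, the following sketch feeds the bytes 255 and 7 to a parity checksum and prints the value after each step. To keep the block self-contained it uses a compact anonymous stand-in for the parity checksum rather than the class from Example 10-12. The names ParityWalkthrough and newParityChecksum are hypothetical.

```java
import java.util.zip.Checksum;

final class ParityWalkthrough {

    // Compact stand-in for a parity checksum; a sketch, not the book's class.
    static Checksum newParityChecksum() {
        return new Checksum() {
            private long parity = 0;

            @Override
            public void update(int b) {
                // Only the low-order eight bits of b are counted.
                parity = (parity + Integer.bitCount(b & 0xFF)) % 2;
            }

            @Override
            public void update(byte[] data, int offset, int length) {
                for (int i = offset; i < offset + length; i++) {
                    update(data[i]);
                }
            }

            @Override
            public long getValue() {
                return parity;
            }

            @Override
            public void reset() {
                parity = 0;
            }
        };
    }

    public static void main(String[] args) {
        Checksum cs = newParityChecksum();
        cs.update(255);                     // eight 1 bits: parity unchanged
        System.out.println(cs.getValue());  // 0
        cs.update(7);                       // three 1 bits: parity flips
        System.out.println(cs.getValue());  // 1
        cs.reset();                         // back to the initial value
        System.out.println(cs.getValue());  // 0
    }
}
```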
The java.util.zip package provides two concrete implementations of the Checksum interface, CRC32 and Adler32. Both produce 32-bit checksums. The Adler-32 algorithm is not quite as reliable as CRC-32 but can be computed much faster. Both of these classes have a single no-argument constructor:
public CRC32( )
public Adler32( )
They share the same five methods, four implementing the methods of the Checksum interface and one additional update( ) method that reads an entire byte array:
public void update(int b)
public void update(byte[] data, int offset, int length)
public void update(byte[] data)
public void reset( )
public long getValue( )
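As a brief sketch of how these methods fit together, the following code runs both classes over the ASCII digits "123456789", a conventional test input for checksum algorithms. It also shows that feeding the data in two pieces gives the same result as a single call. The class name BuiltInChecksums and the helper hexChecksum are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

final class BuiltInChecksums {

    // Feeds data to the checksum in two pieces and
    // returns the resulting value as eight hex digits.
    static String hexChecksum(Checksum cs, byte[] data) {
        int half = data.length / 2;
        cs.reset();
        cs.update(data, 0, half);
        cs.update(data, half, data.length - half);
        return String.format("%08x", cs.getValue());
    }

    public static void main(String[] args) {
        byte[] digits = "123456789".getBytes(StandardCharsets.US_ASCII);

        CRC32 crc = new CRC32();
        crc.update(digits); // the whole-array overload
        System.out.println(String.format("%08x", crc.getValue())); // cbf43926

        System.out.println(hexChecksum(new CRC32(), digits));      // cbf43926
        System.out.println(hexChecksum(new Adler32(), digits));    // 091e01de
    }
}
```

Because getValue( ) returns a long, the 32-bit result is never negative, which makes it convenient to print or compare directly.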
Example 10-13, FileSummer, is a simple program that calculates and prints a CRC-32 checksum for any file. However, it's structured such that the static getCRC32( ) method can calculate a CRC-32 checksum for any stream.
Example 10-13. FileSummer
import java.io.*;
import java.util.zip.*;

public class FileSummer {

  public static void main(String[] args) throws IOException {
    // try-with-resources closes the file even if reading fails
    try (FileInputStream fin = new FileInputStream(args[0])) {
      System.out.println(args[0] + ": " + getCRC32(fin));
    }
  }

  public static long getCRC32(InputStream in) throws IOException {
    Checksum cs = new CRC32( );
    // It would be more efficient to read chunks of data
    // at a time, but this is simpler and easier to understand.
    for (int b = in.read( ); b != -1; b = in.read( )) {
      cs.update(b);
    }
    return cs.getValue( );
  }
}
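The comment in Example 10-13 notes that reading chunks of data would be more efficient. One possible version of that approach is sketched here. The class name BufferedSummer and the 8,192-byte buffer size are arbitrary choices made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

final class BufferedSummer {

    // Reads the stream a buffer at a time and updates the checksum
    // with only the bytes actually read on each pass.
    static long getCRC32(InputStream in) throws IOException {
        Checksum cs = new CRC32();
        byte[] buffer = new byte[8192]; // size is an arbitrary choice
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            cs.update(buffer, 0, bytesRead);
        }
        return cs.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] digits = "123456789".getBytes(StandardCharsets.US_ASCII);
        long crc = getCRC32(new ByteArrayInputStream(digits));
        System.out.println(Long.toHexString(crc)); // cbf43926
    }
}
```

The result is the same as reading one byte at a time; only the number of read( ) and update( ) calls changes.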
This isn't as useful as it might appear at first. Most of the time, you don't want to read the entire stream just to calculate a checksum. Instead, you want to look at the bytes of the stream as they go past on their way to some other, ultimate destination. You neither want to alter the bytes nor consume them. The CheckedInputStream and CheckedOutputStream filters allow you to do this.
10.4.1. Checked Streams
The java.util.zip.CheckedInputStream and java.util.zip.CheckedOutputStream classes keep a checksum of the data they've read or written.
public class CheckedInputStream extends FilterInputStream
public class CheckedOutputStream extends FilterOutputStream
These are filter streams, so they're constructed from an underlying stream and an object that implements the Checksum interface.
public CheckedInputStream(InputStream in, Checksum cksum)
public CheckedOutputStream(OutputStream out, Checksum cksum)
For example:
FileInputStream fin = new FileInputStream("/etc/passwd");
Checksum cksum = new CRC32( );
CheckedInputStream cin = new CheckedInputStream(fin, cksum);
The CheckedInputStream and CheckedOutputStream classes have all the usual read( ), write( ), and other methods you expect in a stream class. Externally, these methods behave exactly like those in the superclass and do not require any special treatment.
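As a minimal sketch of the output side, the following code writes data through a CheckedOutputStream exactly as it would through any other output stream. It then reads the checksum from the Checksum object that was passed to the constructor. The class name CheckedWriteDemo and the method writeAndSum are hypothetical, and an in-memory stream stands in for a real destination.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;
import java.util.zip.Checksum;

final class CheckedWriteDemo {

    // Writes data to sink through a checked stream and
    // returns the CRC-32 of everything that was written.
    static long writeAndSum(byte[] data, OutputStream sink) throws IOException {
        Checksum cksum = new CRC32();
        try (CheckedOutputStream cout = new CheckedOutputStream(sink, cksum)) {
            cout.write(data, 0, data.length); // an ordinary write; no special treatment
        }
        return cksum.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] digits = "123456789".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long crc = writeAndSum(digits, sink);
        System.out.println(Long.toHexString(crc)); // cbf43926
        System.out.println(sink.size());           // 9: the bytes pass through unaltered
    }
}
```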
Both CheckedOutputStream and CheckedInputStream have a getChecksum( ) method that returns the Checksum object for the stream. You can use this Checksum object to get the current value of the checksum for the stream.
public Checksum getChecksum( )
In both classes, getChecksum( ) returns a reference to the actual Checksum object that's being used to calculate the checksum. It is not copied first. Thus, if a separate thread is reading from or writing to the stream, the value in the checksum may change while you're working with the Checksum object. Conversely, if you invoke one of this Checksum object's update( ) methods, it affects the value of the checksum for the stream as well.
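The sharing described above can be observed directly. The following sketch, which reads from an in-memory stream, checks that getChecksum( ) hands back the very object passed to the constructor, and that resetting that object also resets the stream's checksum. The names SharedChecksumDemo and drainAndSum are invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;
import java.util.zip.Checksum;

final class SharedChecksumDemo {

    // Reads the stream to its end, discarding the bytes,
    // and returns the checksum accumulated along the way.
    static long drainAndSum(CheckedInputStream cin) throws IOException {
        byte[] buffer = new byte[4];
        while (cin.read(buffer) != -1) {
            // nothing to do; the stream updates its checksum as it reads
        }
        return cin.getChecksum().getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] digits = "123456789".getBytes(StandardCharsets.US_ASCII);
        Checksum cksum = new CRC32();
        try (CheckedInputStream cin =
                 new CheckedInputStream(new ByteArrayInputStream(digits), cksum)) {
            System.out.println(cin.getChecksum() == cksum);          // true: same object
            System.out.println(Long.toHexString(drainAndSum(cin)));  // cbf43926
            cksum.reset(); // acting on our reference affects the stream's checksum too
            System.out.println(cin.getChecksum().getValue());        // 0
        }
    }
}
```

If the checksum value must stay stable while other code continues to use the stream, copy the long returned by getValue( ) rather than holding on to the Checksum object.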