Encoding Heuristics

There's no 100% guaranteed way to determine the encoding of an arbitrary file or stream. However, you can make some reasonable guesses that are likely to be more correct than not.

Unicode files are especially easy to detect because most such files begin with a byte order mark. This is the Unicode zero-width nonbreaking space character, code point U+FEFF. The byte-swapped value 0xFFFE is never a legal Unicode character. Furthermore, the single bytes 0xFE and 0xFF are uncommon in most single-byte encodings such as Latin-1 and MacRoman, unlikely to occur in sequence, and unlikely to occur at the beginning of a stream. Therefore, a stream that begins with the two bytes 0xFE and 0xFF in that order is almost certainly encoded in big-endian UTF-16. A stream that starts with the opposite order (0xFF and 0xFE) is almost certainly encoded in little-endian UTF-16.
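
Here's a minimal sketch of that check in Java (the class and method names are mine; the stream must support mark() and reset(), as a BufferedInputStream does, so the bytes can be pushed back if no byte order mark is found):

import java.io.IOException;
import java.io.InputStream;

public class BOMSniffer {
    // Returns "UTF-16BE", "UTF-16LE", or null if no UTF-16 byte order
    // mark is present. The stream must support mark()/reset().
    public static String sniffUTF16(InputStream in) throws IOException {
        in.mark(2);
        int first = in.read();
        int second = in.read();
        if (first == 0xFE && second == 0xFF) return "UTF-16BE"; // U+FEFF, big-endian
        if (first == 0xFF && second == 0xFE) return "UTF-16LE"; // U+FEFF, little-endian
        in.reset(); // no BOM found; rewind to the start of the stream
        return null;
    }
}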

In UTF-8, the zero-width nonbreaking space is represented by the three bytes 0xEF 0xBB 0xBF, always in that order. Thus, any file that begins with these three bytes is almost certainly UTF-8. Not all UTF-8 files begin with a byte order mark, however. Fortunately, UTF-8 is a very picky standard, and it's unlikely that a non-UTF-8 file will accidentally parse correctly as UTF-8. If you think a file might be UTF-8, try reading a few hundred characters from it as UTF-8. If no exceptions are thrown, chances are very good it is UTF-8.
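
Here's a sketch of that trial decode, assuming you've already read a sample of the stream into a byte array (the class and method names are mine). The explicit REPORT settings force a malformed sequence to raise an exception instead of being quietly replaced, as convenience methods such as new String(byte[], Charset) would do:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class UTF8Check {
    // Returns true if the sample decodes cleanly as UTF-8. A sample cut
    // off in the middle of a multibyte character can cause a false negative.
    public static boolean looksLikeUTF8(byte[] sample) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(sample));
            return true;  // decoded cleanly; very likely UTF-8
        } catch (CharacterCodingException ex) {
            return false; // malformed byte sequence; not UTF-8
        }
    }
}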

Some other encodings of Unicode, such as UTF-32, can also be detected by inspecting the byte order mark. However, these are mostly of theoretical interest. I've never encountered one in the wild.

If a file isn't Unicode, life is tougher. Most single-byte character sets are supersets of ASCII, so even if you guess wrong, the majority of the text is likely to come through unchanged. Latin-1 misread as MacRoman, or vice versa, isn't pretty, but it is intelligible in most cases.

If you have some idea of the file type, there may be other ways to guess the encoding. For instance, all XML documents that are not written in Unicode must begin with an XML declaration that includes an encoding declaration:
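
<?xml version="1.0" encoding="ISO-8859-1"?>

(The encoding name here is just an example; the declaration names whatever character set the document actually uses.)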

Other than Unicode and EBCDIC, most character sets are supersets of ASCII, so you can assume the encoding is ASCII, read far enough into the stream to find the encoding declaration, then back up and reread the document with the correct encoding. To detect Unicode, look for a byte order mark. To detect EBCDIC-encoded XML, look for the initial four bytes 0x4C 0x6F 0xA7 0x94, in that order. This is "<?xm" in EBCDIC.
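
Pulling those checks together, a first-pass sniffer might look like this sketch (the names are mine; it only distinguishes the cases above, and for the ASCII-family case the caller still has to read on to the encoding declaration):

import java.io.IOException;
import java.io.InputStream;

public class XMLSniffer {
    // Makes a provisional guess from the first four bytes. The stream
    // must support mark()/reset() so no bytes are consumed.
    public static String sniff(InputStream in) throws IOException {
        in.mark(4);
        int b1 = in.read(), b2 = in.read(), b3 = in.read(), b4 = in.read();
        in.reset();
        if (b1 == 0xFE && b2 == 0xFF) return "UTF-16BE";
        if (b1 == 0xFF && b2 == 0xFE) return "UTF-16LE";
        if (b1 == 0xEF && b2 == 0xBB && b3 == 0xBF) return "UTF-8";
        if (b1 == 0x4C && b2 == 0x6F && b3 == 0xA7 && b4 == 0x94) {
            // These bytes spell "<?xm" in EBCDIC; a real implementation
            // would then pick a specific EBCDIC charset such as Cp037.
            return "EBCDIC";
        }
        return "US-ASCII"; // provisional; read the encoding declaration next
    }
}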

HTML is similar. You treat the file as ASCII or EBCDIC just long enough to read the encoding meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

However, unlike XML, HTML is case-insensitive, so you also need to look for variants like this:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

Either way, once you've found the meta element and know the encoding, you back up and start over. (mark() and reset() are very helpful here.)
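
For instance, here's a sketch of that dance. findCharsetInMetaTag() is a hypothetical helper, and the 8,192-byte buffer is an arbitrary guess at how far in the meta tag might appear:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class HTMLOpener {
    public static Reader openHTML(InputStream rawStream) throws IOException {
        BufferedInputStream in = new BufferedInputStream(rawStream);
        in.mark(8192); // remember the start; 8192 bytes is an arbitrary guess
        byte[] head = new byte[8192];
        int length = Math.max(in.read(head), 0);
        String ascii = new String(head, 0, length, StandardCharsets.US_ASCII);
        String encoding = findCharsetInMetaTag(ascii); // hypothetical helper
        in.reset(); // back up to the start of the document
        return new InputStreamReader(in,
            encoding != null ? encoding : "ISO-8859-1");
    }

    // Hypothetical helper: would scan the ASCII text for a meta tag and
    // return its charset name, or null if none is found. Logic omitted.
    private static String findCharsetInMetaTag(String html) {
        return null;
    }
}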

Sometimes there's metadata outside the file or stream that can help you. For instance, HTTP servers normally send a Content-type header that may include a charset parameter like this one:

Content-type: text/html; charset=sjis

If there's no explicit charset parameter, the protocol itself may give you enough information. For instance, HTTP specifies that all text/* documents are assumed to be Latin-1 (ISO-8859-1) unless another encoding is explicitly specified.
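
Extracting that parameter is simple string work. Here's a sketch (the names are mine; it ignores quoted parameter values, which a production parser would need to handle):

public class ContentTypeCharset {
    // Returns the charset parameter from a Content-type value, or HTTP's
    // Latin-1 default for text/* types when none is given.
    // e.g. charsetOf("text/html; charset=sjis") returns "sjis"
    public static String charsetOf(String contentType) {
        for (String param : contentType.split(";")) {
            param = param.trim();
            if (param.toLowerCase().startsWith("charset=")) {
                return param.substring("charset=".length()).trim();
            }
        }
        return "ISO-8859-1"; // HTTP default for text/* when no charset is given
    }
}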

Following these rules along with a smattering of local knowledge will probably suffice most of the time. If it's not enough, there are still more sophisticated tricks you can try. For instance, you can spellcheck a document in a variety of encodings and see which one generates the fewest errors for words containing non-ASCII characters. Of course, this requires you to know or make a reasonable guess at the language. That too can be done based on the stream contents if necessary. Honestly, though, very few programs need to make this level of effort.
