Guessing a Document's Encoding

Credit: Mauro Cicio

Problem

You want to know the character encoding of a document that doesn't declare it explicitly.

Solution

Use the Ruby bindings to the libcharguess library. Once it's installed, using libcharguess is very simple.

Here's an XML document written in Italian, with no explicit encoding:

doc = %{

spaghetti al ragù frappè

}

Let's find its encoding:

require 'charguess'
CharGuess::guess doc # => "windows-1252"

This is a pretty good guess: the XML is written in the ISO-8859-1 encoding, and many web browsers treat ISO-8859-1 as Windows-1252.
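Once you have a guess, the usual next step is to convert the document to a known encoding. The following is a minimal sketch of that step, assuming Ruby 2.0 or later (for String#b and the built-in transcoding support). The to_utf8 helper is hypothetical and not part of charguess, and it assumes the guessed name is one that Ruby's Encoding.find recognizes.

```ruby
# Hypothetical helper (not part of charguess): label the raw bytes with the
# guessed encoding, then transcode them to UTF-8.
def to_utf8(raw, guessed_name)
  encoding = Encoding.find(guessed_name) # raises ArgumentError for unknown names
  raw.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end

# In ISO-8859-1 and Windows-1252, the letter u with a grave accent is the byte 0xF9.
latin1_bytes = "spaghetti al rag\xF9".b
utf8_text = to_utf8(latin1_bytes, "windows-1252")
puts utf8_text # prints "spaghetti al ragù"
```

If the guess names an encoding Ruby doesn't know, Encoding.find raises an ArgumentError, so real code would probably want to rescue that and fall back to a default.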

Discussion

In XML, the character-encoding indication is optional, and may be provided as an attribute of the XML declaration in the first line of the document:

<?xml version="1.0" encoding="utf-8"?>

If this is missing, you must guess the document encoding to process the document. You can assume the lowest common denominator for your community (usually this means assuming that everything is either UTF-8 or ISO-8859-1), or you can use a library that examines the document and uses heuristics to guess the encoding.
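The "lowest common denominator" strategy can be sketched in plain Ruby without any guessing library. This is only an illustration, assuming Ruby 2.0 or later. The fallback_encoding helper and its regular expression are hypothetical: the helper honors a declared encoding if one is present, then checks whether the bytes are valid UTF-8, and otherwise assumes ISO-8859-1.

```ruby
# Hypothetical helper: pick an encoding name for the raw bytes of an XML document.
def fallback_encoding(raw)
  bytes = raw.b # work on a binary copy so invalid bytes can't raise errors

  # 1. Honor an encoding attribute in the XML declaration, if there is one.
  if bytes =~ /\A<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']/
    return Regexp.last_match(1)
  end

  # 2. Otherwise, check whether the bytes form a valid UTF-8 sequence.
  return "UTF-8" if bytes.force_encoding(Encoding::UTF_8).valid_encoding?

  # 3. Every byte sequence is valid ISO-8859-1, so it serves as the last resort.
  "ISO-8859-1"
end

puts fallback_encoding(%{<?xml version="1.0" encoding="Shift_JIS"?><a/>})
puts fallback_encoding("rag\u00F9") # u-grave as two UTF-8 bytes
puts fallback_encoding("rag\xF9")   # u-grave as one Latin-1 byte
```

Note that pure ASCII text is reported as UTF-8, which is harmless because ASCII is a subset of both encodings. A check like this can't tell ISO-8859-1 apart from other single-byte encodings, and that is the gap a heuristic library is meant to fill.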

As of the time of writing, there are no pure Ruby libraries for guessing the encoding of a document. Fortunately, there is a small Ruby wrapper around the Charguess library. This library can guess with 95% accuracy the encoding of any text whose charset is one of the following: BIG5, HZ, JIS, SJIS, EUC-JP, EUC-KR, EUC-TW, GB2312, Bulgarian, Cyrillic, Greek, Hungarian, Thai, Latin1, and UTF8.

Note that Charguess is not XML- or HTML-specific. In fact, it can guess the encoding of an arbitrary string:

CharGuess::guess("\xA4\xCF") # => "EUC-JP"
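Because the bindings have to be compiled separately, a script may want to keep working on machines where they are missing. Here is one possible sketch of that. The guess_encoding helper and its default are hypothetical, and the handling of a nil or empty result is an assumption, since what CharGuess::guess returns when it can't decide isn't described here.

```ruby
# Load the bindings if they are installed; carry on without them otherwise.
begin
  require 'charguess'
rescue LoadError
  # charguess is not available on this machine
end

# Hypothetical helper: ask CharGuess for an encoding name, or use a default.
def guess_encoding(text, default = "ISO-8859-1")
  return default unless defined?(CharGuess)

  guess = CharGuess::guess(text)
  # Assumption: a failed guess comes back as nil or an empty string.
  guess.nil? || guess.to_s.empty? ? default : guess.to_s
end

puts guess_encoding("\xA4\xCF")
```

Without the bindings this simply prints the default. With them installed, it prints whatever CharGuess reports.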

It's fairly easy to install libcharguess, since the library is written in portable C++. Unfortunately, it doesn't take care to put its header files in a standard location. This makes it a little tricky to compile the Ruby bindings, which depend on the charguess.h header. When you run extconf.rb to prepare the bindings, you must explicitly tell the script where to find libcharguess's headers. Here's how you might compile the Ruby bindings to libcharguess:

$ ruby extconf.rb --with-charguess-include=/location/of/charguess.h
$ make
$ make install

See Also
