Converting from One Encoding to Another

Credit: Mauro Cicio

Problem

You want to convert a document to a given charset encoding (probably UTF-8).

Solution

If you don't know the document's current encoding, you can guess at it using the Charguess library described in the previous recipe. Once you know the current encoding, you can convert the document to another encoding using Ruby's standard iconv library.

Here's an XML document written in Italian, with no explicit encoding:

doc = %{

spaghetti al ragù frappè

}

Let's figure out its encoding and convert it to UTF-8:

require 'iconv'
require 'charguess' # not necessary if input encoding is known

input_encoding = CharGuess::guess doc # => "windows-1252"
output_encoding = 'utf-8'

converted_doc = Iconv.new(output_encoding, input_encoding).iconv(doc)
CharGuess::guess(converted_doc) # => "UTF-8"

Discussion

The heart of the iconv library is the Iconv class, a wrapper for the Unix 95 iconv() family of functions. These functions translate strings between various encoding systems. Since iconv is part of the Ruby standard library, it should already be available on your system. (Ruby 2.0 and later no longer bundle it; on those versions it is installed separately as the iconv gem.)
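On a Ruby that does not ship iconv, the built-in String#encode method covers the same ground. What follows is only a minimal sketch of that substitute, not the recipe's own method: it assumes the input is already known to be Windows-1252, and the sample bytes are made up for illustration.

```ruby
# Raw bytes for "ragù frappè" as Windows-1252 (ù = 0xF9, è = 0xE8).
# String#b gives a binary copy, standing in for bytes read from a file.
raw = "rag\xF9 frapp\xE8".b

# Label the bytes with their actual encoding, then transcode to UTF-8.
converted = raw.force_encoding('Windows-1252').encode('UTF-8')

converted.encoding.name   # => "UTF-8"
converted.valid_encoding? # => true
```

The force_encoding call changes only the label on the string; encode is the step that rewrites the bytes.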

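Iconv works well in conjunction with Charguess: even if Charguess guesses the encoding a little bit wrong (such as guessing Windows-1252 for an ISO-8859-1 document), the guess is usually close enough that iconv can still convert the document to another encoding. Those two encodings differ only in the byte range 0x80-0x9F, which ISO-8859-1 reserves for control characters, so ordinary accented text converts identically under either one.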
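To see why a near-miss guess is often harmless, here is a small sketch that decodes the same bytes under both labels. It uses Ruby's built-in String#encode in place of Iconv (an assumption made so the example runs without extra libraries), and the sample bytes are illustrative.

```ruby
# "spaghetti al ragù" with ù stored as the single byte 0xF9.
raw = "spaghetti al rag\xF9".b

as_cp1252 = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
as_latin1 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Accented letters sit at the same code points in both encodings.
same_text = (as_cp1252 == as_latin1) # => true

# The encodings disagree only in 0x80-0x9F. Byte 0x93 is a curly
# quote in Windows-1252 but a control character in ISO-8859-1.
quote_cp1252 = "\x93".b.force_encoding('Windows-1252').encode('UTF-8')
quote_latin1 = "\x93".b.force_encoding('ISO-8859-1').encode('UTF-8')
same_quote = (quote_cp1252 == quote_latin1) # => false
```

So the guess matters only for documents that contain characters such as curly quotes, which live in that disputed range.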

Like Charguess, the Iconv library is not XML- or HTML-specific. You can use libcharguess and iconv together to convert an arbitrary string to a given encoding.
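As a rough illustration of converting an arbitrary string, the sketch below wraps the conversion in a helper. The name to_utf8 is hypothetical, and the helper relies on Ruby's built-in String#encode rather than Iconv, so treat it as one possible approach under those assumptions. The source encoding would normally come from a guess or from prior knowledge.

```ruby
# Hypothetical helper: convert +text+ from the encoding named +from+
# to UTF-8. Bytes that are malformed or have no mapping become "?"
# instead of raising an exception.
def to_utf8(text, from)
  text.encode('UTF-8', from, invalid: :replace, undef: :replace, replace: '?')
end

# A plain string with no markup, stored as ISO-8859-1 bytes.
coffee = to_utf8("caff\xE8".b, 'ISO-8859-1') # => "caffè"
```

Replacing unconvertible bytes keeps a batch job running, at the cost of silently losing characters; dropping the three options makes the helper raise instead.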

See Also
