Converting from One Encoding to Another
Credit: Mauro Cicio
Problem
You want to convert a document to a given charset encoding (probably UTF-8).
Solution
If you don't know the document's current encoding, you can guess at it using the Charguess library described in the previous recipe. Once you know the current encoding, you can convert the document to another encoding using Ruby's standard iconv library.
Here's an XML document written in Italian, with no explicit encoding:
doc = %{
}
Let's figure out its encoding and convert it to UTF-8:
require 'iconv'
require 'charguess' # not necessary if input encoding is known

input_encoding = CharGuess::guess doc   # => "windows-1252"
output_encoding = 'utf-8'

converted_doc = Iconv.new(output_encoding, input_encoding).iconv(doc)
CharGuess::guess(converted_doc)         # => "UTF-8"
Discussion
The heart of the iconv library is the Iconv class, a wrapper for the Unix 95 iconv() family of functions. These functions translate strings between various encoding systems. Since iconv is part of the Ruby standard library, it should already be available on your system.
Iconv works well in conjunction with Charguess: even if Charguess guesses the encoding slightly wrong (such as guessing Windows-1252 for an ISO-8859-1 document), its guess is always close enough that iconv can convert the document to another encoding.
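To see why a Windows-1252 guess for an ISO-8859-1 document is usually harmless, here is a minimal sketch. It uses Ruby's built-in String#encode (available since Ruby 1.9) rather than Iconv, and the sample bytes are made up for illustration. The two encodings agree on accented Latin letters and differ only for bytes in the 0x80-0x9F range.

```ruby
# Illustrative bytes (not from the recipe): "città e perché" in a
# single-byte Western European encoding.
bytes = "citt\xE0 e perch\xE9".b

# Decode the same bytes under both candidate encodings.
as_latin1 = bytes.encode('UTF-8', 'ISO-8859-1')
as_cp1252 = bytes.encode('UTF-8', 'Windows-1252')

# Accented letters sit at the same code points in both encodings,
# so the converted text is identical.
same_text = (as_latin1 == as_cp1252)

# The encodings differ only in 0x80-0x9F. Byte 0x93 is a curly quote
# in Windows-1252 but a control character in ISO-8859-1.
quote_cp1252 = "\x93".b.encode('UTF-8', 'Windows-1252')
quote_latin1 = "\x93".b.encode('UTF-8', 'ISO-8859-1')
quotes_differ = (quote_cp1252 != quote_latin1)
```

For ordinary Italian text like the sample above, either guess yields the same UTF-8 output. A wrong guess only matters when the document contains bytes from that narrow range, such as curly quotes.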
Like Charguess, the Iconv library is not XML- or HTML-specific. You can use libcharguess and iconv together to convert an arbitrary string to a given encoding.
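The same idea can be sketched on newer Rubies, where Iconv is no longer bundled (it was removed from the standard library in Ruby 2.0 and lives on as the iconv gem). The sketch below swaps in the built-in String#encode instead of Iconv. The helper name convert_encoding and the sample string are hypothetical, and the source encoding is assumed to be already known or guessed.

```ruby
# Hypothetical helper (not part of any library): transcode a string of raw
# bytes from a known source encoding to a target encoding.
def convert_encoding(text, from, to = 'UTF-8')
  # String#encode(dst, src) interprets the bytes as `src` and returns a
  # new string transcoded to `dst`.
  text.encode(to, from)
end

# An arbitrary, non-XML string: "perché" as Windows-1252 bytes.
raw = "perch\xE9".b

converted = convert_encoding(raw, 'Windows-1252')
converted.encoding          # Encoding::UTF_8
converted.valid_encoding?   # true
```

Passing the source encoding explicitly keeps the helper independent of whatever encoding the string happens to be tagged with, which mirrors how Iconv.new takes both the output and input encodings.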
See Also
- Recipe 11.11, "Guessing a Document's Encoding"
- The iconv library is documented at http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/classes/Iconv.html; you can find pointers to The Open Group Unix library specifications