Handling International Encodings
Problem
You need to handle strings that contain non-ASCII characters: probably Unicode characters encoded in UTF-8.
Solution
To use Unicode in Ruby, add the following to the beginning of your code:
$KCODE = 'u'
require 'jcode'
You can also invoke the Ruby interpreter with arguments that do the same thing:
$ ruby -Ku -rjcode
If you use a Unix environment, you can add the arguments to the shebang line of your Ruby application:
#!/usr/bin/ruby -Ku -rjcode
The jcode library overrides most of the methods of String and makes them capable of handling multibyte text. The exceptions are String#length, String#count, and String#size, which are not overridden. Instead, jcode defines three new methods: String#jlength, String#jcount, and String#jsize.
Discussion
Consider a UTF-8 string that encodes six Unicode characters, each three bytes long: efbca1 (fullwidth A), efbca2 (fullwidth B), and so on up to efbca6 (fullwidth F):
string = "\xef\xbc\xa1" + "\xef\xbc\xa2" + "\xef\xbc\xa3" + "\xef\xbc\xa4" + "\xef\xbc\xa5" + "\xef\xbc\xa6"
The string contains 18 bytes that encode 6 characters:
string.size                    # => 18
string.jsize                   # => 6
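The byte and character counts above can be cross-checked without jcode. The following is a minimal sketch, not a description of how jcode computes jsize: it uses String#unpack, where the 'C*' directive yields one integer per byte and 'U*' yields one integer per UTF-8 character. The variable names are illustrative only.

```ruby
# Six fullwidth letters (U+FF21..U+FF26), written as raw UTF-8 bytes.
string = "\xef\xbc\xa1\xef\xbc\xa2\xef\xbc\xa3\xef\xbc\xa4\xef\xbc\xa5\xef\xbc\xa6"

bytes       = string.unpack('C*')   # one integer per byte
code_points = string.unpack('U*')   # one integer per UTF-8 character

byte_count = bytes.size             # => 18
char_count = code_points.size       # => 6

# The first character's code point, formatted the way Unicode charts show it.
first_code_point = format('U+%04X', code_points.first)   # => "U+FF21"
```

Because each of these characters occupies exactly three bytes, the byte count is three times the character count.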
String#count is a method that takes a string of bytes and counts how many times those bytes occur in the string. String#jcount takes a string of characters and counts how many times those characters occur in the string:
string.count "\xef\xbc\xa2"    # => 13
string.jcount "\xef\xbc\xa2"   # => 1
String#count treats "\xef\xbc\xa2" as three separate bytes, and counts the number of times each of those bytes shows up in the string. String#jcount treats the same string as a single character, and looks for that character in the string, finding it only once.
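To see where the 13 comes from, the two kinds of counting can be reproduced by hand. This is a rough sketch under the assumption that String#unpack is an acceptable stand-in for the byte-wise and character-wise views of the string; it is not how String#count or String#jcount are implemented, and byte_hits and char_hits are names made up for the illustration.

```ruby
string = "\xef\xbc\xa1\xef\xbc\xa2\xef\xbc\xa3\xef\xbc\xa4\xef\xbc\xa5\xef\xbc\xa6"
target = "\xef\xbc\xa2"   # one character, three bytes

# Byte-wise: count every byte of string that is one of the target's bytes.
target_bytes = target.unpack('C*')    # [0xEF, 0xBC, 0xA2]
byte_hits = string.unpack('C*').select { |b| target_bytes.include?(b) }.size
# => 13 (0xEF six times, 0xBC six times, 0xA2 once)

# Character-wise: count every character of string that appears in the target.
target_chars = target.unpack('U*')    # [0xFF22]
char_hits = string.unpack('U*').select { |c| target_chars.include?(c) }.size
# => 1
```

The bytes 0xEF and 0xBC begin every one of the six characters, which is why the byte-wise count is so much larger than the character-wise one.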
The same distinction applies to String#length and String#jlength:
"\xef\xbc\xa2".length          # => 3
"\xef\xbc\xa2".jlength         # => 1
Apart from these differences, Ruby handles most Unicode behind the scenes. Once you have your data in UTF-8 format, you really don't have to worry. Given that Ruby's creator Yukihiro Matsumoto is Japanese, it is no wonder that Ruby handles Unicode so elegantly.
See Also
- If you have text in some other encoding and need to convert it to UTF-8, use the iconv library, as described in Recipe 11.2, "Extracting Data from a Document's Tree Structure"
- There are several online search engines for Unicode characters; two good ones are at http://isthisthingon.org/unicode/ and http://www.fileformat.info/info/unicode/char/search.htm