Handling International Encodings

2017-11-03 09:05:06

Problem

You need to handle strings that contain non-ASCII characters: most likely Unicode characters encoded in UTF-8.

Solution

To use Unicode in Ruby 1.8, add the following to the beginning of your code:

$KCODE = 'u'
require 'jcode'

You can also invoke the Ruby interpreter with arguments that do the same thing:

$ ruby -Ku -rjcode

If you use a Unix environment, you can add the arguments to the shebang line of your Ruby application:

#!/usr/bin/ruby -Ku -rjcode

The jcode library overrides most of the methods of String and makes them capable of handling multibyte text. The exceptions are String#length, String#count, and String#size, which are not overridden. Instead, jcode defines three new methods: String#jlength, String#jcount, and String#jsize.

Discussion

Consider a UTF-8 string that encodes six Unicode characters, the fullwidth Latin capital letters: efbca1 (Ａ), efbca2 (Ｂ), and so on up to efbca6 (Ｆ):

string = "\xef\xbc\xa1" + "\xef\xbc\xa2" + "\xef\xbc\xa3" + "\xef\xbc\xa4" + "\xef\xbc\xa5" + "\xef\xbc\xa6"

The string contains 18 bytes that encode 6 characters:

string.size   # => 18
string.jsize  # => 6
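
The gap between 18 and 6 follows from how UTF-8 lays out a multibyte character: one lead byte followed by continuation bytes of the form 10xxxxxx. The sketch below counts characters by skipping continuation bytes. It is only an illustration of the idea, not how jcode implements jsize; utf8_char_count is a hypothetical helper name, and the code assumes Ruby 1.8.7 or later (for String#bytes and String#bytesize) and well-formed UTF-8 input.

```ruby
# Hypothetical helper (not part of jcode): count UTF-8 characters by
# counting every byte that is NOT a continuation byte (10xxxxxx).
def utf8_char_count(str)
  str.bytes.count { |byte| (byte & 0xC0) != 0x80 }
end

string = "\xef\xbc\xa1\xef\xbc\xa2\xef\xbc\xa3\xef\xbc\xa4\xef\xbc\xa5\xef\xbc\xa6"

string.bytesize          # => 18
utf8_char_count(string)  # => 6
```

Each of the six characters contributes one lead byte (0xef) and two continuation bytes, so only six bytes pass the test.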

String#count is a method that takes a string of bytes and counts how many times those bytes occur in the string. String#jcount takes a string of characters and counts how many times those characters occur in the string:

string.count "\xef\xbc\xa2"   # => 13
string.jcount "\xef\xbc\xa2"  # => 1

String#count treats "\xef\xbc\xa2" as three separate bytes, and counts the number of times each of those bytes shows up in the string. String#jcount treats the same string as a single character, and looks for that character in the string, finding it only once.

"\xef\xbc\xa2".length   # => 3
"\xef\xbc\xa2".jlength  # => 1

Apart from these differences, Ruby handles most Unicode behind the scenes. Once you have your data in UTF-8 format, you really don't have to worry. Given that Ruby's creator Yukihiro Matsumoto is Japanese, it is no wonder that Ruby handles Unicode so elegantly.
