Ruby and Encoding

This is the second post of my series “Do you speak UTF-8?”. Here you can find the first article on this topic.

This article covers how Ruby 1.9 handles encoding internally and which tools it provides for dealing with encoding issues. Prior to Ruby 1.9, a String was just a sequence of bytes. Calling size() returned the length of this byte array, not the character count. Since Ruby 1.9, every String stores its Encoding along with that byte array, and the encoding determines how the bytes are interpreted as characters.
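To make the idea of "bytes plus an encoding tag" concrete, here is a minimal sketch. The helper name retag_as_binary is made up for this illustration. Tagging the bytes as binary only roughly approximates the pre-1.9 view of a String, it is not how Ruby 1.8 worked internally.

```ruby
# Hypothetical helper for illustration: returns a copy of the string with
# the same bytes, but tagged as raw binary data instead of UTF-8.
def retag_as_binary(str)
  str.dup.force_encoding(Encoding::BINARY)
end

utf8 = "h\u00e9llo"            # "héllo"; the \u escape makes the literal UTF-8
raw  = retag_as_binary(utf8)

puts utf8.size                 # characters, as seen through the UTF-8 tag
puts raw.size                  # characters, when every byte counts as one
puts utf8.bytes.to_a == raw.bytes.to_a  # the underlying bytes are unchanged
```

Only the tag differs between the two strings. force_encoding does not convert anything, it just changes how the same bytes are read.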

You can ask a String for its encoding. size() returns the actual character count, while bytesize() gives you the actual number of bytes. How these two representations differ becomes visible when you compare the codepoints with the bytes.
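The following sketch shows these calls side by side. The helper describe_string and the sample string are assumptions made for this illustration; encoding, size, bytesize, codepoints and bytes are the String methods Ruby provides.

```ruby
# Hypothetical helper for illustration: collects what Ruby reports about
# a string's encoding, its length, and its contents.
def describe_string(str)
  {
    :encoding   => str.encoding.name,
    :size       => str.size,             # number of characters
    :bytesize   => str.bytesize,         # number of bytes
    :codepoints => str.codepoints.to_a,  # one integer per character
    :bytes      => str.bytes.to_a        # one integer per byte
  }
end

# "héllo", written with a \u escape so the source file encoding does not matter
info = describe_string("h\u00e9llo")
info.each { |key, value| puts "#{key}: #{value.inspect}" }
```

The character é is a single codepoint (233) but takes two bytes in UTF-8 (195 and 169), which is why the string has five characters and six bytes.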