Ruby: ensure_encoding to ensure your encoding

Consider the following situation: you receive a file, either from your drive or over HTTP, and you have no clue which encoding was used for its content. Nothing stored with the content indicates the encoding. Even scanning the file’s content does not reliably reveal it. It is just a series of bits and bytes, and every byte maps to a different character depending on the charset used. An ISO-8859-1 file may have the same byte sequence as a UTF-8 file, but only applying the correct encoding as the mapping leads to a meaningful sequence of characters.
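
To make this concrete, here is a small sketch in plain Ruby (2.0 or later). The constant RAW and the helper interpret are names made up for this illustration; the same two bytes are read once as UTF-8 and once as ISO-8859-1.

```ruby
# Two raw bytes without any encoding information attached (binary string).
RAW = "\xC3\xA9".b

# Hypothetical helper: read the bytes as `encoding`, then convert to UTF-8 for display.
def interpret(bytes, encoding)
  bytes.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end

puts interpret(RAW, Encoding::UTF_8)       # one character: é
puts interpret(RAW, Encoding::ISO_8859_1)  # two characters: Ã©
```

Both readings are valid, so the bytes alone cannot tell you which one was meant.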

You can consider yourself lucky if:

  • you know the file type, which can indicate the encoding
  • the actual encoding is stored inside the file’s content, for example in the encoding declaration of XML files (which of course can be wrong as well)

At lingohub we use the file type to decide how to parse language resource files.

Unfortunately, that is the theory; in practice the file’s encoding can be anything. Even developers touching these resource files might not know which encoding is set in their editor. They never had to think about it. The code they edit just works, since it hardly includes localized characters or is executed in an environment similar to the one it was written in.
But now imagine that you send your language resource files to your translator. The resource file then gets edited on a Windows system and is automatically saved in CP-1257. You wouldn’t expect a translator with no technical background to be aware that your resource file parser actually expects UTF-16, would you?

At lingohub we had to find a solution for this situation. Our import must handle *all* resource files regardless of the encoding used. As mentioned above, nothing indicates the encoding with 100% certainty; however, we found an approach that works quite well:

  1. Import the file in binary format.
  2. Apply an encoding to this byte sequence.
  3. If it fails, try the next encoding.
  4. Repeat until the conversion works out fine.
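
The steps above can be sketched with nothing but Ruby’s built-in String methods. This is only an illustration of the idea, not the code of any particular gem; the candidate list and the helper name convert_with_candidates are assumptions made for this example.

```ruby
# Candidate encodings in the order they are tried (an assumed list, adjust to your data).
# ISO-8859-1 accepts every byte sequence, so it only makes sense as the last resort.
CANDIDATES = [Encoding::UTF_8, Encoding::ISO_8859_1].freeze

# Hypothetical helper: returns the content as UTF-8 or raises if no candidate fits.
def convert_with_candidates(bytes, candidates = CANDIDATES)
  candidates.each do |candidate|
    attempt = bytes.dup.force_encoding(candidate)   # step 2: apply an encoding
    next unless attempt.valid_encoding?             # step 3: it failed, try the next one
    begin
      return attempt.encode(Encoding::UTF_8)        # step 4: the conversion worked
    rescue EncodingError
      next                                          # conversion failed, try the next one
    end
  end
  raise ArgumentError, "no candidate encoding fits the given bytes"
end

bytes = "caf\xE9".b                                 # step 1: binary content
puts convert_with_candidates(bytes)                 # café
```

For a real file you would pass in the result of File.binread(path). The order of the candidates matters: a permissive single-byte encoding placed first would match everything and hide the stricter ones.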

Because this is a common approach, there are several implementations out there. We have chosen ensure_encoding. It does a great job (even though it is labeled experimental).

After requiring this gem through

  • gem 'ensure-encoding', '0.1'

it will add the ensure_encoding() method to String. This will give you the power to convert the given String to your preferred encoding without actually knowing the input encoding.

Used with the option ‘:external_encoding => :sniff’, it tries to convert a String to UTF-8 while ‘sniffing’ the actual input encoding.

Since it is not always possible to convert every character to the chosen target encoding, there are several ways to handle that situation, selected via the :invalid_characters option:

  • :transcode – will always try to convert characters
  • :raise – will raise an exception
  • :drop – will drop all characters that cannot be converted

The readme covers some scenarios, so you can choose the option that fits best.
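
To get a feeling for what raising and dropping mean in practice, here is a rough analogue that uses only Ruby’s built-in String#encode and String#scrub (Ruby 2.1 or later). It is not how ensure_encoding is implemented, and the helper names are invented for this sketch.

```ruby
# "caf\xE9" is "café" in ISO-8859-1; tagged as UTF-8 the last byte is invalid.
BROKEN = "caf\xE9".b.force_encoding(Encoding::UTF_8)

# Comparable to :raise, the built-in conversion raises on invalid input.
# The target is UTF-16LE because encoding UTF-8 to UTF-8 would be a no-op.
def convert_or_raise(string)
  string.encode(Encoding::UTF_16LE)
end

# Comparable to :drop, invalid bytes are removed before converting.
def convert_and_drop(string)
  string.scrub("").encode(Encoding::UTF_16LE)
end

begin
  convert_or_raise(BROKEN)
rescue Encoding::InvalidByteSequenceError => e
  puts "conversion failed: #{e.class}"
end

puts convert_and_drop(BROKEN).encode(Encoding::UTF_8)   # caf
```

Dropping never fails, but it silently loses the "é", which is why the choice depends on how much you care about every single character.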

While the ‘:external_encoding => :sniff’ option works great for UTF-16 and UTF-8, it is not able to handle all the encodings we have to support for importing/exporting i18n resource files. We have therefore chosen to give ensure_encoding a hint about which encodings we expect.

Ruby and Encoding

This is the second post of my series “Do you speak UTF-8?”. Here you can find the first article on this topic.

This article covers how Ruby 1.9 handles encoding internally and which tooling it provides for encoding issues. Prior to Ruby 1.9 a String was just a sequence of bytes. Calling the method size() returned the size of this byte array, not the character count. In Ruby 1.9 the Encoding is stored along with that byte array.

You may ask a String for its encoding. size() returns the actual character count, while bytesize() gives you the actual number of bytes. How these two representations differ becomes visible when you compare the codepoints and the bytes.
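
A short sketch of these methods; the constants GREETING and LATIN1 are just names chosen for this example:

```ruby
# Written with \u escapes so the snippet does not depend on the source file's encoding.
GREETING = "Gr\u00FC\u00DFe"   # "Grüße"

p GREETING.encoding          # #<Encoding:UTF-8>
p GREETING.size              # 5 characters
p GREETING.bytesize          # 7 bytes, ü and ß take two bytes each in UTF-8
# to_a keeps this working on Ruby 1.9, where these methods return enumerators.
p GREETING.codepoints.to_a   # [71, 114, 252, 223, 101]
p GREETING.bytes.to_a        # [71, 114, 195, 188, 195, 159, 101]

# The same text in ISO-8859-1 needs exactly one byte per character.
LATIN1 = GREETING.encode(Encoding::ISO_8859_1)
p LATIN1.bytesize            # 5
p LATIN1.bytes.to_a          # [71, 114, 252, 223, 101]
```

In ISO-8859-1 the bytes happen to equal the codepoints, while in UTF-8 every character above 127 is spread over several bytes.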

Do you speak UTF-8?

“@mperham: You have a problem, your data is in latin1 so you think: ‘I’ll convert to UTF8!’ Now you have � problems.” cc @kingshy_g

Every one of us coders who has dealt with encodings has felt that pain, haven’t you?

During my career as a developer I was quite lucky: I rarely had to deal with encodings.
And if I had to: Ok, it wasn’t my fault. The provider of the data had chosen that exotic encoding (not knowing any better). But I was in charge of solving this problem!

Actually, for me the whole encoding issue feels like a never-ending Y2K bug.
We have the proper encodings nowadays, but we as computerists were not able to bring this topic to an end.

While reading different resource file formats Lingohub has to deal with this subject:

  • Java resource bundles are stored in ISO-8859-1 with \uXXXX (UTF-16) escapes
  • iOS strings are stored in UTF-16 (sometimes you have to guess: little/big endian)
  • XML: encoding="UTF-8". Good idea! But this could be a lie (by copy/paste)
  • some other formats have no defined encoding, nor do you have any metadata that gives you that information. So your application has to know which encoding it will be
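
For the little/big endian guess, a byte order mark (BOM) at the start of the file helps when it is present. The sketch below uses made-up helper names and only looks at the two UTF-16 BOMs; a file without a BOM still has to be guessed some other way.

```ruby
# Hypothetical helper: returns the UTF-16 variant indicated by the BOM, or nil without one.
# Note: a UTF-32LE BOM also starts with FF FE, which this sketch ignores.
def utf16_encoding_from_bom(bytes)
  case bytes.b.byteslice(0, 2)
  when "\xFF\xFE".b then Encoding::UTF_16LE
  when "\xFE\xFF".b then Encoding::UTF_16BE
  end
end

# Hypothetical helper: strips the BOM and converts to UTF-8, or returns nil without a BOM.
def decode_utf16_with_bom(bytes)
  encoding = utf16_encoding_from_bom(bytes)
  return nil unless encoding
  bytes.b.byteslice(2..-1).force_encoding(encoding).encode(Encoding::UTF_8)
end

little = "\xFF\xFEh\x00i\x00".b
big    = "\xFE\xFF\x00h\x00i".b
puts decode_utf16_with_bom(little)   # hi
puts decode_utf16_with_bom(big)      # hi
```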

Ok, Ok. This was just a rant and won’t give you any solutions.
I will finish for today and continue this topic as a series of posts to give you some ideas of how we solved some of our issues in the encoding domain.