Consider the following situation: you receive a file, either from your drive or over HTTP, and you have no clue which encoding was used for its content. Nothing stored with the content indicates the encoding, and scanning the content does not reliably reveal it either. It is just a sequence of bytes, and each byte maps to a different character depending on the charset used. An ISO-8859-1 file may have the same byte sequence as a UTF-8 file, but only applying the correct encoding as the mapping leads to a meaningful sequence of characters.
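
As a small illustration of that ambiguity, the following sketch interprets the same two bytes once as UTF-8 and once as ISO-8859-1. It uses only Ruby's standard String methods; the variable names are made up for this example.

```ruby
# The same two bytes, interpreted under two different encodings.
bytes = [0xC3, 0xA9].pack("C*") # a binary (ASCII-8BIT) string

# Read as UTF-8, the two bytes form a single character.
as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)

# Read as ISO-8859-1, every byte is its own character.
as_latin1 = bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts as_utf8.length   # 1
puts as_latin1.length # 2
```

Both interpretations are valid, so the bytes alone cannot tell you which one the author meant.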

You can consider yourself lucky if:

  • you know the file type, which can indicate the encoding
  • the actual encoding is stored inside the file’s content, for example in the encoding declaration of XML files (which of course can be wrong as well)
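
For the XML case, a minimal sketch of reading the declared encoding could look like this. The helper name `declared_xml_encoding` and the 200-byte limit are assumptions made for illustration, and the approach only works when the declaration itself is stored in an ASCII-compatible encoding without a leading byte order mark.

```ruby
# Hypothetical helper: return the encoding named in an XML declaration, or nil.
XML_DECLARATION = /\A<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/n

def declared_xml_encoding(bytes)
  head = bytes.b[0, 200] # the declaration must be at the very start of the file
  match = XML_DECLARATION.match(head)
  return nil unless match

  Encoding.find(match[1])
rescue ArgumentError
  nil # the declaration names an encoding Ruby does not know
end
```

Because the declaration can be wrong, it is worth treating the result as a hint and still checking that the bytes are valid in that encoding, for example with `valid_encoding?`.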

At lingohub we use the file type to decide how to parse language resource files.

Unfortunately, that is the theory. In practice the file’s encoding can be anything. Even developers touching these resource files might not know which encoding is set in their editor. They never had to think about it: the code they edit works fine because it rarely contains localized characters, or because it is executed in an environment similar to the one it was written in.
But now imagine that you send your language resource files to your translator. The resource file then gets edited on a Windows system and is automatically saved in a Windows code page such as CP-1257. You don’t expect a translator with no technical background to be aware that your resource file parser actually expects UTF-16, do you?

At lingohub we had to find a solution for this situation. Our import must handle *all* resource files regardless of the encoding used. As mentioned above, there is no evidence that identifies the encoding with 100% certainty; however, we found an approach that works quite well:

  1. Import the file in binary format.
  2. Apply an encoding to this byte sequence.
  3. If that fails, try the next encoding.
  4. Repeat until the conversion succeeds.
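
The steps above can be sketched with Ruby's standard library alone. This is not how ensure_encoding is implemented; the method name `to_utf8` and the candidate list are assumptions chosen for illustration.

```ruby
# Hypothetical candidate list, ordered from strictest to most permissive.
CANDIDATES = [Encoding::UTF_8, Encoding::Windows_1252, Encoding::ISO_8859_1].freeze

def to_utf8(bytes, candidates = CANDIDATES)
  candidates.each do |encoding|
    begin
      # Step 2: label the raw bytes with the candidate encoding.
      text = bytes.dup.force_encoding(encoding)
      # Step 3: skip candidates for which the bytes are malformed.
      next unless text.valid_encoding?

      # Step 4: the first candidate that converts cleanly wins.
      return text.encode(Encoding::UTF_8)
    rescue EncodingError
      next # conversion failed, try the next candidate
    end
  end
  raise ArgumentError, "none of the candidate encodings fits the data"
end

# Step 1 would read the file in binary mode, e.g. File.binread(path).
puts to_utf8("caf\xE9".b) # not valid UTF-8, so a later candidate is used
```

The order matters: single-byte encodings such as ISO-8859-1 accept almost any byte sequence, so they belong at the end of the list, and a successful conversion is still only a best guess.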

Because this is a common approach, there are several implementations out there. We chose ensure_encoding. It does a great job (even though it is labeled experimental).

After requiring this gem through

  • gem 'ensure-encoding', '0.1'

it will add the ensure_encoding() method to String. This will give you the power to convert the given String to your preferred encoding without actually knowing the input encoding.

Called with the ':external_encoding => :sniff' option, it will try to convert a String to UTF-8 while ‘sniffing’ the actual input encoding.

Since it is not always possible to convert every character to the chosen target encoding, there are several ways to handle that situation, selected through the :invalid_characters option:

  • :transcode – will always try to convert characters
  • :raise – will raise an exception
  • :drop – will drop all non-convertible characters

The readme covers some scenarios, so you can choose the option that fits best.
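
To make the trade-off more concrete, here is a rough analogue built on Ruby's own `String#encode` rather than on ensure_encoding. It shows raising versus dropping versus substituting a placeholder; it does not reproduce the gem's `:transcode` behaviour, and the sample text is invented.

```ruby
# The euro sign cannot be represented in ISO-8859-1.
text = "price: 5 \u20AC"

# Raise: the default behaviour of String#encode.
begin
  text.encode(Encoding::ISO_8859_1)
rescue Encoding::UndefinedConversionError => e
  puts "cannot convert: #{e.message}"
end

# Drop: replace unconvertible characters with an empty string.
dropped = text.encode(Encoding::ISO_8859_1, invalid: :replace, undef: :replace, replace: "")

# Substitute: keep a visible placeholder instead of silently losing data.
marked = text.encode(Encoding::ISO_8859_1, invalid: :replace, undef: :replace, replace: "?")
```

Dropping keeps the import running but loses information silently, so for translation data a raised error or a visible placeholder is often easier to notice and fix.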

While the ':external_encoding => :sniff' option works great for UTF-16 and UTF-8, it cannot handle all the encodings we have to support for importing and exporting i18n resource files. We have therefore chosen to give ensure_encoding a hint about which encodings we expect.
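
One plausible reason sniffing works well for UTF-8 and UTF-16 is that such files often start with a byte order mark, whereas legacy single-byte encodings carry no marker at all. The sketch below shows that kind of check; it is an assumption for illustration and not necessarily what ensure_encoding does internally. The name `sniff_bom` is hypothetical.

```ruby
# Byte order marks for the encodings mentioned above, as binary strings.
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b     => Encoding::UTF_16LE,
  "\xFE\xFF".b     => Encoding::UTF_16BE
}.freeze

# Hypothetical helper: return the encoding indicated by a leading BOM, or nil.
def sniff_bom(bytes)
  data = bytes.b
  BOMS.each { |bom, encoding| return encoding if data.start_with?(bom) }
  nil # no BOM: fall back to a list of expected encodings
end
```

When no marker is found, an explicit list of expected encodings is the only remaining guide, which is why a hint is needed for everything else.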
