i18n Resource File Formats: RESX and RESW files

Next up in our blog series on localization resource file formats are Windows-related formats. RESX files are used by programs developed with Microsoft’s .NET Framework. They store objects and strings for a program in an XML format. They may contain both plain text and binary data, the latter encoded as text within the XML tags.

RESW files are used by Microsoft Windows and Silverlight applications and contain strings that are used to localize the application for different languages and contexts. They are often used with XAML applications (such as Expression), which abstract the user interface strings to resource files. Let’s have a closer look at what these files look like in terms of formatting:

Syntax of the RESW and RESX resource file format:

  • documents start with <?xml version="1.0" encoding="ENCODING"?>, where ENCODING is the desired encoding
  • key-value pairs are nested within a <root> element and have this form:

<data name="key" xml:space="preserve"><value>value</value></data>

  • placeholder syntax is {name}, where “name” can be any combination of non-white-space characters
  • XML comments (<!-- ... -->) preceding a key-value pair are treated as a translation description belonging to that pair, and can contain LingoChecks
  • we use UTF-8 encoding for RESX and RESW resource file exports by default, but we also support other encodings our users might prefer.

Example of the RESX/RESW resource file format:
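
A minimal sample that follows the rules above can be embedded in a short Ruby sketch that reads the key-value pairs back out. The keys, the texts, and the helper name read_entries are made up for illustration, and the placeholder pattern is only one possible reading of “non-white-space characters”. REXML is used as the XML parser here; that is an assumption about tooling, not something lingohub prescribes.

```ruby
require "rexml/document"

# Hypothetical sample content; keys and texts are invented for illustration.
SAMPLE_RESX = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <root>
    <!-- Greeting shown on the start page -->
    <data name="greeting" xml:space="preserve">
      <value>Hello {user_name}!</value>
    </data>
    <data name="farewell" xml:space="preserve"><value>Goodbye</value></data>
  </root>
XML

# Assumed placeholder pattern: braces around a run of non-whitespace characters.
PLACEHOLDER = /\{([^\s{}]+)\}/

# Collects every <data name="..."><value>...</value></data> pair into a Hash.
def read_entries(xml)
  doc = REXML::Document.new(xml)
  entries = {}
  doc.elements.each("root/data") do |data|
    entries[data.attributes["name"]] = data.elements["value"].text
  end
  entries
end

entries = read_entries(SAMPLE_RESX)

# Placeholder names found in each translation text.
placeholders = entries.transform_values { |text| text.scan(PLACEHOLDER).flatten }
```

Real RESX files produced by Visual Studio carry additional header elements; the sketch only covers the parts described in the list above.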

Thanks for reading. Let us know if you have questions on using RESW or RESX resource files for your localization projects. We also support newer file formats for Microsoft-related projects, such as Windows 8 or RT (see blog entry). Our series on localization file formats will be continued. We’ve previously covered .ini, .strings and .properties files, for example. I am looking forward to your comments.

lingohub API ready for BaRuCo

We know the importance of an API. That is why we designed lingohub from the beginning with a REST interface in mind. The only thing that was missing (until now) was the documentation.

REST API

Today, just in time for BaRuCo, we released our Developer Center https://lingohub.com/developers. You will find documentation about REST endpoints, restrictions, authentication, and everything else you need to integrate with lingohub. Version 1 focuses mostly on the vital resources of lingohub: projects, collaborators, translations and phrases. We are thinking about adding some convenience endpoints, but we wanted your feedback first. Contact us and let us know your suggestions, criticism, and praise.

Lastly, greetings to our friends at BaRuCo. It is a shame that not all lingohubers made it, but I am sure Markus will represent lingohub pretty well at the Github Drink Up.

Ruby: ensure_encoding to ensure your encoding

Consider the following situation: you receive a file, either from your drive or over HTTP, and you have no clue which encoding was used for its content. Nothing stored with the content indicates the encoding. Even scanning the file’s content doesn’t help to derive it. It is just a series of bits and bytes, and every byte maps to a different character depending on the charset used. An ISO-8859-1 file may have the same byte sequence as a UTF-8 file. Only applying the correct encoding as the mapping yields a meaningful sequence of characters.
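
To make this concrete, here is a small Ruby sketch that interprets the same two bytes under two different charsets. The byte values are chosen for illustration; both interpretations are valid, and only one of them is what the author of the file meant.

```ruby
# The same two bytes, with no encoding information attached.
raw = "\xC3\xA9".b

# Interpreted as UTF-8, the bytes form a single character.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)

# Interpreted as ISO-8859-1, each byte is its own character.
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts as_utf8    # é
puts as_latin1  # Ã©
```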

You can consider yourself lucky if:

  • you know the file type, which can indicate the encoding
  • the actual encoding is stored inside the file’s content, for example in the encoding directive of XML files (which of course can be wrong as well)
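
As a sketch of the second case, the declared encoding of an XML file can be read from its first bytes before any decoding happens. The helper name declared_xml_encoding and the regular expression are assumptions made for this example; the approach only works for ASCII-compatible encodings, so a UTF-16 file would have to be recognised some other way first.

```ruby
# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
XML_ENCODING = /\A<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/

# Returns the Encoding named in the XML declaration, or nil if there is no
# declaration or Ruby does not know the declared name.
def declared_xml_encoding(bytes)
  match = XML_ENCODING.match(bytes.b[0, 256])
  return nil unless match

  Encoding.find(match[1])
rescue ArgumentError
  nil
end
```

The returned value is a hint, not proof: as noted above, the declaration itself may be wrong.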

At lingohub we use the file type to decide how a language resource file is parsed and which encoding to expect.

Unfortunately, this is the theory; in practice the file’s encoding can be anything. Even developers touching these resource files might not know which encoding is set in their editor. They never had to think about it. The code they edit just works fine, since it hardly includes localized characters or is executed in an environment similar to the one it was written in.
But now imagine that you send your language resource files to your translator. The resource file then gets edited on a Windows system and is automatically saved in CP-1257 encoding. You wouldn’t expect a translator with no technical background to be aware that your resource file parser actually expects UTF-16, would you?

At lingohub we had to find a solution for this situation. Our import must handle *all* resource files regardless of the encoding used. As mentioned above, nothing indicates the encoding with 100% certainty; however, we found an approach that works quite well:

  1. Import the file in binary format.
  2. Apply an encoding to this byte sequence.
  3. If it fails, try the next encoding.
  4. Repeat until the conversion succeeds.
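
The steps above can be sketched with nothing but Ruby’s standard String methods. This is not how ensure_encoding is implemented; the helper name decode_to_utf8 and the candidate list are assumptions made for this example.

```ruby
# Candidate encodings, tried in order (an assumed list for illustration).
CANDIDATE_ENCODINGS = [
  Encoding::UTF_8,
  Encoding::Windows_1252,
  Encoding::ISO_8859_1
].freeze

# Tries each candidate on the raw bytes and returns the first result that
# is valid and converts cleanly to UTF-8.
def decode_to_utf8(bytes, candidates = CANDIDATE_ENCODINGS)
  candidates.each do |encoding|
    attempt = bytes.dup.force_encoding(encoding) # step 2: apply an encoding
    next unless attempt.valid_encoding?          # step 3: it failed, try the next

    begin
      return attempt.encode(Encoding::UTF_8)     # step 4: the conversion worked
    rescue EncodingError
      next                                       # undefined bytes: try the next
    end
  end
  raise ArgumentError, "no candidate encoding fits these bytes"
end

# Step 1 would be reading the file in binary mode, for example:
#   decode_to_utf8(File.binread(path))
```

The order of the candidates matters. Single-byte encodings accept almost any byte sequence, so strict encodings such as UTF-8 should be tried first and permissive ones last.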

Because this is a common approach, there are several implementations out there. We have chosen ensure_encoding. It does a great job (even though it is labelled experimental).

After requiring this gem through

  • gem 'ensure-encoding', '0.1'

it will add the ensure_encoding() method to String. This will give you the power to convert the given String to your preferred encoding without actually knowing the input encoding.

Called with the ‘:external_encoding => :sniff’ option, it will try to convert a String to UTF-8 while ‘sniffing’ the actual input encoding.

Since it is not always possible to convert every character to the chosen target encoding, there are several ways to handle that situation, selected through the :invalid_characters option:

  • :transcode – will always try to convert characters
  • :raise – will raise an exception
  • :drop – will drop all characters that cannot be converted

The readme covers some scenarios, so one may choose which option would be the best fit.
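
For comparison, Ruby’s built-in String#encode offers similar choices when a character has no counterpart in the target encoding. The sketch below uses only the standard library and is a rough analogue, not the gem’s implementation; the sample text and target encoding are chosen for illustration, and the substitution variant is an extra that the list above does not mention.

```ruby
TARGET = Encoding::ISO_8859_1
text = "price: 10 \u20AC" # the euro sign does not exist in ISO-8859-1

# Comparable to :raise: by default an unconvertible character raises.
strict_error =
  begin
    text.encode(TARGET)
    nil
  rescue Encoding::UndefinedConversionError => e
    e
  end

# Comparable to :drop: unconvertible characters are replaced by nothing.
dropped = text.encode(TARGET, invalid: :replace, undef: :replace, replace: "")

# Substitution: unconvertible characters become a visible marker.
replaced = text.encode(TARGET, invalid: :replace, undef: :replace, replace: "?")
```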

While the ‘:external_encoding => :sniff’ option works great for UTF-16 and UTF-8, it is not able to handle all the encodings we have to support for importing and exporting i18n resource files. We have therefore chosen to give ensure_encoding a hint about which encodings we expect.

Do you speak UTF-8?

“@mperham: You have a problem, your data is in latin1 so you think: ‘I’ll convert to UTF8!’ Now you have � problems.” cc @kingshy_g

Every one of us coders who has dealt with encodings has felt that pain, haven’t you?

During my career as a developer I was quite lucky: I seldom had to deal with encodings.
And when I had to, well, it wasn’t my fault. The provider of the data had chosen (not knowing any better) some exotic encoding. But I was in charge of solving the problem!

Actually, for me the whole encoding issue feels like a never-ending Y2K bug.
We have the proper encodings nowadays, but we as computer people have not been able to put this topic to rest.

When reading different resource file formats, lingohub has to deal with this subject:

  • Java resource bundles are stored in ISO-8859-1 with \uXXXX escapes (UTF-16 code units)
  • iOS strings are stored in UTF-16 (sometimes you have to guess: little or big endian)
  • XML: encoding="UTF-8". Good idea! But this could be a lie (through copy/paste)
  • some other formats do not have a defined encoding, nor do you have any metadata that gives you that information. So your application has to know which encoding it will be
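
Two of these cases can be sketched with Ruby’s standard library. The helper names utf16_to_utf8 and decode_unicode_escapes, and the little-endian fallback, are assumptions made for this example; this is not lingohub’s actual import code. The first helper uses the byte order mark to settle the endianness question when one is present, and the second turns \uXXXX escapes into real characters.

```ruby
# Byte order marks and the UTF-16 flavour each one announces.
UTF16_BOMS = {
  "\xFF\xFE".b => Encoding::UTF_16LE,
  "\xFE\xFF".b => Encoding::UTF_16BE
}.freeze

# Decodes UTF-16 bytes to UTF-8, using the BOM when present and an
# assumed fallback byte order otherwise.
def utf16_to_utf8(bytes, fallback: Encoding::UTF_16LE)
  data = bytes.b
  bom, encoding = UTF16_BOMS.find { |mark, _| data.start_with?(mark) }
  data = data.byteslice(bom.bytesize, data.bytesize) if bom
  data.force_encoding(encoding || fallback).encode(Encoding::UTF_8)
end

# Replaces runs of \uXXXX escapes with the characters they stand for.
# Adjacent escapes are decoded together so surrogate pairs survive.
# Expects a line that has already been converted to UTF-8.
def decode_unicode_escapes(line)
  line.gsub(/(?:\\u\h{4})+/) do |run|
    units = run.scan(/\\u(\h{4})/).flatten.map(&:hex)
    units.pack("n*").force_encoding(Encoding::UTF_16BE).encode(Encoding::UTF_8)
  end
end
```

Without a BOM the byte order remains a guess, which is exactly the situation described in the second bullet.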

Ok, ok. This was just a rant and won’t give you any solutions.
I will stop here for today and continue this topic as a series of posts, to give you some ideas of how we solved some of our issues in the encoding domain.