Quick Reference Guide to Character Encoding

In working with project and localization managers, localization salespeople, and people who are new to the internationalization (i18n) process, I regularly encounter confusion about the technical terms that describe the encoding needed to support particular sets of locales and their languages. This post aims to make that topic digestible: it matters if you want your software to support localization, but it makes many people’s eyes glaze over. There is far more to read on each of these entries, which are by no means complete, but I have found that references like Wikipedia and the Unicode.org site can overwhelm people who don’t need or want the details. There are more encodings, but these are the primary ones:

ASCII Encoding: Handles the English alphabet, digits, and basic punctuation. Breaks as soon as you add accented or other non-English characters. 128 characters supported in total. 7 bits per character. Don’t use it!

ISO Latin-1 Encoding (also referred to as ISO 8859-1): In addition to the English alphabet and numbers (the codes are the same as ASCII), adds support for Western European languages (e.g. French, Italian, German, Spanish). 8 bits per character (single byte).
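To make the limits of these two single-byte encodings concrete, here is a minimal Python sketch. The helper name can_encode and the sample strings are my own illustrations, not part of any standard; the codec names "ascii" and "latin-1" are the ones built into Python.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample strings, one per kind of language.
samples = {
    "English": "Hello",
    "French": "café",
    "Japanese": "日本語",
}

for label, text in samples.items():
    ascii_ok = can_encode(text, "ascii")
    latin1_ok = can_encode(text, "latin-1")
    print(f"{label}: ascii={ascii_ok}, latin-1={latin1_ok}")

# Latin-1 stores each character it supports in exactly one byte.
latin1_size = len("café".encode("latin-1"))
print("café in Latin-1 takes", latin1_size, "bytes")
```

The English sample passes both checks, the French sample fails ASCII but fits in Latin-1, and the Japanese sample fits in neither, which is where Unicode comes in.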

Unicode Character Set: A super character set that includes digital support for all the world’s commercially used languages and more.  Unicode characters can be supported using the following encodings:

UTF-8: This is a common encoding for supporting Unicode characters. Uses a single byte for the ASCII characters (same codes as ASCII), adding additional bytes (up to 4 in total) to encode other characters.

UTF-16: This is another common encoding for supporting Unicode. Uses two bytes for most characters, and a pair of two-byte units (four bytes in total, known as a surrogate pair) for characters outside the most commonly used range.

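UCS-2: The predecessor of UTF-16. It always uses exactly two bytes per character, so it cannot represent the characters that UTF-16 encodes with four bytes; for every other character the two are identical. Used on Microsoft’s SQL Server Database for supporting Unicode.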
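If you want to see how many bytes each Unicode encoding actually spends on a character, a short Python sketch like the following will do. The function names byte_lengths and fits_in_ucs2 are hypothetical helpers of mine; the "utf-16-be" codec is used only so that Python does not prepend a byte order mark and inflate the count.

```python
def byte_lengths(ch: str) -> dict:
    """Report the code point of one character and its size in UTF-8 and UTF-16."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),
        # "utf-16-be" omits the byte order mark, so only the character is counted.
        "utf-16": len(ch.encode("utf-16-be")),
    }


def fits_in_ucs2(ch: str) -> bool:
    """UCS-2 has a single two-byte unit, so it stops at code point U+FFFF."""
    return ord(ch) <= 0xFFFF


# Letter A, accented e, a Chinese ideograph, and an emoji (written as escapes).
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(byte_lengths(ch), "fits in UCS-2:", fits_in_ucs2(ch))
```

The letter A takes one byte in UTF-8 and two in UTF-16, while the emoji takes four bytes in both and is the only one of the four that UCS-2 cannot hold.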

The difference between a character and its encoding is this: the character is the complete letter or symbol, for example the letter A or a Chinese ideograph. Think of the encoding as the configuration of zeros and ones (bits and bytes) that the computer/software uses to represent that character. Characters beyond the basic English set need more bytes. People often refer to languages such as Chinese or Japanese as double byte, because it takes at least two bytes to represent their characters. However, more than two bytes can be involved; in UTF-8, for instance, most Chinese and Japanese characters take three.
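The distinction also works in reverse: the same bytes turn into different characters depending on which encoding the software assumes. Here is a small Python sketch of that effect; the variable names and the sample word are my own, and the scenario (text stored in one encoding and read back in another) is an assumed example rather than a specific product's behavior.

```python
original = "café"                      # 4 characters

# Store the text as UTF-8: the accented letter becomes two bytes (0xC3 0xA9).
stored = original.encode("utf-8")      # 5 bytes

# Read the same bytes back with the right and the wrong encoding.
right = stored.decode("utf-8")         # "café"
wrong = stored.decode("latin-1")       # "cafÃ©", each byte read as its own character

print(len(original), "characters,", len(stored), "bytes")
print("round trip intact:", right == original)
print("misread as Latin-1 intact:", wrong == original)

# The opposite mistake fails outright: Latin-1 bytes are not valid UTF-8 here.
legacy = original.encode("latin-1")    # 4 bytes, ends in 0xE9
try:
    legacy.decode("utf-8")
    legacy_readable = True
except UnicodeDecodeError:
    legacy_readable = False
print("Latin-1 bytes readable as UTF-8:", legacy_readable)
```

Garbled output like the misread string is the usual symptom when a file, database column, or web page declares one encoding and actually contains another.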

All this is important because it’s an area that can really foul up your localization efforts if you don’t get it right. Encoding issues can be quickly located in source code and database schema using Globalyzer’s static analysis.

I wrote a more detailed article about this a while back for MultiLingual Computing, and it’s available here.

A video also describing this (we shot it in one take and with no rehearsal) is available here:

Choosing the right encoding is based on market requirements first, but also relies on technical requirements that arise from your software architecture. Supporting Unicode is always desirable, but technical and business case constraints can affect your internationalization strategy.
