Internationalization Management Tips: 10 Mistakes to Avoid

It’s extremely common for us to work with clients who have had a bumpy past with regards to internationalization. Sometimes you have to learn things the hard way, but that is always expensive.

In the past I’ve written about ten tips for managing internationalization projects. Here’s a look at mistakes that I’ve commonly seen repeated on the client side. In our services practice at Lingoport, we often have to council our clients through one or more of these sorts of process issues, which is actually a very rewarding part of what we do. While this list is pretty high level, we’ve seen that the processes involved can set up cascading failures that eventually can have a serious impact on a project’s success. Some apply more to internationalization of existing applications; others can apply to development where internationalization is planned in from the point of conception (still kind of a rare thing, but gaining).

So, here are 10 internationalization process mistakes to avoid:

1. Don’t forget what drives internationalization:

Money on the top and bottom lines of your company’s balance sheet. The point here is that the costs of being late or lousy endure way beyond benefits of cutting corners on development. Internationalization happens because of a:
a. New customer(s) sale
b. New partnership
c. Strategic initiative backed by marketing, legal and other types of efforts and investments

2. Don’t assume internationalization is just an older software legacy issue.

It comes up surprisingly often that people even in our industry think that internationalization is mainly an issue for older applications. No framework, whether it’s J2EE, .Net, Ruby on Rails, PHP or whatever is new and improved, internationalizes itself. You still need to do all the steps necessary to implement locale and all the associated internationalization practices. Many newer programming platforms do an excellent job of internationalization support, which is great news as you can estimate and execute with a higher degree of accuracy. But you still have plenty of work to do.

3. Don’t assume you can treat internationalization like any other feature improvement when it comes to source control management.

With internationalization source control can need an extra step of thinking things through. It’s very typical for new feature development and bug fixing to be going on in parallel to internationalization efforts. However, in the process of performing internationalization, you are going to be breaking major pieces of functionality within your application as you make large changes to your database and other application components. In order for respective developers to work on their own tasks and bugs, you typically need to branch code, often with specifically orchestrated code merges.

4. Don’t assume internationalization is just a string externalization exercise.

Prevent corrupted software strings with Lingoport's software internationalization toolString externalization is important and highly visible, but the scope of internationalization includes so much more. For example: creating a locale framework, character encoding support, major changes to the database, refactoring of methods/functions and classes for data input, manipulation and output. How these are all approached, varies greatly based on requirements and technologies.

5. Don’t wing it on Locale

Designing how locale will be selected and managed often doesn’t get the amount of thought and planning deserved. How the application interacts with the user, detects or selects locale, and then how it correspondingly behaves is a design process needing input from an experienced architect, product marketing and the development team. This is not an area to be chosen by any one representative by fiat. It’s a whole lot of work to redo locale if it’s executed inadequately for user, business and locale requirements.

6. Don’t create your very own internationalization framework

Don’t even do it if you think you know better. We regularly run into clients who have half-way implemented internationalization using their own homegrown methods for string extraction and locale management when there were already well establish methods provided within their programming language framework or established solutions like ICU. Using these will ensure that your code is far easier to maintain, and you’ll know that thousands of applications have used them successfully before you. No unpleasant surprises.

7. Don’t think that the team internationalizing your software can work without a working build

This seems obvious, but it comes up lots. Without a working build, the developers can’t smoke test the changes they are making. Even if you provide a dedicated QA person, my own experience is that developers need to be able to compile and run themselves to head off problems later. It’s too hard to rely on reconstructing coding errors at a later time and make for unnecessary bug fixing iterations, lost time and poor quality.

8. Don’t run out of money

Internationalization planning often suffers from underscoping. At Lingoport, we have both software and well established methodologies for estimating internationalization, as we really don’t want to ever break this rule and have to ask our clients for more funding. Same should hold true for internal efforts. Lapses in funding can cause expensive delays, as new funding takes more time than anyone imagined to get approved. It also reduces management credibility. And chances are, if you need to ask for more money, than you also need more time, which brings you back to consequences regarding tip #1.

9. Don’t use a half thought-out character encoding strategy

Use Unicode, rather than native encodings. If you have budget and time constraints and you’re only targeting dominant languages in markets like Western Europe, North and South America, you can often get away with ISO Latin – 1, but even for Eastern European languages, go Unicode. Then when you do, make sure your encoding works all the way through the application. And don’t forget that if your customer needs to support worldwide customers themselves (e.g. enterprise software), they may need you to support Unicode data processing even if the interface remains in English. One more consideration tilting toward Unicode is that programming languages like C# and Java already internally pass strings and data as Unicode, so you might as well think about engineering for the world.

10. Don’t use your same testing plan, or just rely on localization testing, when your functional testing needs to grow to include internationalization requirements

In our services projects, we always put special emphasis on working through pseudo-localization of not only the interface, but sending test data using target character sets, locale altered date/time formats, phone numbers and more, from data input to database, to reports and so on. If your testers are English only speakers, that’s fine. For example, we have a utility, PseudoJudo in one Globalyzer that puts target language buffer characters surround English strings. You can expand data fields to fit physically longer strings giving room for translation changes in sizing as well as encoding.

11. Bonus Tip: Don’t assume localization is just someone else’s problem

It’s funny how many of our customers are strictly concerned with software development and don’t actually have anything to do with localization processes. We always work to bring together localization into the internationalization effort. We do this by interfacing localization resources early on, helping them understand the technical requirements and then feeding translators strings that we extract on the front end of projects, so that when internationalization functional testing is done, we are immediately ready to perform linguistic translation testing and ultimately deliver a finished product. This compresses times to global release, while also making for a more fluid process, less programming iterations and higher quality.

Unicode and Internationalization Primer for the Uninitiated

Among our friends and clients at Lingoport, we regularly see ranges of confusion, to complete lack of awareness of what Unicode is. So for the less- or under-informed, perhaps this article will help. The advent of Unicode is a key underpinning for global software applications and websites so that they can support worldwide language scripts. So it’s a very important standard to be aware of, whether you’re in localization, an engineer or a business manager.

Unicode and Internationalization

Html Unicode and XML character encoding

Firstly, Unicode is a character set standard used for displaying and processing language data in computer applications. The Unicode character set is the entire world’s set of characters, including letters, numbers, currencies, symbols and the like, supporting a number of character encodings to make that all happen. Before your eyes glaze over, let me explain what character encoding means. You have to remember that for a computer, all information is represented in zeros and ones (i.e. binary values). So if you think of the letter A in the ASCII standard of zeros and ones it would look like this: 1000001. That is, a 1 then five zeros and a 1 to make a total of 7 bits. This binary representation for A is called A’s code point, and this mapping of zeros and ones to characters is called the character encoding. In the early days of computing, unless you did something very special, ASCII (7 bits per character) was how your data got managed. The problem is that ASCII doesn’t leave you enough zeros and ones to represent extended characters, like accents and characters specific to non-English alphabets, such as you find in European languages. You certainly can’t support the complex characters that make up Chinese, Korean and Japanese languages. These languages require 8-bit (single-byte) or 16-bit (double-byte) character encodings. One important note on all of these single- and double-byte encodings is that they are a superset of 7-bit ASCII encoding, which means that English code points will always be the same regardless the encoding.

The Bad Old Days

In the early computing days, specific character single- and double-byte encodings were developed to support various languages. That was very bad, as it meant that software developers needed to build a version of their application for every language they wanted to support that used a different encoding. You’d have the Japanese version, the Western European language version, the English-only version and so on. You’d end up with a hoard of individual software code bases, each needing their own testing, updating and ongoing maintenance and support, which is very expensive, and pretty near impossible for businesses to realistically support without serious digressions among the various language versions over time. You don’t see this problem very often for newly developed applications, but there are plenty of holdovers. We see it typically when a new client has turned over their source code to a particular country partner or marketing agent which was responsible for adapting the code to multiple languages. The worst case I saw was in 2004 when a particular client, who I will leave unmentioned, had a legacy product with 18 separate language versions and had no real idea any longer the level of functionality that varied from language to language. That’s no way to grow a corporate empire!

ISO Latin

A single-byte character set that we often see in applications is ISO Latin 1, which is represented in various encoding standards such as ISO-8859-1 for UNIX, Windows-1252 for Windows and MacRoman on guess what platform. This character set supports characters used in Western European languages such as French, Spanish, German, and U.K. English. Since each character requires only a single byte, this character set provides support for multiple languages, while avoiding the work required to support either Unicode or a double-byte encoding. Trouble is that still leaves out much of the world. For example, to support Eastern European languages you need to use a different character set, often referred to as Latin 2, which provides the characters that are uniquely needed for these languages. There are also separate character sets for Baltic languages, Turkish, Arabic, Hebrew, and on and on. When having to internationalize software for the first time, sometimes companies will start with just supporting ISO Latin 1 if it meets their immediate marketing requirements and deal with the more extensive work of supporting other languages later. The reason is that it’s likely these software applications will need major reworking of the encoding support in their database and functions, methods and classes within their source code to go beyond ISO Latin support, which means more time and more money – often cascading into later releases and foregone revenues. However, if the software company has truly global ambitions, they will need to take that plunge and provide Unicode support. I’ll argue that if companies are supporting global customers, and even not doing a bit of translation/localization for the interface, they still need to support Unicode so they can provide processing of their customer’s global data.


We come back to Unicode, which as we mentioned above, is a character set created to enable support of any written language worldwide. Now you might find a language or two lacking Unicode support for its script but that is becoming extremely isolated. For instance, currently Javanese, Loma, and Tai Viet are among scripts not yet supported. Arcane until you need them I suppose. I remember a few years ago when we were developing a multi-lingual site which needed support for Khmer and Armenian, and we were thankful that Unicode had just added their support a few months prior. If you have a marketing requirement for your software to support Japanese or Chinese, think Unicode. That’s because you will need to move to a double-byte encoding at the very least, and as soon as you go through the trouble to do that, you might as well support Unicode and get the added benefit of support for all languages.


Once you’ve chosen to support Unicode, you must decide on the specific character encoding you want to use, which will be dependent on the application requirements and technologies. UTF-8 is one of the commonly used character encodings defined within the Unicode Standard, which uses a single byte for each character unless it needs more, in which case it can expand up to 4 bytes. People sometimes refer to this as a variable-width encoding since the width of the character in bytes varies depending upon the character. The advantage of this character encoding is that all English (ASCII) characters will remain as single-bytes, saving data space. This is especially desirable for web content, since the underlying HTML markup will remain in single-byte ASCII. In general, UNIX platforms are optimized for UTF-8 character encoding. Concerning databases, where large amounts of application data are integral to the application, a developer may choose a UTF-8 encoding to save space if most of the data in the database does not need translation and so can remain in English (which requires only a single byte in UTF-8 encoding). Note that some databases will not support UTF-8, specifically Microsoft’s SQL Server.


UTF-16 is another widely adopted encoding within the Unicode standard. It assigns two bytes for each character whether you need it or not. So the letter A is 00000000 01000001 or 9 zeros, a one, followed by 5 zeros and a one. If more than 2 bytes are needed for a character, four bytes can be combined, however you must adapt your software to be capable of handling this four-byte combination. Java and .Net internally process strings (text and messages) as UTF-16.

For many applications, you can actually support multiple Unicode encodings so that for example your data is stored in your database as UTF-8 but is handled within your code as UTF-16, or vice versa. There are various reasons to do this, such as software limitations (different software components supporting different Unicode encodings), storage or performance advantages, etc.. But whether that’s a good idea is one of those “it depends” kinds of questions. Implementing can be tricky and clients pay us good money to solve this.

Microsoft’s SQL Server is a bit of a special case, in that it supports UCS-2, which is like UTF-16 but without the 4-byte characters (only the 16-bit characters are supported).

GB 18030

There’s also a special-case character set when it comes to engineering for software intended for sale in China (PRC), which is required by the Chinese Government. This character set is GB 18030GB 18030, and it is actually a superset of Unicode, supporting both simplified and traditional Chinese. Similarly to UTF-16, GB 18030 character encoding allows 4 bytes per character to support characters beyond Unicode’s “basic” (16-bit) range, and in practice supporting UTF-16 (or UTF-8) is considered an acceptable approach to supporting GB 18030 (the UCS-2 encoding just mentioned is not, however).

Now all of this considered, a converse question might be, what happens when you try to make your application support complex scripts that need Unicode, and the support isn’t there? Depending upon your system, you get anything from garbled and meaningless gibberish where data or messages become corrupted characters or weird square boxes, or the application crashes forcing a restart. Not good.

If your application supports Unicode, you are ready to take on the world.

I18n JavaScript – the Good, the Bad, and the Ugly

i18n JavaScript: Given JavaScript’s status as the de facto browser client scripting language, and given the international nature of the Internet, it was inevitable that JavaScript and internationalization (i18n) would eventually cross paths. Fortunately, in this day and age of Unicode, character corruption can be avoided if care is taken to make sure JavaScript is using it. Unfortunately, strings are hard coded in JavaScript and locale-specific methods are unpredictable, making localization more difficult.

To continue reading, and to see how JavaScript strings and data formatting can be supported by your selected locale, please fill out the form below. A brief preview:

Assuming currentLocale is set to English (US), the resulting code block should look like this:

Current Locale Resulting Block | Internationalize JavaScript


Enter Your Information to Download the White Paper

  • This field is for validation purposes and should be left unchanged.