by Adam Asnes, President, Lingoport
As appeared in Multilingual Magazine
Chances are you’ve seen corrupted data, but perhaps didn’t think too much about it unless you’re a localization engineer. Most people see it first in their spam, coming with promises of Euro-Lottery millions or other nefarious offers. The corruption evidence is in the square boxes or random nonsensical characters that fill the subject heading or email body, if you haven’t deleted it already. What’s happening is that somewhere along the way, or in your mail client, the character encoding the message is written in is not being supported. Obviously you wouldn’t feel very confident using a product, site or system that suffers this same issue, so it’s a clear defect. Sometimes you even see it when everything is still all English, most notoriously when somewhere along the way the software system you are using can’t process a simple apostrophe.
Remember that all data on computers ultimately breaks down to zeros and ones. These values are then interpreted to form characters and then strung together as words or symbols. Corruption occurs when the interpretation of the encoded zeros and ones does not form the intended character. For example, the application thinks the encoding of a character is ISO-Latin 1 rather than UTF-8 and so displays the wrong character. We have run into several internationalization services customers over the years that have inadvertently corrupted character data buried within large databases. Here’s an example of how bad this can get:
Imagine your company is a world leader for building heavy machinery and construction equipment. You have a massive parts catalog. Over time, an unknown amount of data has experienced character corruption. The characters are no longer humanly readable. They look like gobbledygook. Or, you have a complex online customer management system with a large database of users and corresponding account information with broken character encodings sprinkled throughout.
In each case there are too many occurrences peppered throughout the data to review and manually decipher what the original intent of the content was. You can imagine the panicked conversations when the broken characters are discovered. “Oh σηιτ, look at this! How the φυχκ are we going to fix this!”
Often the instances are too scattered and it’s too difficult to roll back to previous versions of the data, as everything new would be lost, and it may not be known just when the character corruption might have started happening.
The corruption occurs in the first place when there’s some source in the application or process or reviewing data breaks the encoding. For example developers may have implemented a web page form that isn’t properly set up to return data in the correct encoding. Another possibility is that someone manually imported new data into the database, but used an editor that is not set up to handle, say UTF-8 encoding. The culprit might be as innocent as using Notepad incorrectly.
At this point, this conversation has happened with clients several times a year, and in every case, these clients already happened to be working with us in some capacity, whether on service projects or licensing our Globalyzer software. I suspect the problem isn’t actually all that uncommon. So we finally decided to take some of the advice I’ve been trumpeting in this column and productize some of our solutions. At the time of this writing, we haven’t decided on a product name yet, so we affectionately call this solution The Decombobulator. We’ll probably officially release it as something boring like db Ambassador, but we’ll always call it the Decombobulator internally because it sounds funnier. Check our website to find out if humor or practicality wins out (remember that we are probably the only company using an icon of a toilet plunger as part of an interface and utility names like PseudoJudo). In fact, I encourage you to contact me if you’d like to vote on it or suggest a better name.
So here’s how we solve this problem. The Decombobulator runs on your data or database, reviewing characters at the byte level and reporting the results. It then helps you compare character encoding to the intended encoding and then reports, suggests and helps automate the correction back to what the character was intended to be.
Here’s an example using corrupted names from a database which initially had problems with some cases of extended characters:
I’ll add that we’ve seen strings that clients have submitted to their localization vendor which also have the same types of instances of corruption. Often this happens when someone opens a file, just to check that the data is there in the first place, but then saves it again without the proper character encoding settings. The localization firm then has a number of isolated strings, perhaps including past translations, which are now broken.
I’m not illustrating all this as a sales pitch. I somehow doubt we’ll sell very much of the Decombobulator, but for the people that need it, it will be a lifesaver. In fact, much of the development and productization of the Decombobulator happened without my knowledge and even in part against my intentions. One of our team just took it upon himself to take extra time while getting his other work done, to enhance what we had and put it together. I bring this all up because in your business, you likely encounter some problems just like this which are just begging for a repeatable and scalable approach that will make you a savior to your client or coworkers. And if you can repackage it for the benefit of your organization or clientele, you’ve just created a significant differentiating value. That’s what people love to buy, whether it’s you selling your continued employment or cementing a client relationship. This doesn’t mean you learn software development on the side if you’re not a developer. Every process presents its own opportunities.
The economy is rough out there. I won’t bother parroting what you’re no doubt reading. It may be that one of the few bright spots is still the language services and technology industry. I talk to quite a few CEO’s of localization companies and they all seem to be reporting that business is holding up, but they are crossing all their fingers and toes that it stays that way. If I were in the automobile or furniture business in the US, I’d be beyond scared. But the fact is that the entire language computing industry directly connects to helping technology firms make more money. Notice I didn’t say save money. While that’s important too, making money always wins. So the way that we differentiate our industry and for our clients and co-workers is by innovating in ways that get work done faster, better and cheaper, so that someone can sell something more effectively anywhere in the world. And that’s just great business.