I18n in Software Design, Architecture and Implementation

 

One of the problems software engineers face when internationalizing a product is discovering too late that the product design cannot accommodate international requirements. This article aims to help designers, architects, implementers, testers, and writers understand the areas where internationalization can affect appearance and structure. Common misconceptions about internationalization are also discussed.

Internationalization is often abbreviated as i18n, since there are 18 letters between the i and the n. Like the abbreviation int'l, no capitalization should be used except in title case.

A general comment about universal design: Making one product with the same interface for the entire world will mean some sacrifices in usability for specific localities and cultures. This is a known trade-off. The alternative is to have many customized interfaces and task flows designed for various markets around the world. For most products, this is fiscally unrealistic. In some cases, companies prefer to maintain a standard look regardless of where the product is sold. If your company plans to localize the product, including images, for every market the product is sold into, then you don't need to consider universal design for the visual aspects of the user interface. But the architectural considerations will still be the same, since the goal is to create a single code base for easy maintenance.

Charsets and Locales

Charsets and locales are the two basic building blocks of internationalization. A charset is the name of a character encoding scheme, that is, a byte-level representation of a coded character set. For more information on character set terminology, please see RFC 2130 - The Report of the IAB Character Set Workshop, 29 February - 1 March, 1996. Locales are indicators of language and cultural formatting. For more detailed information on locales, see the chapter on locales in the Solaris 8 International Language Environments Guide, http://docs.sun.com/db/doc/806-0169/.

All textual data has a charset associated with it, and that charset must be known in order to process the data correctly. Any time data is passed from one program module to another, the charset must be indicated either explicitly or implicitly. The charset can be defined as part of the protocol or API, or can be explicitly provided in a header or descriptor area. Note that it may still be too soon to serve UTF-8 or UTF-16 to end users, although this can be acceptable for sophisticated corporate users.
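As a minimal sketch of why the charset must travel with the data, the Java fragment below decodes the same bytes twice: once with the charset the sender actually used, and once with a wrong guess. The class and method names are illustrative only; in a real product the charset name would come from a protocol header or API contract rather than a hard-coded string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetTagDemo {
    /** Decodes bytes using the charset named by the sender (for example, in a header). */
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8; the \u escape keeps this source file ASCII-only.
        byte[] wire = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(wire, "UTF-8"));      // decoded with the correct charset
        System.out.println(decode(wire, "ISO-8859-1")); // decoded with a wrong guess: garbled
    }
}
```

The second call does not fail; it silently produces the wrong characters, which is why the charset has to be indicated rather than guessed.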

Data formats, especially numeric formats, are frequently based on locale. Locale is most often a user property. As a result, locale formatting should be dynamic. Instead of putting formatting into a localizable resource file, the program should determine the locale of the current user and format the data accordingly. The key is detecting what the current user locale is - not always a straightforward task. Sometimes the locale must be specified by the user in some sort of preferences area in order to find out the correct one, and this must be designed into the product. Formats should be taken from the underlying platform, for a couple of reasons:

  1. The platform groups have spent quite a lot of time researching and assembling this information.
  2. There are no official standards for locale formats, since even within some national standards bodies the members cannot agree.
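The idea of formatting dynamically from the platform's locale data, rather than storing a format in a resource file, might look like the following in Java. This is only a sketch: the class name is made up for illustration, and how the user's locale is detected is left to the caller.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class LocaleNumberDemo {
    /** Formats a number with the platform's conventions for the given user locale. */
    static String formatFor(Locale userLocale, double value) {
        return NumberFormat.getNumberInstance(userLocale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(formatFor(Locale.US, 1234567.5));      // 1,234,567.5
        System.out.println(formatFor(Locale.GERMANY, 1234567.5)); // 1.234.567,5
    }
}
```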

Platform locales do not include everything you may need for a locale. For example, address, telephone number, and proper name formats are not usually available. There are some centralized sources for this sort of information on the Web; they are mostly not authoritative, but are usually based on solid research.

Note: The locale information from one platform to another, even within the same company, does not necessarily correspond. For example, Solaris locale formats are not identical to Java locale formats.

Data Flow

Consider the entire flow of data through the product. For example, in a calendar application, the data might go through the following processes:

  1. The user creates a new calendar entry: data input in a particular charset.
  2. The submitted entry is processed into protocol format: possible charset conversion and data parsing issues, protocol parameter setting according to language or charset.
  3. The protocol stream is then sent to the server for entry into the calendar: protocol parameter interpretation, charset acceptance, storage tagging.
  4. The user views a calendar: request sent to server for calendar.
  5. The request is formatted: date range, which may require date format parsing, other specifics sent to server.
  6. The server returns the requested information: data formatted into protocol stream with parameters set to describe language, charset, or other areas affecting international data.
  7. The protocol stream is reformatted for output back to the user: charset conversion may be necessary, locale-specific formats for dates and other fields are applied, data parameters for output format must be included.

Obviously a calendar application has many more areas of data flow and processing, but this is the approach to take when inspecting for design issues in the data processing portion of the product. Remember, everywhere data is touched, be it from the user, from another product such as a directory, or from the program resource files, internationalization must be considered.

User Interface (UI)

The UI is the most exposed and obvious portion of any software package. As a result, it tends to be the first component people think of when they think of internationalization. And so it is the birthplace of many myths and misconceptions.

In addition to receiving more attention from the product's creators when they plan for international markets, the UI can also receive disproportionate attention from international users. If something is wrong in the UI, users will notice it immediately and probably report it almost as fast. So it is important to design the UI correctly from the beginning. Aspects of UI design are covered throughout this article.

Shared libraries

One of the easiest ways to ensure product interoperability is to share libraries. Since international data can be processed in numerous ways, sharing libraries is essential to ensure the integrity of the data.

There are a couple of public cross-platform internationalization interfaces or libraries: Java and International Components for Unicode (ICU). Java is, of course, an entire language and platform. It contains a tremendous number of classes and methods for managing international data, making internationalization easy, although not foolproof! ICU is an open-source library in C/C++, which is based on IBM® code and is now a worldwide collaborative effort. ICU can be considered a C/C++ counterpart to the Java classes and methods; it is a cross-platform library that includes locale-specific formatting, language-specific sorting and searching, and a resource file mechanism. There is also ICU4J, a supplemental Java library with additional i18n classes. Be aware that when Java implements the functionality currently available in ICU4J, the classes and methods may not be compatible.

Additional functionality specific to a product must be written by the product group. Bear in mind, however, that if other products use your product, then these other products may need the functionality in your product-specific i18n library. For example, a directory server may include a number of LDAP tools with i18n built in. These tools will be used by products that rely on the directory, and thus may need to incorporate additional requirements beyond the original projections based on a standalone installation.

Interaction with other products

This is an area that has not been emphasized in the past but is becoming important to the success of many products. Software products within a company can be dependent on each other and on external products. If one of the pieces is not internationalized properly, it can prevent the internationalization of the entire ensemble.

Design requirements should address the needs of dependent products. If an application server cannot process application data in the Simplified Chinese charset EUC-CN, then chances are all the applications running on the server will not be able to accept EUC-CN data. Core level products need to be especially sensitive to this. Products such as directories, user management, Web servers, etc., need to be fully internationalized with well-defined interfaces and APIs sooner than higher-level products, such as trading software.

Be especially careful about choosing a third-party component or product to include functionality. More often than not, third party products are not internationalized, and this can block internationalization in the including product. If the code rights are purchased, then the product group is often saddled with retrofitting i18n into someone else's code. Worse, sometimes the design itself is flawed, rendering the product useless to your company. Any time a third party product is evaluated for use or purchase, the product group should look very carefully at its internationalization.

Implicit inclusion vs. explicit inclusion

Data coming into a product has a charset associated with it. If at all possible, use the charset information associated with the data to feed to a library or platform that has a large set of charset handling functions enabled. That is, rather than enabling charsets one by one as customers "ask" for them, enable as many as possible. The same holds true for locales - enable all that you can. The main drawback to this approach is that there usually aren't enough resources to test all charsets and locales that are enabled. Spot check all that can be fit into the schedule, making sure a good cross-section is covered. Unless you have a fault-tolerant or highly accurate software requirement, it is usually more satisfying for customers to have their charset and locale mostly working with a few small bugs than not working at all. It is difficult to completely predict all the languages and locales in which your product will be used. Try to provide functionality for as wide a variety of data, formats, layouts, etc., as possible.

International requirements

Some internationalization does not involve the processing of international data at all. Different markets have different requirements; this is why companies divide their sales areas into separate regions. Customers in other parts of the world use products for other purposes than those in the US. They approach products differently, and they may have requirements that are not included in a US specification. It is important when gathering requirements for a product that international customers, sales personnel, and marketers are consulted.

Some clear examples come from the auto industry. In England, people use their headlights to signal other drivers for various courtesies, usually letting someone else into traffic or giving someone else right-of-way. They need a function to flash their headlights even when they aren't switched on. The controls in a car made for the US wouldn't work, particularly in cars that have retracting headlights. The Acura/Honda Integra, which has retracting headlights, has a switch in models sold in England that lifts up the headlights, flashes them once, and retracts them again. Another example of international requirements for cars comes from India. Cars there are required to emit a noise when reversing, much like trucks and electric cars in the US. Note that these requirements have nothing to do with the local language or data; instead they are to facilitate the local way of conducting business.

Performance

Sometimes making changes for internationalization affects the performance of the software. While performance degradation is not an excuse for not internationalizing the code, it should be addressed. There are some strategies you can use to avoid severe performance hits.

Message catalog access often affects the time it takes for the product to start up or load, but usually doesn't cause significant degradation during regular processing. The exception is when user messages are constantly being accessed, building windows and screens frequently. If the product is dynamically building data screens throughout the entire process, it is worth checking to see if there is any major slowdown in localized versions. Pseudo-localization* is useful for testing this area. Caching or using static templates can often solve this problem.

* Pseudo-localization is the process of programmatically "translating" all the localizable resources of a product, then building and testing the product as if it were a real localization. It can uncover many bugs in the localizability of a product, such as in the mechanism for detecting the correct language files to load, in displaying any UI elements that have not been externalized into a localizable resource, in the expansion capabilities in the UI design, as well as in the performance in processing localized resources.
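A pseudo-localizer can be very small. The sketch below is one possible approach, not a standard tool: it swaps unaccented vowels for accented ones, pads the string by roughly half its length to simulate text expansion, and wraps it in brackets so that truncated or hard-coded strings stand out on screen. The class name, the padding ratio, and the bracket markers are all arbitrary choices made for illustration.

```java
final class PseudoLocalizer {
    private static final String PLAIN = "aeiouAEIOU";
    // Accented counterparts, in the same order as PLAIN.
    private static final String ACCENTED =
            "\u00e1\u00e9\u00ed\u00f3\u00fa\u00c1\u00c9\u00cd\u00d3\u00da";

    /** Returns a fake "translation" of a resource string for testing purposes. */
    static String pseudoLocalize(String source) {
        StringBuilder out = new StringBuilder("[");
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            int idx = PLAIN.indexOf(c);
            out.append(idx >= 0 ? ACCENTED.charAt(idx) : c);
        }
        // Pad by about 50% to simulate expansion in translation.
        int padding = (source.length() + 1) / 2;
        for (int i = 0; i < padding; i++) {
            out.append('~');
        }
        return out.append(']').toString();
    }
}
```

Note that this sketch does not protect placeholders or markup inside the string; a real pseudo-localization pass would need to skip them.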

Another area of internationalization that can affect performance is the process of converting data from one charset into another. For example, if the product is searching data stored in a variety of charsets, it needs to convert all the data into the same charset in order to compare the search string, especially if fuzzy matching is needed. Depending on how much data is being converted, this could significantly affect search performance. Investigate ways of avoiding dynamic conversion, and instead convert all the data into Unicode before it is stored. Then only the search string needs to be converted at search time, and all the matching algorithms can be written with Unicode code points.
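The store-as-Unicode strategy can be sketched as follows. Records arriving in different charsets are decoded once, when they are stored; at search time only the query is converted. The class is hypothetical and deliberately simplistic (an in-memory list and a plain substring match), and it does not address Unicode normalization or fuzzy matching.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class UnicodeStore {
    // Java strings are Unicode (UTF-16) internally, so decoded records share one representation.
    private final List<String> records = new ArrayList<>();

    /** Converts incoming bytes to Unicode once, at storage time. */
    void store(byte[] data, Charset sourceCharset) {
        records.add(new String(data, sourceCharset));
    }

    /** Converts only the query at search time; stored records need no conversion. */
    boolean contains(byte[] query, Charset queryCharset) {
        String needle = new String(query, queryCharset);
        for (String record : records) {
            if (record.contains(needle)) {
                return true;
            }
        }
        return false;
    }
}
```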

Pay attention to intensive text parsing processes, layout, and large amounts of locale-related formatting. Plan to test performance of the product with international data throughout the product, including user interface, data stores, templates, etc. Make sure that the locale settings on the machines in the performance test are varied to reflect a multinational customer configuration, or are set to a single locale which is not C or en_US. Having Asian language configurations loaded on a machine may affect performance, so try running tests on machines with Asian setups in an Asian locale.

In some extreme cases, it may help to provide configuration options which limit processing to certain charsets or locales, reducing the amount of conversion and formatting.

Localizability

As confusing as it may sound, localizability is an aspect of internationalization, not localization. It involves enabling the product to be easily localized, without manipulating code or having to edit files that contain code.

This can get a little tricky, as many products rely on code embedded in HTML, such as JavaScript and JSPs. Other products are starting to use XSL with embedded text. While using this method does not necessarily make localization more difficult, it does present a problem for patch releases or service pack updates. If any changes are made to the JavaScript, JSP, or XSL code in the files after the product is localized, then the localized elements must be reinserted into the updated files. Even if it is possible to use a tool to insert the translated text (which often isn't true), a separate localized patch or service pack must be issued, and all the accompanying process for patch releases must be followed. This adds significantly to localization costs, which are usually not budgeted.

Other areas of localizability are messages and UI elements; UI elements will be covered in later sections. Messages are the primary method for communicating with the user; therefore it is imperative that they are not hard-coded so that they can be localized. The only code that should coexist with messages is enough to enable handling the messages as an external resource, such as the code in a Java ResourceBundle. Whether or not a product is to be localized is a decision that should be based on business issues, never technical limitations.
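To illustrate messages handled as an external resource, the sketch below looks up a message by key in a Java ResourceBundle. To keep the example self-contained, the bundle contents are inline strings standing in for properties files; the key name, the class name, and the German wording are assumptions made for illustration. In a real product the text would live in separate files (for example Messages.properties and Messages_de.properties) loaded with ResourceBundle.getBundle.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

final class ExternalMessages {
    // Stand-ins for the contents of external, localizable properties files.
    static final String ENGLISH = "file.saved=The file was saved.\n";
    static final String GERMAN = "file.saved=Die Datei wurde gespeichert.\n";

    /** Builds a bundle from properties-format text. */
    static ResourceBundle load(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The program refers to messages only by key, never by literal text. */
    static String message(ResourceBundle bundle, String key) {
        return bundle.getString(key);
    }
}
```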

Some additional tips for localizability:

Never hard-code anything. Even constants, literals, and similar items can be defined in the definition section of a program, or in a header file. This is more than i18n; it's structured coding.

Do not restrict font size. Asian characters require a much larger point size for legibility than Latin characters. If a small font size is necessary to fit all the English text into a layout, then the layout needs to be redesigned. See the layout section for more information.

Be aware that the operating system(s) your program is running on may also be localized. Messages may be translated, even in system commands. If you need to parse the output of an OS command, change the environment locale for that command, e.g.

env LC_ALL=C date
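The same technique can be applied when the command is launched from a program. As a rough Java sketch (class and method names are illustrative), the child process environment is given LC_ALL=C before the command starts, so its output can be parsed in a predictable form regardless of the locale of the parent process:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class CLocaleCommand {
    /** Prepares a command that will run with LC_ALL=C in its environment. */
    static ProcessBuilder inCLocale(String... command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().put("LC_ALL", "C");
        return pb;
    }

    /** Starts the command and returns the first line of its output. */
    static String firstLine(ProcessBuilder pb) throws IOException {
        Process process = pb.redirectErrorStream(true).start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.US_ASCII))) {
            return reader.readLine();
        }
    }
}
```

For example, CLocaleCommand.firstLine(CLocaleCommand.inCLocale("date")) would run the command shown above on a system where date is available.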

All items that may be viewed, created, changed, or used by the customer in any way should be put into separate resource files. This includes component labels, messages (status, error, help), command stackers, field separators, templates, and window titles, among other things. Command shortcuts and hotkeys need to be localizable; they too should be in resource files. Add comments to clarify.

Group resources logically, preferably by screen and function. The more contextually related resources are, the easier they are to localize. Add comments for the localizers; you can never have too many comments. These comments come in handy for program revisions as well.

Keep resource file and document organization consistent. As the product is revised, try to keep the same resource files where appropriate. Maintain the same document and help screen structure as much as possible. If there's a need to change the document or file organization, track the movement of the old text so that past translations can still be leveraged.

Determine field lengths by function whenever possible. Avoid fixing the lengths in the code. Strings and currency values will expand when localized, and data in other languages may require more space (imagine the price of something expressed in Turkish Lira, which is currently over TRL 1,500,000/USD 1.00). If there is a size limitation for a string or value, add a comment next to it stating the size.

Use and store messages in entire phrases; don't build them from pieces. This includes UI phrases formed by list box choices or numeric values in combination with fixed text. Word order and grammar change drastically from language to language. You may use placeholders for information which comes from the system (e.g. file names, available disk space) as long as they can be reordered - please add comments in the resource file as to what the placeholders represent.
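In Java, reorderable placeholders are available through java.text.MessageFormat. The sketch below stores each message as a whole phrase with numbered placeholders. The second pattern is a made-up stand-in for a translation whose word order differs; it is not an actual translation, and the class name is illustrative.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class ReorderableMessage {
    // {0} = volume name, {1} = free space in megabytes (document this in the resource file).
    static final String ORIGINAL = "{0} has {1} MB free.";
    // Hypothetical pattern for a language that needs the placeholders in the opposite order.
    static final String REORDERED = "{1} MB free on {0}.";

    /** Fills in a whole-phrase pattern; the pattern, not the code, decides the word order. */
    static String format(String pattern, Locale locale, String volume, int freeMegabytes) {
        return new MessageFormat(pattern, locale).format(new Object[] {volume, freeMegabytes});
    }
}
```

The calling code is identical for both patterns, so a localizer can change the order of the placeholders without touching the program.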

Create a new message for each context, even if the wording is identical in English. Different contexts may require different translations in other languages.

Use Standard English when creating program and document text. Avoid slang, jargon, cutesy American expressions, and marketing hype. Beware of acronyms and abbreviations. In most products, the English UI must be internationally acceptable.

Track changes with each revision. Note which windows, chapters and help screens, etc. change. When making changes to a resource file, add new messages to the end of the related section. If an existing message changes semantically (as opposed to correcting English spelling or grammar), comment out the old message and add the changed one.

Data handling

Every time the program needs to do something with data, internationalization plays a role. Some of the processes applied to data that are affected by internationalization are:

  • Display
  • Read
  • Sort
  • Search
  • Parse
  • Compress
  • String format
  • Character index and count
  • Word wrap
  • Hyphenate
  • Numeric format
  • Date format
  • Protocol format

Display can involve charset conversion, font selection, font size, processing through a special rendering engine, positioning on the screen, string formatting, word wrap, hyphenation, numeric and date formatting, and other processes.

Read requires charset conversion, communication with input methods, string formatting, numeric and date formatting, protocol formatting, and more.

Sort, search, word wrap, hyphenate, and sometimes parse are language-sensitive operations.
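Sorting is an easy place to see the difference. The following Java sketch (names are illustrative) sorts the same words twice: once by raw character values, and once with java.text.Collator for a given locale. With an English collator, an accented letter sorts next to its base letter instead of after "z"; other locales may order the same words differently.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class LanguageSort {
    /** Sorts by the numeric values of the characters, ignoring language rules. */
    static List<String> byBinaryValue(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    /** Sorts using the collation rules the platform provides for the locale. */
    static List<String> byLocale(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }
}
```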

Search, compress, string format, character index and count, word wrap, and hyphenate are especially sensitive to the charset of the data, since they are looking at individual characters. Of course, the charset should always be known.

Parse, string format, numeric format, and date format are particularly sensitive to the locale. However, the locale can affect other areas.

Protocol format must often include parameters describing the charset, language, and/or locale of the included data.

Here are more general guidelines for processing international data correctly:

Keep code 8-bit clean; do not use the high-order bit of a character byte for anything other than character data. Be careful when casting variables and moving values from unsigned to signed and back. This is only for the external data bits, not for flag bytes and other internal data.
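The signed/unsigned pitfall is easy to reproduce in Java, where byte is a signed type. In this small sketch (names are illustrative), a byte with the high-order bit set becomes negative when widened naively, which breaks table lookups and comparisons; masking with 0xFF recovers the intended value.

```java
final class EightBitClean {
    /** Widening a byte directly sign-extends it: 0xE9 becomes -23. */
    static int naiveValue(byte b) {
        return b;
    }

    /** Masking keeps all eight data bits and yields the unsigned value: 0xE9 becomes 233. */
    static int unsignedValue(byte b) {
        return b & 0xFF;
    }
}
```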

Allow for the possibility of more than one language at a time. Check language attributes wherever they exist, and process accordingly. This should really be done per object, but at least try to do it by thread or process.

Determine field lengths by function whenever possible. As in localizability, avoid fixing the lengths in the code.

All string handling should be done with character-set-aware functions. Collation, parsing, character counts, formatting, alignment, fuzzy matching, and other string functions require knowledge of the charset being used and/or the current locale.
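Character counts show why this matters. In the Java sketch below (names are illustrative), the same three-character string has three different "lengths" depending on what is counted: UTF-16 code units, Unicode code points, or bytes in UTF-8. Code that assumes these are equal will miscount, truncate in the middle of a character, or overflow a field.

```java
import java.nio.charset.StandardCharsets;

final class CharacterCounts {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of Unicode code points. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', e-acute (U+00E9), and musical symbol G clef (U+1D11E, a surrogate pair in UTF-16).
        String sample = "a\u00e9\uD834\uDD1E";
        System.out.println(utf16Units(sample)); // 4
        System.out.println(codePoints(sample)); // 3
        System.out.println(utf8Bytes(sample));  // 7
    }
}
```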

Use locale aware functions whenever possible. If the object itself does not have a locale associated with it, then there are other places to get the locale. Check the user preferences for the locale, if your software has a setting for it (which it should). When that's not available, poll the system for the current locale. Remember to get the locale relevant to the person viewing the data, namely, the client. Server locales will differ from client locales.
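The lookup order described above could be sketched in Java as follows. The method and its parameters are hypothetical: the two tags stand for a language tag attached to the data object and one read from the product's user preferences, both in BCP 47 form such as "de-DE". Note that the final fallback, Locale.getDefault(), returns the locale of the machine running the code; on a server that is the server's locale, not the client's.

```java
import java.util.Locale;

final class LocaleResolver {
    /**
     * Picks a locale in order of preference: the object's own tag,
     * then the user's stored preference, then the system default.
     */
    static Locale resolve(String objectTag, String userPreferenceTag) {
        if (objectTag != null && !objectTag.isEmpty()) {
            return Locale.forLanguageTag(objectTag);
        }
        if (userPreferenceTag != null && !userPreferenceTag.isEmpty()) {
            return Locale.forLanguageTag(userPreferenceTag);
        }
        return Locale.getDefault();
    }
}
```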

Query the paper and page preferences before formatting documents for printing. Page/paper size changes with locale as well.

Graphical images

Graphical images are expensive to create, and are almost as expensive to modify. English product is usually sold all over the world. Taking these two premises into account, product graphics should be universally acceptable if at all possible.

Here are some basic guidelines for creating universally acceptable graphics:

Human figures, body parts, and especially hands should be avoided at all costs!



The problem with human figures is manifold: is it female or male, what is it wearing, what color is its skin and hair, what position is the body in, what is the figure doing? Obviously in some cultures, certain types of dress are inappropriate, whereas they are standard in other cultures. People of one sex may not be allowed to perform certain tasks in some cultures, but in others they are the primary performers of these tasks. In different parts of the world, people identify with different skin and hair color on the figures. The only acceptable human figure is a stick figure with no clothes, no hands with fingers, and no hair.

With body parts, the difficulty lies not only in which body part is being represented, but what position it is in, where it is cut off, and how it is cut off.

Hands - don't even try. There's not a hand position around which isn't offensive somewhere. And there is no hand position with universal meaning. Really. Don't forget.

No animals should be used to represent anything other than the actual animal.

Consider this graphic:

OK for the USA, maybe, but not ideal for India!

And this one:

Is this a pet? Is it a farm animal? Is it food? Depends on where you are.

Animals are powerful symbols in many cultures, and there is no universal animal symbol template. Bottom line: don't use them unless you're representing the actual animal.

Puns on English words, pictorial representations of English words, and graphics containing words are not universal.

Representing an English word with a picture of something that shares the same word but has a different meaning does not translate. For example, in one product there was an icon for representing staging. The icon was a picture of a theater stage. While this works in English, it doesn't work in other languages.

Is this a home? How about this?

Is either one of them related to a home page on the Web?

Another more common example is using a picture of a musical note to represent a message note. Again, this makes no sense to people who do not speak English.

Text in graphics can be a real nightmare. If the product is to be localized, then the graphics have to be altered at great expense. Simply use some sort of image, and keep the text separate. Use numeric callouts and place the descriptions in text above or below the graphic. If there is no way around putting the text in the graphic, follow all these guidelines:

  1. Make sure there is plenty of expansion room for the text portion of the graphic. Translations into alphabetic languages can more than double in width, and ideographic languages tend to expand vertically.

  2. Allow the font size to change. If a small font is needed in order to make the text fit, then the graphic needs to be redesigned, since it will not be translatable.

  3. Separate the text from the graphic in layers, at least for sourcing purposes. Save the graphic without text as a sort of blank, and provide that to the localization team.

Some objects are culture-specific, so verify that a particular object used in a graphic is universally understood.

Take a look at this graphic:

What is it? Where is it used? Who has one that looks like this? The red flag is up - what does that mean? In real life? Online?

The answers are that this is a US rural mailbox. People who have this sort of mailbox do not need to mail their outgoing letters in an official post box. Instead, the postal carrier will pick up outgoing mail for them, as well as delivering the incoming mail. If someone has outgoing mail, they raise the red flag. The postal carrier will lower it after picking up the outgoing letters. But online, the raised red flag is used to indicate that there is newly delivered mail in the mailbox. So not only has a location-specific symbol been used, but also it has been used incorrectly.

This is a perfect example of an assumption that everyone in the world would understand that a picture of a US rural/suburban mailbox is a mailbox. The difficulty is finding a single object that would universally illustrate a mailbox. In this case, the shape of the mailbox cannot be meaningful - mailboxes around the world come in all shapes and sizes. Instead focus on the purpose of the mailbox, as a place to receive mail. Make the box simple, and put an obvious letter or stack of letters in it. A basic letter image is universally understood, so work from there.

Some objects would be found offensive to certain cultures - take this graphic for example:

While in some cultures alcohol indicates a celebration, in others it is against religious beliefs to consume alcohol. People from the cultures prohibiting alcohol might view the above image as sinful or degenerate, not usually the impression that products mean to portray. It's best to find another type of image to portray the meaning (unless the product is, in fact, wine!)

Make sure that a single icon is not used for multiple meanings.

While this sounds like an obvious statement, it is violated all the time. The most common example is an icon (image not available) that, in a single product, is used to indicate both a link to help information and a query that requires a response. In fact, it's been known to occur in a style guide with those two meanings. And while the context makes the icon understood, needing a context to understand the icon defeats the purpose of using an icon at all.

Color

Color means different things in different cultures. What does putting this text in red mean? Does it mean, "this statement is especially important"? Does it mean that the statement is meant as a caution or warning? Is it just calling out the statement as being special? Maybe it signifies that the statement is especially positive and good? The answer is that it depends on the person reading it.

In some countries, red is a celebratory color, conveying a positive meaning. In Korea, if a person's name appears in red, it means they are deceased. White is usually associated with goodness and purity in US culture, but elsewhere means death. In addition, the distinctions between colors vary with the culture; the line between what is blue and what is green changes quite a bit between the US and Japan.

This is not to say that colors are not useful; they are. But remember that color alone cannot convey meaning; this is not just for i18n, but also for accessibility, since colorblind people will not be able to see the distinctions. It is best to use a consistent color scheme throughout a product, or better yet, throughout a line of products. Users will grow accustomed to the color scheme, e.g. red for errors, yellow for warnings, green for success.

Text

Text chosen for the UI should reflect an international English product. Avoid jargon, slang, Americanisms, cutesy phraseology, and humor. Humor does not translate well. Truly. Ha ha. Get it?

Samples and scenarios should be chosen carefully. For example, one product used a spy as a character in the tutorial, and Swedish customers found this very offensive. The best approach would be to talk to people from different cultures. Take advantage of the diversity of people in the office, as well as the field marketing and sales people.

Sound

Some sounds are culture specific. While the game show buzzer sound for incorrect answers is well known to people in the US, it is simply an unpleasant cacophonous noise with no meaning to those in other countries. In Japan, making a mistake on your computer can be personally embarrassing; broadcasting that mistake to your coworkers via a buzz or beep may cause shame. This does not boost product sales.

The best approach to including non-speech sounds is probably to make a variety of sounds and allow the user to select. There should always be an option to turn sound off. All sounds should be localizable.

Layout

Layout design must accommodate not only the fixed elements on the screen, but also the variable ones.

Fixed elements

For localizability, fixed elements must be arranged such that text can expand without requiring a great deal of rework. Alphabetic (e.g. Latin, Cyrillic, Hebrew) languages tend to expand horizontally, sometimes more than double the size of the English text. Chinese character based (e.g. Japanese, Chinese, and in this case, Korean) languages often expand vertically, since the characters are taller than Latin characters. Font sizes may need to be larger for other character sets. Allow for text expansion in all UI elements, including:

- field labels

- field separators

- titles

- user/error message areas

- buttons

- checkboxes

- radio buttons

- drop down lists

- table cells

- text in images*

*of course, there should be little to no text in images...

All elements must be not only translatable, but expandable and movable as well. It's not always possible to create a button length that makes visual sense in all languages. Consider for example the English word edit, which when translated into German becomes bearbeiten. For some screens, having a button large enough to accommodate bearbeiten would not work well for edit. Bear in mind, too, that other languages do not abbreviate as extensively as English, so abbreviation is not always a workaround. Some input method editors add an additional status line to the bottom of a window, so keep this in mind when choosing a window size.

The order of elements may need to change, especially in sorted lists. If the product has a list of radio button choices in alphabetical order, that order will likely change in a translated version. The tab order should also change to match the visual order.
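As a minimal sketch of why the order moves, the following Java fragment sorts a set of choice labels with java.text.Collator, first in English and then in translation. The class name SortedChoices and the labels are hypothetical; a real product would load its labels from a message catalog.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class SortedChoices {
    // Returns a new list with the labels sorted by the collation rules of the locale.
    static List<String> sortForLocale(List<String> labels, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(labels);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical labels, listed in the same logical order: copy, edit, view.
        List<String> english = Arrays.asList("Copy", "Edit", "View");
        List<String> german = Arrays.asList("Kopieren", "Bearbeiten", "Ansehen");
        System.out.println(sortForLocale(english, Locale.ENGLISH));
        System.out.println(sortForLocale(german, Locale.GERMAN));
    }
}
```

In this sketch the choice that is listed last in English comes first in German, so any layout or tab order that assumes the English sequence would no longer match what the user sees.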

Order should be a consideration in the UI design. If it's not necessary, it's easier to avoid forcing a particular order. One of the more difficult designs is indexing by letter of the alphabet. This is quite a common design, but for products being localized, it is not always easy to translate into non-alphabetic languages. If it appears after examining other possibilities that the index-by-letter design truly makes the most sense, check with the localization team before forging ahead with the design implementation.

Sentence order will change with different languages; so do not include a particular sentence order as part of the design. Not only does the order change, but the phrase breaks change as well, so simply allowing reordering may not be enough for a translation to look correct. For example, it's tempting to construct a calendar entry edit screen to have:

However, most languages would have to rearrange the fields, and some (in this case, the am/pm) are superfluous and need to be removed.
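The same concern applies to sentences assembled in code. One common way to leave word order to the translator, sketched here in Java, is to keep the whole sentence in the message catalog with numbered placeholders and let java.text.MessageFormat fill them in. The class name ReorderableMessage and both patterns are hypothetical, and this covers message text only, not the arrangement of fields on a screen.

```java
import java.text.MessageFormat;

final class ReorderableMessage {
    // Hypothetical catalog entries; {0} is the person, {1} is the day.
    static final String ENGLISH_PATTERN = "Meeting with {0} on {1}";
    static final String GERMAN_PATTERN = "Am {1}: Besprechung mit {0}";

    // The code always passes the arguments in the same order;
    // the pattern decides where each one appears in the sentence.
    static String render(String pattern, String person, String day) {
        return MessageFormat.format(pattern, person, day);
    }

    public static void main(String[] args) {
        System.out.println(render(ENGLISH_PATTERN, "Anna", "Friday"));
        System.out.println(render(GERMAN_PATTERN, "Anna", "Freitag"));
    }
}
```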

Keyboard shortcuts may also need to be changeable. If a particular key sequence was chosen because the keys are close together, the sequence may need to change for different keyboard layouts. Shortcuts that use a mnemonic letter will change with translation.

Variable elements or user input fields

Fixed elements are not the only portions of the UI that change order or expand with use in other languages. User input areas also need more space for the data that users enter. The key difference here is that making the elements flexible for localization is not enough! English product is often sold all over the world, and the UI included with English product must accommodate input data from all over the world.

The most obvious design area is to make sure input areas are large enough to handle longer input text. This is a fairly straightforward requirement. Consider the Turkish Lira example, currently over TRL 1,500,000 to USD 1.00 - imagine the expansion needed in a currency field.

Another expansion consideration is the rendering of input text - the text area needs to be not only long enough for more and/or longer words, but also tall enough for larger fonts.

More complex than expansion is the consideration of universal data input field structure. For example, if a product allows the user to enter a date in a short format, how should the input area look?

The problem with forcing a date format is that it isn't universal. Even dates themselves aren't universal, although it isn't unusual for products to limit their capacity to the Gregorian calendar. But the mm/dd/yy format so common in the US is used in very few other places, and is very confusing to users elsewhere. It is better to allow a user in a known locale to enter a date in a format commonly used in that locale. If there is no way to know the locale, then the only acceptable universal date format is yyyy-mm-dd. The separator may be changed to a dot, or possibly a slash (/), but the fields must remain in the specified order. So the date input field might look like:

Another possible solution to the format issue is to provide users with a choice of formats, for example in a preferences area. This way you can display the chosen format next to the field, and know exactly how to parse the input date format from the user.
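A minimal sketch of the yyyy-mm-dd approach in Java follows. It assumes the java.time API (Java 8 or later, so newer than the Java release referenced at the end of this article), and the class name UniversalDateInput and its members are hypothetical. The hint string is what would be displayed next to the field; the parser accepts a dash, dot, or slash as the separator but insists on year, month, day order.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

final class UniversalDateInput {
    // Text shown next to the input field so the user knows what to type.
    static final String FORMAT_HINT = "yyyy-mm-dd";

    private static final DateTimeFormatter ISO = DateTimeFormatter.ISO_LOCAL_DATE;

    // Accepts year-month-day with '-', '.' or '/' as the separator.
    // Throws DateTimeParseException for anything else, including mm/dd/yy.
    static LocalDate parse(String input) {
        String normalized = input.trim().replace('.', '-').replace('/', '-');
        return LocalDate.parse(normalized, ISO);
    }

    public static void main(String[] args) {
        System.out.println(parse("2000-10-06"));
        System.out.println(parse("2000.10.06"));
        try {
            parse("10/06/00");
        } catch (DateTimeParseException e) {
            System.out.println("Please enter the date as " + FORMAT_HINT);
        }
    }
}
```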

Of course, there is the story of the Japanese emperor date. One product allowed for modification of the emperor name in the date field, trying to make the product as flexible as possible. The Japanese were offended, because that implied that the emperor would die. The moral of this story: universal design is a tricky business.

Other field types that should be considered very carefully are names, addresses, company information, currency, measurement, numeric values, and any other formatted data. Data formats are usually locale and/or culture specific. Once again, English product is sold all over the world, so just making the arrangement localizable is not enough. If the interface must be customized in order for the product to function properly, then create several locale profiles that can be loaded based on the user's locale. Or, less optimally, make it easily user customizable, and inform the customer that they are expected to customize the product for the locale.

One more very important consideration in layout design is orientation. Consider what will happen in your interface layout for a right-to-left language. If there are controls, they may need to switch sides. Titles, tables, table cells, and similar elements will need to be right aligned. Text on one side of an image will need to move to the other side. Some of these changes may need to be dynamic, basing the orientation of the layout on the locale or data language. One trick to help visualize what a design might look like in a right-to-left layout is to view that design in a mirror. Orientation is often so embedded in a design that suddenly having to accommodate a right-to-left language requires a major code revision. Thinking about it ahead of time will allow you to serve more customers with less effort.
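In Java, one way to base the layout direction on the locale is java.awt.ComponentOrientation, which reports whether a locale's language is written left to right. The sketch below only queries the orientation; the class name LayoutDirection and the labelSide helper are hypothetical, and a real UI would apply the orientation to its containers rather than return a string.

```java
import java.awt.ComponentOrientation;
import java.util.Locale;

final class LayoutDirection {
    // True when the locale's language is laid out right to left.
    static boolean isRightToLeft(Locale locale) {
        return !ComponentOrientation.getOrientation(locale).isLeftToRight();
    }

    // Hypothetical helper: the side of an input field on which its label belongs.
    static String labelSide(Locale locale) {
        return isRightToLeft(locale) ? "right" : "left";
    }

    public static void main(String[] args) {
        System.out.println("en: label on the " + labelSide(Locale.ENGLISH));
        System.out.println("ar: label on the " + labelSide(Locale.forLanguageTag("ar")));
    }
}
```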

Command Line Interface (CLI)

The definition of CLI used here is something that a user can type on a shell command line.

  • The command itself is not usually localizable, nor are fixed parameters.

  • The data provided as arguments to commands and parameters may be in another charset, locale format, or other localized structure. Be prepared for all argument data.

  • Output of the command must be localizable. For example, even fixed data from the UNIX ps command has column headings. Output text needs to be transformed (converted) into the native charset of the command window. Fortunately, this does not usually apply to batch commands.

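If the command parses output from another command, be aware that this output may be localized. Don't rely on English string literals that are not fixed names. Alternatively, force the locale to en_US or C for the execution of that command. The env utility sets the locale for a single command; the following examples use it with other locales to show how much the output of date varies: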

> env LC_ALL=fr date

vendredi, 6 octobre 2000, 18:24:02 PDT

> env LC_ALL=de date

Freitag, 6. Oktober 2000, 18:26:24 Uhr PDT

> env LC_ALL=it date

venerdì, 6 ottobre 2000, 18:27:23 PDT

Note: This locale change only works for the command that follows it. The system and shell environment variables remain unchanged.
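A program that parses the output of another command can apply the same technique when it launches that command. The Java sketch below, with a hypothetical class name FixedLocaleCommand, sets LC_ALL=C in the child process environment only. It assumes a UNIX-like system on which the command honors LC_ALL.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class FixedLocaleCommand {
    // Builds a process whose environment forces the C locale.
    // Only the child's environment is changed, not the parent's.
    static ProcessBuilder inCLocale(String... command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.environment().put("LC_ALL", "C");
        builder.redirectErrorStream(true);
        return builder;
    }

    // Runs the command in the C locale and returns its first line of output.
    static String firstLine(String... command) throws IOException {
        Process process = inCLocale(command).start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.US_ASCII))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // The day and month names are now in a fixed form that is safe to parse.
        System.out.println(firstLine("date"));
    }
}
```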

Documentation

There are two main components of internationalization in documentation:

  1. Including international configuration information on the product in the documentation.

  2. Structuring the documentation for ease of localization.

The first component is all too often neglected. Engineers sometimes know that certain changes need to be made in various parts of the product in order to correctly handle international data. For customers, it's not a big problem to make a few extra configuration changes if that's all that's needed to properly process their data, but this information is usually not available in the mainstream documentation. If it's available at all, it's in the README, or on a Web site somewhere, or in a separate section in a different book, etc. The best location for information on configuring a particular component for international data is the document addressing the configuration of that component. Have a subsection heading of International or something equally obvious, but have it located with the rest of the component documentation. In addition, all the international configuration information for all the components can be gathered together in a separate chapter called International or something similar. Both should be indexed under International. If engineering has not provided the information, ask them for it. Make sure special configuration needs for international are documented for every component.

The documentation itself should be in clear, straightforward, standard English. This not only eases the translator's task, resulting in a better translation, but also ensures that the English product sold into other countries is easily understood. Have a glossary documented, and make sure all the writers use the same glossary. Be as consistent with the UI as possible in terminology.

Be careful in choosing the software for creating documentation. If the style sheet uses the latest features from the latest version, then the translated versions may have to use a different style sheet. Often word processing software support in other languages lags behind the English version. Check with your localization representative for input on doc software.

Don't change text from version to version that doesn't need to change. In other words, even if the wording could stand some improvement, if the meaning is the same, don't change it. If text is changed in this way, usually to fix typos and grammatical problems, make sure it is clearly marked as not needing translation. Work with the localization team on the best way to manage this type of text.

Documentation can be the most expensive portion of the entire localization cost, and yet the product is not purchased because of the documentation. It is important to minimize localization costs wherever possible.

Myths and misconceptions

Myth #1: Making UI elements localizable is enough.

Many folks believe that making a UI element localizable is enough for an international product. If it were, it would mean that the product is modified for every country where it is sold. This would include Canada, the UK, Brazil, New Zealand, Greece, and dozens more. Obviously no company localizes products for every single country it sells into, or localization groups would be much, much larger and their budgets would be significantly bigger. Instead, companies sell the English product all over the world, with the exception of a few large markets where localized product is sold. Even the localized products are sold into multiple countries, for example French products are sold in Canada, Belgium, Switzerland, and parts of Africa. Yet no one expects that there are multiple French versions.

For this reason, the locale of the user should be detected or determined in some way, even if the user must be asked explicitly. Numeric formats, text formats, dates, and any other formatted data should appear in a style that is used in that locale. Note that values must be handled carefully. For example, if someone in Germany asks for a price, and the price is stored in US dollars, then there are two possible methods of conveying the value to that user.

  1. The currency unit displayed is US dollars, but the numeric format of the actual value is that of Germany:

USD 250 467,10

  2. The value is converted from US dollars into German Marks, and the value is displayed with the German Mark currency symbol, in a German numeric format:

DEM 528 450,47

Even with the value expressed in US dollars, the thousands separator is a blank, and the decimal is a comma. If the US thousands separator, the comma, is used, a German user might well be confused about the amount.

Formats should be locale sensitive, but value units should only change if there is a conversion.
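The first method, keeping the unit and changing only the numeric format, might be sketched in Java as follows. The class name PriceDisplay is hypothetical. The separators come from the locale data of the Java runtime in use, so the German output may not match the illustrative figures above character for character.

```java
import java.math.BigDecimal;
import java.text.NumberFormat;
import java.util.Locale;

final class PriceDisplay {
    // Formats the number for the user's locale but keeps the stored currency code.
    // No conversion takes place, so the unit must not change.
    static String format(String currencyCode, BigDecimal amount, Locale userLocale) {
        NumberFormat number = NumberFormat.getNumberInstance(userLocale);
        number.setMinimumFractionDigits(2);
        number.setMaximumFractionDigits(2);
        return currencyCode + " " + number.format(amount);
    }

    public static void main(String[] args) {
        BigDecimal price = new BigDecimal("250467.10");
        System.out.println(format("USD", price, Locale.US));
        System.out.println(format("USD", price, Locale.GERMANY));
    }
}
```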

Graphics are part of the universal product approach. They are so expensive to localize that no one usually bothers unless there's embedded text (which should be avoided). Graphical images should be universally appropriate.

Myth #2: Translators choose the best phrase in the target language.

Many folks assume the people translating the product will always choose the best word for the context. The truth is, localizations run on tight schedules and low budgets. Translators usually translate text directly in message catalogs, rather than as they appear on the screen. They are not well versed in product functionality, and there is little time and expertise to perform thorough linguistic checks of the text in the context of the running software. They are usually paid by the word, so volume is their watchword. Imagine what happens to the translation in this situation.

Myth #3: The code is in Java, and therefore it's internationalized.

Long before the advent of Java, there was internationalized code. How on Earth did programmers manage this? The answer is that internationalization was always possible; it just took more effort. Java is written to make internationalization much easier. However, it is entirely possible to write Java code that is not internationalized. In fact, it's pretty easy to write Java code that only supports English in the US. So, even Java must be carefully coded to support international data.
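One small illustration, offered as a sketch rather than a catalog of pitfalls, is case conversion. String.toUpperCase() with no argument uses the default locale of the Java runtime, which works as expected for English but gives a different result under a Turkish locale, where the uppercase of i is a dotted capital I. The class name CaseMapping is hypothetical.

```java
import java.util.Locale;

final class CaseMapping {
    // Depends on the JVM default locale: the result changes from machine to machine.
    static String naiveUpper(String s) {
        return s.toUpperCase();
    }

    // States the locale explicitly: use the user's locale for display text,
    // and Locale.ROOT for internal keys and identifiers.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println(upper("title", Locale.ROOT)); // TITLE
        System.out.println(upper("title", turkish));     // T\u0130TLE, with a dotted capital I
    }
}
```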

Myth #4: The product has full Unicode support, and therefore is internationalized.

Like the Java myth, so goes the Unicode myth. It is true that, like Java, Unicode support can make handling international data much easier. But once again, code must be written to manage data in different languages, in different locales, and, for the time being, in different charsets. Ha ha (did that translate well?).
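As a small sketch of the charset point: even when the program works in Unicode internally, data arriving from outside still has to be decoded with the charset it was written in. The Java fragment below, with a hypothetical class name CharsetDecoding, decodes the same byte under two different assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDecoding {
    // Decodes external bytes using the charset they were actually written in.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xE9}; // e with acute accent in ISO-8859-1
        // Correct charset: one character, U+00E9.
        System.out.println(decode(data, StandardCharsets.ISO_8859_1));
        // Wrong charset: the byte is malformed UTF-8 and becomes U+FFFD.
        System.out.println(decode(data, StandardCharsets.UTF_8));
    }
}
```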

Myth #5: Administrative interfaces and log messages don't need to be internationalized.

This may come as a complete surprise, but administrators are people too. In some markets, the admin interface must be localized. What was done in the past in localization is not necessarily what will be done in the future. Whether or not a product gets translated is a business decision, not a technical decision. Engineers need to enable business folks to make the decisions necessary to sell as much product as possible. This in turn makes the company more profitable, which raises the stock price (well, sometimes) and everyone benefits. Localization needs to be enabled throughout the product.

Log messages fall into a special category of messages. They are usually not localized directly, but may in fact be indirectly localized via a log viewer. When this is available, log messages need to be in a separate resource file in order to be localized. For this reason, log messages need to be localizable, but they need to be separated from other messages so that localization knows whether to translate them or not. If a message goes to both a log and the UI, and log messages are restricted to English, then the message going to the UI should be retrieved from the localized resource file, and the message going to the log should come from the English resource file. English files should be shipped with all localized products.
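The case of a message that goes to both the log and the UI might be sketched in Java as follows. The class name DualMessages, the message key, and both message texts are hypothetical, and the two bundles are built in memory only to keep the sketch self-contained; a real product would ship them as separate resource files and load them with ResourceBundle.getBundle.

```java
import java.text.MessageFormat;
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

final class DualMessages {
    // Wraps key/value pairs in a bundle; stands in for a resource file on disk.
    private static ResourceBundle bundleOf(final Object[][] contents) {
        return new ListResourceBundle() {
            @Override
            protected Object[][] getContents() {
                return contents;
            }
        };
    }

    // Hypothetical English and French resources for the same message key.
    static final ResourceBundle ENGLISH =
            bundleOf(new Object[][] {{"disk.full", "Disk {0} is full"}});
    static final ResourceBundle FRENCH =
            bundleOf(new Object[][] {{"disk.full", "Le disque {0} est plein"}});

    // The UI text comes from the localized resource for the user's locale.
    static String uiMessage(Locale userLocale, String key, Object... args) {
        ResourceBundle bundle = "fr".equals(userLocale.getLanguage()) ? FRENCH : ENGLISH;
        return new MessageFormat(bundle.getString(key), userLocale).format(args);
    }

    // The log text always comes from the English resource.
    static String logMessage(String key, Object... args) {
        return new MessageFormat(ENGLISH.getString(key), Locale.ENGLISH).format(args);
    }

    public static void main(String[] args) {
        System.out.println(uiMessage(Locale.FRANCE, "disk.full", "/var"));
        System.out.println(logMessage("disk.full", "/var"));
    }
}
```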

More information

The information on internationalization in architecture and design is somewhat fragmented, with some of it appearing in various Web pages, some in books, and some in the heads of engineers. Try joining the public i18n discussion list i18n-prog@yahoogroups.com. Sign up at http://groups.yahoo.com/group/i18n-prog or send a blank email to:

i18n-prog-subscribe@yahoogroups.com.

The Sun Global Application Developer Corner has lots of information on internationalization, including the Sun Internationalization Taxonomy Document, a matrix form and description that is designed to aid in assessing product i18n status.

Sun's Software Globalization Resource Site

For information on Java 1.4.2 internationalization, see the Web site:

http://java.sun.com/j2se/1.4.2/docs/guide/intl/

For information on the International Components for Unicode (ICU) library, see:

http://oss.software.ibm.com/icu

Andrea Vine is a software internationalization architect at Sun Microsystems. She can be reached at andrea.vine@sun.com.

Related Links