
Multilingual Websites - The Questions To Ask

This article is mainly targeted at less technical web business owners or web managers. It aims to provide a list of questions you should ask your technical staff - or that technical staff should be asking themselves.
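This article poses the following questions:
  • Which Character Encoding[s] Do We Use?
  • Have We Declared The Content Language?
  • Have We Translated The Metadata?
  • Are We Correctly Handling Form Input?
  • Can Our Databases Handle All Languages?

Truly localizing content for a specific language or market can be a subtly complex task and is beyond the scope of this article. This article focuses instead on the more measurable facets of deploying multilingual websites.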
Which Character Encoding[s] Do We Use?
Your browser can usually tell you the character encoding that a particular page is using:
Mozilla: View -> Character Encoding
Internet Explorer: View -> Encoding
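
The same information is usually declared in the page's Content-Type HTTP header, so it can also be checked from a script. The following is a minimal Python sketch, not a definitive tool; the function name charset_from_content_type and the sample header values are assumptions made for illustration.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset declared in a Content-Type header, or None."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header does not declare one.
    return message.get_content_charset()

# Sample header values, invented for this example.
print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

A result of None means the header declares no encoding, in which case the browser falls back on a declaration inside the page or on guessing.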


Traditionally, different languages have had different character encodings associated with them. For example, Japanese content was often encoded with Shift_JIS, and Simplified Chinese content with GB2312. These language-specific encodings are sometimes referred to as "native" encodings.

However, with UTF-8 [a Unicode encoding], the native encodings are no longer required. UTF-8 can represent nearly every written language, and it is supported by all modern browsers.

UTF-8 greatly simplifies the task of hosting, maintaining, translating and testing multilingual web content.
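
To make the difference concrete, here is a small Python sketch that tries to encode short samples in two native encodings and in UTF-8. The sample strings and the helper name can_encode are assumptions chosen for illustration only.

```python
# Sample text in three languages (illustrative only).
samples = {
    "Japanese": "日本語",
    "Simplified Chinese": "简体中文",
    "Korean": "한국어",
}

def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# For each encoding, list the sample languages it can represent.
for encoding in ("shift_jis", "gb2312", "utf-8"):
    supported = [name for name, text in samples.items()
                 if can_encode(text, encoding)]
    print(encoding + ": " + ", ".join(supported))
```

UTF-8 should list all three languages, while each native encoding covers only a subset. Neither Shift_JIS nor GB2312 can represent the Korean sample, for instance.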

If your pages are not using UTF-8, then your technical staff should be able to justify it.

One legitimate reason for not adopting UTF-8 is the need to support legacy backend systems, which may use native encodings.
However, the most common reason is simply a lack of understanding of Unicode and UTF-8.
 
Have We Declared The Content Language?
If you have a page written in, say, Korean, then your HTML source code should declare that:

<html lang="ko">

or for Simplified Chinese:
<html lang="zh-CN">

This can be checked by viewing the page source.
It is something that is frequently overlooked by webmasters and translators.

Why bother?
  • Browsers use this information to choose the correct fonts.
  • It's good practice: screen readers and other tools also rely on it to identify the language of the page.
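
Viewing the source works for a single page; for many pages the check can be scripted. Below is a minimal Python sketch using the standard html.parser module. The class name LangFinder, the function name declared_language and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser

class LangFinder(HTMLParser):
    """Record the lang attribute of the first <html> tag in a page."""

    def __init__(self):
        super().__init__()
        self.seen_html = False
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lower case.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            self.lang = dict(attrs).get("lang")

def declared_language(page_source):
    """Return the declared content language, or None if it is missing."""
    finder = LangFinder()
    finder.feed(page_source)
    return finder.lang

# Sample markup, invented for this example.
print(declared_language('<html lang="zh-CN"><body></body></html>'))  # zh-CN
print(declared_language('<html><body></body></html>'))               # None
```

A result of None flags a page whose content language has not been declared.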
 
Have We Translated The Metadata?
When translating content, you will probably want search engines to properly index that content.

This requires that the relevant metadata is translated appropriately - especially keywords and description.

You can spot check for this by viewing the HTML source code.
Note that you may not want some parts of the keywords and description metadata translated, such as product names.
Translation vendors have improved at translating metadata, but they still sometimes overlook it entirely.
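
The spot check can be partly automated by comparing the metadata of an original page with that of its translation. The Python sketch below is a rough aid under stated assumptions: the names MetaFinder, page_metadata and untranslated_fields, and both sample pages, are invented for illustration. A field reported here may still be correct, for example when it contains only product names.

```python
from html.parser import HTMLParser

class MetaFinder(HTMLParser):
    """Collect the keywords and description meta tags of a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("keywords", "description"):
            self.meta[name] = attributes.get("content") or ""

def page_metadata(page_source):
    """Return a dict of the keywords/description metadata found in a page."""
    finder = MetaFinder()
    finder.feed(page_source)
    return finder.meta

def untranslated_fields(original_source, translated_source):
    """Return metadata fields whose content is identical in both pages."""
    original = page_metadata(original_source)
    translated = page_metadata(translated_source)
    return sorted(name for name in original
                  if translated.get(name) == original[name])

# Sample pages, invented for this example.
english_page = ('<head><meta name="keywords" content="Solaris, servers">'
                '<meta name="description" content="Fast servers"></head>')
korean_page = ('<head><meta name="keywords" content="Solaris, 서버">'
               '<meta name="description" content="Fast servers"></head>')

print(untranslated_fields(english_page, korean_page))  # ['description']
```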
 
Are We Correctly Handling Form Input?
In form text input fields, users can enter content in whichever language they want.
Often, form content is redisplayed to users - for example in forums or searches.

Your form handlers [the programs that process form input] should be able to handle input in any language. At the very least, the data should not get corrupted.

Most forms can be simply tested to see how they handle non-ASCII characters:
  1. Go to, say, kr.sun.com.
    Ensure that you can properly view all the characters. [If you cannot, then it means you do not have adequate font support on your computer, and this test will be meaningless.]
  2. Copy any single Korean word from this page and paste it into a text field in the form you wish to test.
  3. Click 'Submit' on your test form.
    Result: Hopefully the form handler redisplays your input, or generates an email with the form contents, so that you can check whether the content is displayed properly.
    If it is not displayed properly, then something was corrupted during form processing and/or data storage/retrieval.
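
The kind of corruption this test looks for can be reproduced in a few lines. The Python sketch below assumes the form page is served as UTF-8 and simulates two hypothetical faulty handlers; the variable names and the sample word are illustrative only.

```python
korean_word = "한국어"  # sample input; any non-ASCII text would do

# Bytes the browser submits when the form page is UTF-8.
submitted_bytes = korean_word.encode("utf-8")

# A handler that decodes with the matching encoding recovers the text.
handled_correctly = submitted_bytes.decode("utf-8")

# A handler that wrongly assumes a single-byte Western encoding turns
# each Korean character into three unrelated characters.
handled_wrongly = submitted_bytes.decode("latin-1")

# A storage layer limited to ASCII replaces each character with "?".
stored_as_ascii = korean_word.encode("ascii", errors="replace").decode("ascii")

print(handled_correctly == korean_word)  # True
print(handled_wrongly == korean_word)    # False
print(stored_as_ascii)                   # ???
```

Garbled accented characters on redisplay usually point to an encoding mismatch somewhere in processing, while question marks usually point to a component that cannot represent the characters at all.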
 
Can Our Databases Handle All Languages?
In a multilingual environment, a database should use an encoding that can handle all of the languages it is required to store. The obvious choice is UTF-8.

Note that databases typically define the width of character fields in bytes, not characters.
In UTF-8, 1,000 English characters = 1,000 bytes, but 1,000 Asian characters typically take 3,000 bytes and could take as much as 4,000 bytes.
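
The arithmetic can be verified directly. This Python sketch measures the UTF-8 size of three 1,000-character strings; the helper names utf8_byte_length and fits_in_column and the 1,000-byte column width are assumptions for illustration.

```python
def utf8_byte_length(text):
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

def fits_in_column(text, max_bytes):
    """Check text against a column whose width is defined in bytes."""
    return utf8_byte_length(text) <= max_bytes

english = "a" * 1000
korean = "한" * 1000             # a common Korean character: 3 bytes each
rare_cjk = "\U00020000" * 1000   # a rare CJK character (U+20000): 4 bytes each

print(utf8_byte_length(english))      # 1000
print(utf8_byte_length(korean))       # 3000
print(utf8_byte_length(rare_cjk))     # 4000
print(fits_in_column(korean, 1000))   # False
```

A column sized for 1,000 English characters therefore cannot hold the same number of Korean characters unless its width is increased or defined in characters.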

For a detailed technical discussion on maintaining character encoding integrity between the client, the application and the database, please read this article.
 