
Sun Software Product Internationalization Taxonomy

 
 

4.3.1 Writing System Negotiations, Defaults, and Selection


Description

Software processes data. When the data is text, the most basic task for the software is to identify the writing system to which the text belongs. In other words, a program must know the character set of the textual data in order to interpret the bytes correctly. In the past, programs assumed that data was in one particular character encoding scheme. A global software product cannot make this assumption. Programs must find out which character encoding scheme is being used, and from there, the writing system can be determined. Given no other information, it is not possible to reliably determine the character set by programmatically inspecting the byte stream.
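As an illustration of why the charset must be known, the following Python sketch decodes the same two bytes under different charsets. The byte values and the helper name `decodes_as` are chosen for this example only and do not come from the taxonomy.

```python
# The same byte sequence means different text under different charsets.
raw = b"\xe9\xe8"

as_latin1 = raw.decode("iso-8859-1")  # two accented Latin letters
as_greek = raw.decode("iso-8859-7")   # two Greek letters


def decodes_as(data, charset):
    """Return True if data is a valid byte sequence in the given charset."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


print(repr(as_latin1), repr(as_greek), decodes_as(raw, "utf-8"))
```

Both single-byte decodings succeed, so the bytes alone give no hint as to which one the author intended. The bytes happen to be invalid as UTF-8, which rules that charset out but does not identify the right one.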
For the most part, knowing the character encoding scheme is enough for rendering purposes. In some cases, the code points actually used must also be examined to render the text correctly. For example, if the UTF-8 character encoding scheme is used to represent Arabic text, the rendering process must first know that the text is in UTF-8, and then check the code points used in the text. Once it finds that the code points are in the Arabic range, a special Arabic rendering process must be called, because each Arabic letter can take several shapes depending on its position in the word.
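The two-step check described above (decode first, then inspect the code points) might be sketched in Python as follows. The function names are hypothetical, and the range test covers only the basic Arabic block (U+0600 to U+06FF); a real renderer would also consider the supplementary Arabic blocks and would pass the text to a shaping engine.

```python
def contains_arabic(text):
    """Return True if any character lies in the basic Arabic block."""
    return any("\u0600" <= ch <= "\u06ff" for ch in text)


def needs_arabic_shaping(raw, charset="utf-8"):
    """Decode the bytes with the known charset, then inspect the code points."""
    text = raw.decode(charset)
    return contains_arabic(text)


# Usage: the charset must be known before the code points can be checked.
arabic_bytes = "\u0633\u0644\u0627\u0645".encode("utf-8")
print(needs_arabic_shaping(arabic_bytes))  # Arabic letters present
print(needs_arabic_shaping(b"hello"))      # Latin letters only
```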
The method for determining which character encoding scheme or charset is being used differs from situation to situation. For example, in HTML forms, the charset of the input data is the same as that set for the page; however, if the charset of the page is not explicitly declared in an HTTP header or META tag, the data might be in the browser's default charset. For files on a Solaris machine, the data can be in any charset supported by Solaris. In databases, the charset is usually defined. The best situation is for the client to force the data coming from the user into a particular charset and then convert it to Unicode. The server can then use Unicode internally; however, this is not always possible.
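For the HTML form case, a server might resolve the charset roughly as in this Python sketch: read the charset parameter from the Content-Type value, fall back to a default when it is missing, and convert the bytes to Unicode. The function names and the ISO-8859-1 fallback are assumptions made for the example; the appropriate fallback depends on the deployment.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value.

    The default is an assumption for this sketch, used when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)


def to_unicode(raw, header_value):
    """Convert submitted bytes to a Unicode string for internal use."""
    return raw.decode(charset_from_content_type(header_value))


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```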
Both Solaris and Windows locales have associated charsets and can be queried for the currently active charset. Many interfaces depend on the locale charset and use a system call to find out which one is active.
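In Python, for instance, the active locale charset can be queried through the standard `locale` module. This is only a sketch; the wrapper name `active_locale_charset` is hypothetical, and the value returned depends on the platform and the user's environment.

```python
import locale


def active_locale_charset():
    """Return the charset name associated with the current locale settings."""
    # Passing False avoids changing the process locale as a side effect.
    return locale.getpreferredencoding(False)


print(active_locale_charset())
```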

Command Line Interface

Commands take in textual data as parameters and return textual data. The charset of this data can be the system locale charset or one specified in another command parameter. Rendering is handled by the terminal window, which determines the necessary font glyph associations. In some situations, text is not expected to be rendered correctly, for example, text that is inserted into a database and normally retrieved through a graphical interface. A user might use a terminal that does not render that particular charset, but the bytes can still be entered.
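A command that does not rely solely on the system locale charset could accept the charset as its own parameter, as in this Python sketch. The `--charset` option name and the helper names are hypothetical; when the option is omitted, the sketch falls back to the locale charset.

```python
import argparse
import locale


def parse_args(argv):
    """Parse a command line that allows the data charset to be specified."""
    parser = argparse.ArgumentParser(description="Process a text file.")
    parser.add_argument("path", help="file containing the textual data")
    parser.add_argument(
        "--charset",
        default=locale.getpreferredencoding(False),
        help="charset of the data (default: system locale charset)",
    )
    return parser.parse_args(argv)


def decode_input(raw, args):
    """Interpret the raw bytes using the charset chosen on the command line."""
    return raw.decode(args.charset)


args = parse_args(["names.txt", "--charset", "iso-8859-1"])
print(args.charset, decode_input(b"\xe9", args))
```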

Character Interface

The rendering system must handle the charset used in the character interface, so it must be told which charset that is. In addition, the charset of the input fields in the interface must be determined.

Graphical Interface

Graphical interfaces usually have text that can be displayed in some charset, as well as text input areas. Several mechanisms in graphical interfaces allow for explicit charset identification. Rendering is much more sophisticated due to the more flexible nature of graphics. A variety of font options are available. These can be set by the application or by the user.

Application Protocols

If an application protocol includes textual data, then either the protocol specifies what charset the data is in, such as in LDAPv3, or the protocol contains a charset parameter, such as in HTTP.
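The two protocol styles can be contrasted in a short Python sketch. The function names are hypothetical, and the header line is a simplified stand-in for a real HTTP message rather than a complete implementation.

```python
def encode_fixed(text):
    """Protocol that mandates a single charset (UTF-8, as in LDAPv3)."""
    return text.encode("utf-8")


def encode_labeled(text, charset):
    """Protocol that carries a charset parameter alongside the data.

    Returns a (header, body) pair; the header tells the receiver how to
    interpret the body bytes.
    """
    body = text.encode(charset)
    header = f"Content-Type: text/plain; charset={charset}"
    return header, body


print(encode_fixed("\u00e9"))
print(encode_labeled("\u00e9", "iso-8859-1"))
```

In the first style the consumer converts its data to the mandated charset before sending; in the second, the consumer keeps its own charset but must label it.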

Storage and Interchange

Storage and interchange are similar to application protocols; either they are defined as having text in a particular charset, or they allow for charset specification.

Application Programming Interfaces (APIs)

APIs can use the system charset, require a particular charset, or allow a parameter to specify the charset. Which one is used depends on the functionality provided.
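The three API styles might look like this in Python. All three function names are hypothetical and exist only to contrast the options.

```python
import locale


def decode_system(data):
    """Style 1: rely on the system (locale) charset."""
    return data.decode(locale.getpreferredencoding(False))


def decode_utf8(data):
    """Style 2: require one particular charset, here UTF-8."""
    return data.decode("utf-8")


def decode_with(data, charset):
    """Style 3: let the caller specify the charset as a parameter."""
    return data.decode(charset)


print(decode_utf8(b"\xc3\xa9"), decode_with(b"\xe9", "iso-8859-1"))
```

The first style gives results that vary with the environment, which is why a provider needs to document it explicitly.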

Requirements for Compliance

In all cases, the charset of the text must be unambiguous.

Command Line Interface

Providers must state the method of charset determination. They should provide as much flexibility as functionally needed, not always relying on the system charset.
Consumers must be aware of the provider charset determination method and supply a charset parameter, where relevant.

Character Interface

See "Command Line Interface."

Graphical Interface

Providers must allow for charset specification by the consumer.
Consumers must supply providers with the text charset.

Application Protocols

Providers must either allow for a charset value or clearly state the assumed charset. If only a single charset is allowed, an encoding of Unicode is recommended.
Consumers must supply the charset value or convert the data into the assumed charset before using the protocol.

Storage and Interchange

See "Application Protocols."

Application Programming Interfaces

Providers must state the method of charset determination. They should provide as much flexibility as functionally needed, not always relying on the system charset.
Consumers must be aware of the provider charset determination method and supply a charset parameter, where relevant.