4.3.1 Writing System Negotiations, Defaults, and Selection
Description
Software processes data. When that data is text, the most basic task for the software is to identify the writing system
to which the text belongs. In other words, a program must know the character set of the textual data in order to interpret
the bytes properly. In the past, programs assumed that data was in one particular character encoding scheme. A global software
product cannot make this assumption. Programs must find out which character encoding scheme is in use, and from there the
writing system can be determined. Given no other information, a program cannot reliably determine the character set by
inspecting the byte stream alone, because the same byte values are valid in many charsets.
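The ambiguity can be seen in a minimal Python sketch. The byte value and the two charsets are chosen only for illustration:
the same single byte decodes without error under both charsets and yields two different characters, so inspection alone
cannot say which one was intended.

```python
# One byte, two plausible interpretations. Neither decode raises an error,
# so the byte stream itself gives no hint about which charset is correct.
raw = b"\xe9"

as_latin1 = raw.decode("iso-8859-1")  # Western European reading: U+00E9
as_cp1251 = raw.decode("cp1251")      # Cyrillic reading: U+0439

print(repr(as_latin1), repr(as_cp1251))
```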
For the most part, knowing the character encoding scheme is enough for rendering purposes. In some cases, the renderer must
also examine the code points used in order to render the text correctly. For example, if Arabic text is represented in UTF-8,
the rendering process must first know that the text is in UTF-8 and then check the code points actually used in the text.
Once it finds code points in the Arabic range, it must call a special Arabic rendering process, because each Arabic letter
takes one of several shapes depending on its position in the word.
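The two-step check described above might be sketched as follows. The function name needs_arabic_shaping is hypothetical,
and the sketch assumes that testing the basic Arabic block (U+0600 through U+06FF) is sufficient; a real renderer would
also consider the other Arabic blocks in Unicode.

```python
# Basic Arabic block in Unicode. Other Arabic blocks exist and are
# deliberately ignored in this sketch.
ARABIC_START, ARABIC_END = 0x0600, 0x06FF


def needs_arabic_shaping(data: bytes, charset: str = "utf-8") -> bool:
    """Step 1: decode with the known charset. Step 2: inspect the code points."""
    text = data.decode(charset)
    return any(ARABIC_START <= ord(ch) <= ARABIC_END for ch in text)


# Four Arabic letters, written as escapes to keep the source ASCII-only.
sample = "\u0633\u0644\u0627\u0645".encode("utf-8")
print(needs_arabic_shaping(sample))
```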
The method for determining which character encoding scheme, or charset, is in use differs by situation. For example, in
HTML forms, the charset of the input data is the same as that of the page; however, if the page charset is not explicitly
declared in an HTTP header or META tag, the data might be in the browser's default charset. For files on a Solaris machine,
the data can be in any charset supported by Solaris. In databases, the charset is usually defined. The best situation is for
the client to force the data coming from the user into a particular charset and then convert it to Unicode, so that the
server can use Unicode internally; however, this is not always possible.
Both Solaris and Windows locales have associated charsets and can be queried for the currently active charset. Many
interfaces depend on the locale charset and use a system call to find out which one is active.
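As an illustration of such a query, the following Python sketch asks the standard locale module for the charset
associated with the current locale. The wrapper name active_charset is hypothetical, and the value returned depends
entirely on the platform and the user's settings.

```python
import locale


def active_charset() -> str:
    """Return the charset name associated with the current locale settings."""
    # Passing False reports the preference without changing the process locale.
    return locale.getpreferredencoding(False)


print(active_charset())
```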
Command Line Interface
Commands take in textual data as parameters and return textual data. The charset of this data can be the system
locale charset or one specified in another command parameter. Rendering is handled by the terminal window, which determines the
necessary font glyph associations. In some situations, text is not expected to be rendered correctly, for example, text that
is inserted into a database and normally retrieved through a graphical interface. A user might use a terminal that does
not render that particular charset, but the bytes can still be entered.
Character Interface
The rendering system must handle the charset used in the character interface, and so must be told which charset that is. In
addition, the charset of text entered in the interface's input fields must be determined.
Graphical Interface
Graphical interfaces usually have text that can be displayed in some charset, as well as text input areas. Several
mechanisms in graphical interfaces allow for explicit charset identification. Rendering is much more sophisticated due to the more
flexible nature of graphics. A variety of font options are available. These can be set by the application or by the
user.
Application Protocols
If an application protocol includes textual data, then either the protocol specifies what charset the data is in, such
as in LDAPv3, or the protocol contains a charset parameter, such as in HTTP.
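For the HTTP case, a consumer might extract the charset parameter from a Content-Type header value roughly as follows.
This is a sketch only: the function name charset_from_content_type is hypothetical, and it reuses the MIME header parser
from Python's standard email package rather than a dedicated HTTP library.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name and returns None when the
    # header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

When the result is None, the charset is not declared, and the receiver has to fall back on whatever default the protocol
or application defines.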
Storage and Interchange
Storage and interchange are similar to application protocols; either they are defined as having text in a particular
charset, or they allow for charset specification.
Application Programming Interfaces (APIs)
APIs can use the system charset, require a particular charset, or allow a parameter to specify the charset. Which one
is used depends on the functionality provided.
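The three options can be contrasted in a short Python sketch. All three function names are hypothetical, and UTF-8 is
used as the required charset in the second option purely as an example.

```python
import locale
from typing import Optional


def decode_system(data: bytes) -> str:
    """Option 1: always use the charset of the current locale."""
    return data.decode(locale.getpreferredencoding(False))


def decode_utf8_only(data: bytes) -> str:
    """Option 2: require one particular charset (UTF-8 in this sketch)."""
    return data.decode("utf-8")


def decode_with(data: bytes, charset: Optional[str] = None) -> str:
    """Option 3: accept a charset parameter, falling back to option 1."""
    if charset is None:
        return decode_system(data)
    return data.decode(charset)
```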
Requirements for Compliance
In all cases, the charset of the text must be unambiguous.
Command Line Interface
Providers must state the method of charset determination. They should provide as much flexibility as functionally
needed, not always relying on the system charset.
Consumers must be aware of the provider charset determination method and supply a charset parameter, where
relevant.
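One way a command provider could meet this requirement is to accept an optional charset parameter and fall back to the
locale charset only when the parameter is omitted. The sketch below uses Python's argparse; the option name --charset and
the positional argument text_file are hypothetical.

```python
import argparse
import locale


def parse_args(argv):
    """Parse arguments for a hypothetical text-consuming command."""
    parser = argparse.ArgumentParser(description="Hypothetical text-consuming command")
    parser.add_argument("text_file", help="file containing the input text")
    parser.add_argument(
        "--charset",
        default=None,
        help="charset of text_file; defaults to the charset of the current locale",
    )
    args = parser.parse_args(argv)
    if args.charset is None:
        # Stated determination method: no parameter means the locale charset.
        args.charset = locale.getpreferredencoding(False)
    return args


print(parse_args(["input.txt", "--charset", "shift_jis"]).charset)
```

The help text doubles as the provider's statement of how the charset is determined, which is what a consumer needs in
order to decide whether to pass the parameter.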
Character Interface
See "Command Line Interface."
Graphical Interface
Providers must allow for charset specification by the consumer.
Consumers must supply providers with the text charset.
Application Protocols
Providers must either allow for a charset value or clearly state the assumed charset. If only a single charset is allowed,
an encoding of Unicode is recommended.
Consumers must supply the charset value or convert the data into the assumed charset before using the
protocol.
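The conversion step for a consumer might look like the following sketch, which assumes, for illustration only, that the
protocol's stated charset is UTF-8. The function name to_assumed_charset is hypothetical.

```python
def to_assumed_charset(data: bytes, source_charset: str,
                       assumed_charset: str = "utf-8") -> bytes:
    """Transcode data from its known source charset to the protocol's charset."""
    text = data.decode(source_charset)   # bytes in the source charset -> Unicode text
    return text.encode(assumed_charset)  # Unicode text -> bytes the protocol expects


# A single ISO-8859-1 byte becomes a two-byte UTF-8 sequence.
print(to_assumed_charset(b"\xe9", "iso-8859-1"))
```

Both steps raise an exception if the data does not match the declared source charset or cannot be represented in the
target charset, so a mismatch surfaces before the data reaches the protocol.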
Storage and Interchange
See "Application Protocols."
Application Programming Interfaces
Providers must state the method of charset determination. They should provide as much flexibility as functionally
needed, not always relying on the system charset.
Consumers must be aware of the provider charset determination method and supply a charset parameter, where
relevant.