 

Sun Software Product Internationalization Taxonomy

 
 

4.3.2.2 Strings (Encoding Methods and Transcoding)


Description

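Strings are the primary text element processed by software. Programs need to determine string boundaries, calculate string length, compare strings, and move strings from one place to another.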
To Determine String Boundaries
Finding the beginning of a string is fairly straightforward; essentially the first byte in a given parameter, variable, or object can be safely assumed to be the start. However, determining where the string ends is much more difficult. You can do this in any of the following ways:
  • Look for a null (x'00') byte - This is effective in most cases, but there are certain Unicode character encoding schemes (UTF-16, UCS-4, UCS-2) that contain nulls as part of a character.


  • Use a length value - A pre-determined length value can be provided as an additional parameter, dictating the number of bytes in the string.


  • Look for a particular delimiter - While similar to inspecting for a null byte, this tends to be more difficult to implement. If the delimiter is a single byte value outside of the range x'00'-x'7F', it can appear as part of a multibyte character in many character encoding schemes. Even a delimiter in the range x'20'-x'7F' is embedded inside characters in several 7-bit encoding schemes and the entire 7-bit range x'00'-x'7F' is used to make up multibyte Unicode characters in UTF-16, UCS-4, and UCS-2.


  • Find a language-related punctuation mark or whitespace - For certain text processing products, this method of boundary determination is basic functionality. However, it requires a tremendous amount of supporting information, including language, charset, punctuation mark byte sequences per language/charset combination, and more. Programs must also handle the parsing of textual data in different charsets.
If textual data is restricted to a certain charset, then it is possible to look for a particular delimiter.
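The pitfalls of byte-oriented terminators and delimiters can be checked directly. The following is a minimal sketch, not part of the original taxonomy; the helper name `contains_byte` and the sample characters are assumptions chosen for illustration.

```python
def contains_byte(text: str, encoding: str, byte_value: int) -> bool:
    """Return True if byte_value occurs anywhere in the encoded form of text."""
    return byte_value in text.encode(encoding)

# "A" in UTF-16 (little-endian) is the bytes 41 00: a null inside a character.
null_in_utf16 = contains_byte("A", "utf-16-le", 0x00)

# The same text in UTF-8 contains no null byte.
null_in_utf8 = contains_byte("A", "utf-8", 0x00)

# Hiragana A (U+3042) in EUC-JP is A4 A2, so a delimiter byte A4
# (outside x'00'-x'7F') would be found inside the character.
high_byte_in_eucjp = contains_byte("\u3042", "euc_jp", 0xA4)

# Katakana SO (U+30BD) in Shift_JIS is 83 5C; the second byte equals the
# ASCII backslash, so a backslash delimiter would split the character.
backslash_in_sjis = contains_byte("\u30bd", "shift_jis", 0x5C)

print(null_in_utf16, null_in_utf8, high_byte_in_eucjp, backslash_in_sjis)
```

A scan for a delimiter byte is therefore only safe when the charset is known and guarantees that the delimiter value never occurs inside a multibyte character.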
To Calculate String Length
String length can mean two different things: physical length in bytes and conceptual length in characters. Both concepts of length are important to string handling. Each application must determine which length it needs at any given point in the program. For applications that do not actually process the individual characters of a text string, length in characters is probably not useful.
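The difference between the two lengths is easy to see in a small sketch. The helper name `lengths` is hypothetical, and note that Python's `len` on a string counts Unicode code points, which is only an approximation of "characters" as a user perceives them.

```python
from typing import Tuple

def lengths(text: str, encoding: str) -> Tuple[int, int]:
    """Return (length in code points, length in bytes) for text in encoding."""
    return len(text), len(text.encode(encoding))

sample = "\u65e5\u672c\u8a9e"  # three CJK characters

for charset in ("utf-8", "utf-16-le", "shift_jis"):
    chars, nbytes = lengths(sample, charset)
    print(charset, chars, nbytes)
```

The character count stays at 3 in every case, while the byte count is 9 in UTF-8 and 6 in UTF-16 and Shift_JIS, so buffer sizing must use the byte length of the actual encoding.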
To Compare Strings
Strings are compared for a number of purposes. An input string can be matched against a list of actions to determine whether a task has been initiated. User-entered search strings are compared against a body of text to find matching data. Strings are collated based on the results of a comparison.
To compare two strings successfully, they must be in the same charset. If they are not, both strings should be converted to a common charset, which should be a superset of all the charsets the program can encounter. Some programs work with a restricted set of charsets, such as those covering the Japanese scripts. For software set up to work with any of the major charsets in the world, it is safest to choose an encoding of Unicode, such as UTF-8 or UTF-16, as the common charset.
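As a minimal sketch of comparison through a common charset, the following decodes both inputs to Unicode before comparing. The function name `equal_across_charsets` and the sample data are assumptions for illustration only.

```python
def equal_across_charsets(a: bytes, a_charset: str,
                          b: bytes, b_charset: str) -> bool:
    """Compare two encoded strings by converting both to Unicode first."""
    return a.decode(a_charset) == b.decode(b_charset)

text = "\u65e5\u672c\u8a9e"
as_sjis = text.encode("shift_jis")
as_eucjp = text.encode("euc_jp")

# The raw bytes differ, so a byte-wise comparison reports a mismatch.
print(as_sjis == as_eucjp)

# After conversion to the common charset the strings compare equal.
print(equal_across_charsets(as_sjis, "shift_jis", as_eucjp, "euc_jp"))
```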
Unicode needs special processing, however, because it can represent the same character in several different ways. For example, the character ü can be U+00FC or the combination U+0075 U+0308, but to a user the character is the same and should always match, regardless of the underlying values. To achieve the expected results, Unicode data must be normalized; that is, only one representation is allowed for each character, and the data is converted to that representation. There is more than one way to normalize text: the representation chosen for ü could be either the single value U+00FC or the combination U+0075 U+0308. Unicode defines several normalization forms, and a program should use only one of them throughout its processing. For more information on Unicode normalization, see Unicode Technical Report #15.
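The ü example can be reproduced with Python's standard `unicodedata` module. This is a sketch under the assumption that the program has settled on one form (NFC by default here); the helper name `same_text` is hypothetical.

```python
import unicodedata

precomposed = "\u00fc"   # single code point U+00FC
decomposed = "u\u0308"   # U+0075 followed by combining diaeresis U+0308

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Without normalization the two representations do not match.
print(precomposed == decomposed)

# With either form applied consistently to both sides, they match.
print(same_text(precomposed, decomposed, "NFC"))
print(same_text(precomposed, decomposed, "NFD"))
```

Which form is chosen matters less than applying the same form to every string that takes part in a comparison.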
Sometimes data requires another type of processing called canonicalization. Canonicalization is needed in situations where two different characters must compare the same. An example is changing all the characters to lower-case for case-insensitive matching. Not all writing systems have case, but there are many different forms of canonicalization. In Hebrew for example, certain accents and points can be ignored for comparison.
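Two simple forms of canonicalization can be sketched as follows. The function names are hypothetical, and removing every nonspacing mark is a deliberately crude stand-in for the language-specific rules a real product would need; it is shown only to illustrate the idea of ignoring Hebrew points.

```python
import unicodedata

def fold_case(text: str) -> str:
    """Canonicalize for case-insensitive matching."""
    return text.casefold()

def strip_marks(text: str) -> str:
    """Drop all nonspacing marks (category Mn) after decomposing the text.

    Assumption: every mark may be ignored, which is rarely true in practice.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# The Hebrew word "shalom" written with points and without them.
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
unpointed = "\u05e9\u05dc\u05d5\u05dd"

print(fold_case("Hello") == fold_case("HELLO"))
print(strip_marks(pointed) == unpointed)
```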
To Move Strings From One Place to Another
Programs often move strings from one place to another. For example, user input strings are retrieved and stored in a database. In some cases, the text must be converted from one character encoding to another during this process. If a product is handling data in all the major charsets, it makes sense to store and process data in a Unicode encoding. So when the data is retrieved from the user interface, it is first converted from its original charset to a Unicode encoding, and then stored in a database.
Usually it is not enough to simply convert a string into a Unicode encoding and store it. If a user wants to retrieve the string for viewing and does not have a configuration that supports the display of Unicode encodings, the string must be converted into an appropriate charset. It might not be necessary to convert it back into its original charset, but it is important to know what charset can support the characters in the string. Either the original charset, or more commonly, the language of the data, should be stored with the string.
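The store-and-retrieve flow described above might look like the following sketch. The record layout (`StoredString`), the choice of UTF-8 as the storage encoding, and the fallback to the original charset are all assumptions made for illustration, not a prescribed design.

```python
from dataclasses import dataclass

@dataclass
class StoredString:
    data: bytes           # text converted to UTF-8 for storage
    source_charset: str   # charset the text arrived in
    language: str         # language tag kept alongside the text

def store(raw: bytes, charset: str, language: str) -> StoredString:
    """Convert incoming text to a Unicode encoding and keep its metadata."""
    text = raw.decode(charset)
    return StoredString(text.encode("utf-8"), charset, language)

def retrieve(record: StoredString, display_charset: str) -> bytes:
    """Convert stored text for display, falling back to the original charset
    when the requested one cannot represent every character."""
    text = record.data.decode("utf-8")
    try:
        return text.encode(display_charset)
    except UnicodeEncodeError:
        return text.encode(record.source_charset)

raw_input = "\u65e5\u672c\u8a9e".encode("euc_jp")
record = store(raw_input, "euc_jp", "ja")
print(retrieve(record, "shift_jis"))
print(retrieve(record, "latin-1") == raw_input)
```

Keeping the source charset (or the language) with the record is what makes the fallback possible when the display configuration cannot handle the stored characters.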

Command Line Interface

Strings can be parameters to the command on the command line. They can be in files that are taken as input or typed in directly. They can also be returned as part of executing the command. Delimiters in this case are usually whitespace, though if the strings are contained within a file, they could be delimited with some other designated character. String data can be restricted to an encoding of Unicode or the default charset of the locale for the terminal window. For more information, see Section 4.3.1.

Character Interface

String data can be input into a character interface. The data is probably in the default charset of the current locale, and needs to be handled accordingly (see Section 4.3.1). If the string is limited to a specific byte length, special processing might be necessary to ensure that only complete multibyte characters are read into the string buffer. Output strings can need more room for display in other languages, and any truncation to a display length must be done at character boundaries. Sorted output is displayed in the logical order for the locale.
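One way to keep only whole characters when a byte limit applies is to use an incremental decoder, which holds back an incomplete trailing sequence instead of emitting it. This is a sketch assuming well-formed input in a stateless encoding; the function name `complete_prefix` is hypothetical.

```python
import codecs

def complete_prefix(data: bytes, limit: int, encoding: str) -> str:
    """Decode at most `limit` bytes of data, dropping a trailing partial
    multibyte character rather than splitting it."""
    decoder = codecs.getincrementaldecoder(encoding)()
    # final=False tells the decoder that more input may follow, so an
    # incomplete sequence at the end is buffered and not returned.
    return decoder.decode(data[:limit], final=False)

utf8_bytes = "\u65e5\u672c\u8a9e".encode("utf-8")       # 3 bytes per character
sjis_bytes = "\u65e5\u672c\u8a9e".encode("shift_jis")   # 2 bytes per character

print(len(complete_prefix(utf8_bytes, 4, "utf-8")))      # one whole character
print(len(complete_prefix(sjis_bytes, 3, "shift_jis")))  # one whole character
```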

Graphical Interface

Similar to the character interface, graphical interfaces take strings as input data. Usually, graphical interfaces have more control over the charset of the input data. The length, delimiter and sort issues are the same as in character interfaces.

Application Protocols

Protocols are used to transport strings from application to application. They either allow for specification of the charset of the string data within the protocol, or require the data to be in a specific charset. Delimiters are also defined in the protocol and should accommodate all allowed string data.
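As an illustration of a protocol that carries the charset with the data, the sketch below reads a MIME-style `Content-Type` header value. MIME is used here only as a familiar example and is not named in this section; the function name `decode_payload` and the UTF-8 default are assumptions.

```python
from email.message import Message

def decode_payload(content_type: str, payload: bytes,
                   default_charset: str = "utf-8") -> str:
    """Decode payload using the charset parameter of a Content-Type value,
    or default_charset when the header does not specify one."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or default_charset
    return payload.decode(charset)

body = "\u65e5\u672c\u8a9e".encode("shift_jis")
print(decode_payload("text/plain; charset=Shift_JIS", body) == "\u65e5\u672c\u8a9e")
print(decode_payload("text/plain", "abc".encode("utf-8")))
```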

Storage and Interchange

Strings can be stored in their entirety or parsed into relevant pieces and stored. Either they are converted and stored in a single specified charset (Unicode encoding) with language information, or they are stored along with their charset identifier. The storage format specifies a delimiter. File formats accommodate string data in a similar way, either forcing a Unicode encoding or taking the data in the locale encoding. Most file formats do not include charset information; this is managed external to the file.
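A storage format that keeps the charset identifier with each string, and uses lengths rather than a delimiter byte, could be sketched as below. The record layout (a one-byte name length, a four-byte body length, the charset name, then the body) is invented for this example and does not correspond to any particular file format.

```python
import struct
from typing import Tuple

HEADER = ">BI"  # charset-name length (1 byte), body length (4 bytes), big-endian

def pack_record(text: str, charset: str) -> bytes:
    """Serialize text in the given charset together with the charset name."""
    name = charset.encode("ascii")
    body = text.encode(charset)
    return struct.pack(HEADER, len(name), len(body)) + name + body

def unpack_record(buf: bytes) -> Tuple[str, str]:
    """Return (text, charset) from a record produced by pack_record."""
    name_len, body_len = struct.unpack_from(HEADER, buf, 0)
    start = struct.calcsize(HEADER)
    charset = buf[start:start + name_len].decode("ascii")
    body = buf[start + name_len:start + name_len + body_len]
    return body.decode(charset), charset

# UTF-16 data contains null bytes, which a length-based format handles safely.
record_bytes = pack_record("A\u65e5", "utf-16-le")
print(unpack_record(record_bytes))
```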

Application Programming Interfaces (APIs)

APIs specify string delimiters or a length. They can handle strings in all different charsets or restrict them to a specific one. If the API converts from a charset into a Unicode encoding, it should follow one of the normalization forms in the Unicode Technical Report #15. Canonicalization can also be performed by the API and is part of the specification. Searching is conducted by the API on the normalized and in some cases canonicalized strings.
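An API that makes its normalization and canonicalization behavior part of its specification might expose them as explicit parameters. The following is only a sketch; the function name `contains` and its parameters are hypothetical, and case folding stands in for whatever canonicalization a real API would document.

```python
import unicodedata

def contains(haystack: str, needle: str,
             form: str = "NFC", ignore_case: bool = False) -> bool:
    """Search for needle in haystack.

    Both strings are normalized to `form` first. When ignore_case is True,
    they are also case-folded and renormalized before the search.
    """
    def prepare(text: str) -> str:
        text = unicodedata.normalize(form, text)
        if ignore_case:
            text = unicodedata.normalize(form, text.casefold())
        return text

    return prepare(needle) in prepare(haystack)

city = "Zu\u0308rich"  # decomposed form of the u with diaeresis

print(contains(city, "Z\u00fcr"))                     # normalized match
print(contains(city, "\u00dcR"))                      # case differs
print(contains(city, "\u00dcR", ignore_case=True))    # canonicalized match
```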

Requirements for Compliance

For all interfaces, providers must not truncate strings in the middle of a multibyte character. They must normalize string data that is to be compared.
For all display interfaces, consumers must display, search, and sort results in the locale-specific order of the user or, if that is not available, in the order of the language or locale of the string data itself.

Command Line Interface

Providers must specify whether a Unicode encoding or locale charsets are accepted. If locale charsets are accepted, then all supported locale charsets must be accepted. Providers must specify the string delimiters used.
Consumers must supply string data in the charsets accepted by the provider. They must use the proper delimiters and adhere to length limits set by the provider. If needed, they must include charset information.

Character Interface

Providers must accept string data in the locale charset.
Consumers must have sufficient space for full string display or a meaningful string truncation.

Graphical Interface

Providers must specify whether a Unicode encoding or locale charsets are accepted. If locale charsets are accepted, then all supported locale charsets must be accepted.
Consumers must ensure that displayed strings are not truncated, or if necessary, are truncated in a meaningful position.

Application Protocols

Providers must accommodate string data in supported locale charsets with charset identification, or specify a Unicode encoding, and where relevant, the language of the string data.
Consumers must supply provider protocols with strings encoded in the appropriate charset and charset descriptions or language data, where allowed in the protocol.

Storage and Interchange

Providers must be able to store strings either in all supported charsets, or in a specified Unicode encoding. They must provide a mechanism for associating charset or language information in the case of storage.
Consumers must parse strings on appropriate boundaries, characters, words, or phrases, whichever is relevant. They must include charset or language information as necessary for proper processing and retrieval. They must externally manage charset information when it is not internal to the interchange format.

Application Programming Interfaces

Providers must specify whether strings are to be in platform supported charsets or in a Unicode encoding. They must have parameters for charset or language information, where relevant. If converting from a charset into Unicode, they must follow one of the normalization forms in the Unicode Technical Report #15 and specify which one. If search functions are provided, they must specify whether fuzzy matching is performed and should allow for some configuration of fuzzy matching. If canonicalization is performed, the exact method must be specified.
Consumers must provide charset or language information as necessary. Where relevant, they should specify the type of fuzzy matching in a search.
 