4.3.2.2 Strings (Encoding Methods and Transcoding)
Description
Strings are the primary text element processed by software. Programs need to:
To Determine String Boundaries
Finding the beginning of a string is fairly straightforward; essentially the first byte in a given parameter,
variable, or object can be safely assumed to be the start. However, determining where the string ends is much more difficult. You
can do this in any of the following ways:
- Look for a null (x'00') byte - This is effective in most cases, but there are certain Unicode character
encoding schemes (UTF-16, UCS-4, UCS-2) that contain nulls as part of a character.
- Use a length value - A pre-determined length value can be provided as an additional parameter, dictating the
number of bytes in the string.
- Look for a particular delimiter - While similar to inspecting for a null byte, this tends to be more difficult
to implement. If the delimiter is a single byte value outside of the range x'00'-x'7F', it can appear as part of a multibyte
character in many character encoding schemes. Even a delimiter in the range x'20'-x'7F' is embedded inside characters in several
7-bit encoding schemes and the entire 7-bit range x'00'-x'7F' is used to make up multibyte Unicode characters in UTF-16, UCS-4,
and UCS-2.
- Find a language-related punctuation mark or whitespace - For certain text processing products, this method of
boundary determination is basic functionality; however, this requires a tremendous amount of supporting information, including
language, charset, punctuation mark byte sequences per language/charset combination, and more. Programs must also handle the
parsing of textual data in different charsets.
If textual data is restricted to a certain charset, then it is possible to look for a particular
delimiter.
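The null-byte and delimiter pitfalls listed above can be illustrated with a minimal Python sketch. The helper name `split_decoded` and the sample strings are assumptions made for illustration; the point is only that searching raw bytes can cut a character, while decoding with the known charset first does not.

```python
# UTF-16 stores "Hi" with a null byte after each ASCII character,
# so a byte-wise search for x'00' ends the string after the first byte.
utf16 = "Hi".encode("utf-16-le")
first_null = utf16.find(b"\x00")

# In Shift_JIS the katakana "ソ" is the two bytes x'83' x'5C', and x'5C'
# is also the backslash, so a byte-wise split on backslash cuts the character.
data = "ソa\\b".encode("shift_jis")
naive_pieces = data.split(b"\\")


def split_decoded(data, delimiter, charset):
    """Hypothetical helper: decode with the known charset, then split on characters."""
    return data.decode(charset).split(delimiter)


safe_pieces = split_decoded(data, "\\", "shift_jis")
```

Here `naive_pieces` has three fragments, the first being an orphaned lead byte, while `safe_pieces` has the two intended strings.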
To Calculate String Length
String length can mean two different things: physical length in bytes and conceptual length in characters. Both
concepts of length are important to string handling. It is for the specific application to determine which length is needed at any
given point in the program. For applications that do not actually process the individual characters of a text string, length in
characters is probably not useful.
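As a small illustration of the two kinds of length, the following Python sketch measures the same three-character string in several encodings. The sample text is an assumption, and "characters" here means Unicode code points; a user-perceived character built from combining sequences can span more than one code point.

```python
text = "日本語"

# Conceptual length: number of characters (code points) in the string.
length_in_characters = len(text)

# Physical length: number of bytes, which depends on the encoding.
length_in_bytes = {
    charset: len(text.encode(charset))
    for charset in ("utf-8", "utf-16-le", "shift_jis", "euc_jp")
}
```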
To Compare Strings
Strings are compared for a number of purposes. An input string can be matched against a list of actions to determine
whether a task has been initiated. User-entered search strings are compared against a body of text to find matching data. Strings
are collated based on the results of a comparison.
To compare two strings successfully, both must be in the same charset. Some programs work with a restricted
set of charsets, such as those covering the Japanese scripts. If the strings are not already in a common charset, both should be
converted to it. The common charset should be a superset of all the charsets the program can encounter. For
software set up to work with any of the major charsets in the world, it is safest to choose an encoding of Unicode, such as UTF-8
or UTF-16, as the common charset.
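A minimal sketch of comparing through a common charset, assuming each byte string arrives with a known charset label. The function name `same_text` is hypothetical. Python's `str` type serves as the common Unicode representation here.

```python
def same_text(a, a_charset, b, b_charset):
    """Hypothetical helper: decode both byte strings to Unicode, then compare."""
    return a.decode(a_charset) == b.decode(b_charset)


# The same Japanese text in two different charsets has different bytes.
euc = "日本語".encode("euc_jp")
sjis = "日本語".encode("shift_jis")

bytes_match = euc == sjis
text_matches = same_text(euc, "euc_jp", sjis, "shift_jis")
```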
Unicode needs special processing, however, due to its ability to represent the same character in several different
ways. For example, the character ü can be U+00FC or the combination U+0075 U+0308, but to a user, the character is the same
and should always match, regardless of the underlying values. To achieve the expected results, Unicode data must be
normalized; that is, only one of the representations for each character is allowed and the data is converted to that set of
representations. There is more than one way to normalize the text: for example, the representation chosen for ü
could be either the single value U+00FC or the combination of values U+0075 U+0308. Unicode contains definitions of different
normalization forms. A program should use only one of the forms throughout its processing. For more information on Unicode
normalization, see Unicode Technical Report #15 (now published as Unicode Standard Annex #15).
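The ü example can be reproduced with Python's standard `unicodedata` module. This is a sketch only; the helper name `equal_normalized` and the choice of NFC as the default form are assumptions, and a program would pick whichever form suits it and apply it consistently.

```python
import unicodedata

composed = "\u00fc"      # ü as the single value U+00FC
decomposed = "u\u0308"   # ü as U+0075 followed by U+0308

# Without normalization the two representations do not compare equal.
raw_equal = composed == decomposed


def equal_normalized(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both to one form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


nfc_equal = equal_normalized(composed, decomposed, "NFC")
nfd_equal = equal_normalized(composed, decomposed, "NFD")
```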
Sometimes data requires another type of processing called canonicalization. Canonicalization is needed in
situations where two different characters must compare the same. An example is changing all the characters to lower-case for
case-insensitive matching. Not all writing systems have case, but there are many different forms of canonicalization. In Hebrew,
for example, certain accents and points can be ignored for comparison.
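Two forms of canonicalization can be sketched in Python as follows. The function names are hypothetical, and removing every nonspacing mark (general category `Mn`) is a deliberately broad assumption: it drops Hebrew points, but it would also drop the diaeresis from ü, so a real product would choose the set of ignored marks per language.

```python
import unicodedata


def fold_case(text):
    """Case canonicalization using Unicode case folding."""
    return text.casefold()


def strip_marks(text):
    """Assumed rule: decompose, then drop all nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The Hebrew word "shalom" written with and without points.
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
plain = "\u05e9\u05dc\u05d5\u05dd"
```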
To Move Strings From One Place to Another
Programs often move strings from one place to another. For example, user input strings are retrieved and stored in a
database. In some cases, the text must be converted from one character encoding to another during this process. If a product is
handling data in all the major charsets, it makes sense to store and process data in a Unicode encoding. So when the data is
retrieved from the user interface, it is first converted from its original charset to a Unicode encoding, and then stored in a
database.
Usually it is not enough to simply convert a string into a Unicode encoding and store it. If a user wants to retrieve
the string for viewing and does not have a configuration that supports the display of Unicode encodings, the string must be
converted into an appropriate charset. It might not be necessary to convert it back into its original charset, but it is important
to know what charset can support the characters in the string. Either the original charset, or more commonly, the language of the
data, should be stored with the string.
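The store-and-retrieve flow described above might look like the following Python sketch. The record layout, the names `StoredString`, `store`, and `retrieve`, and the fallback to the original charset are all assumptions made for illustration, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass
class StoredString:
    data: bytes            # text held in a Unicode encoding (UTF-8 here)
    original_charset: str  # charset the text arrived in
    language: str          # language tag kept with the string


def store(raw, charset, language):
    """Convert incoming bytes to UTF-8 and keep charset and language with them."""
    text = raw.decode(charset)
    return StoredString(text.encode("utf-8"), charset, language)


def retrieve(record, display_charset):
    """Return (bytes, charset) for display, falling back to the original charset."""
    text = record.data.decode("utf-8")
    try:
        return text.encode(display_charset), display_charset
    except UnicodeEncodeError:
        # The requested charset cannot represent the text; the original one can.
        return text.encode(record.original_charset), record.original_charset


raw = "日本語".encode("euc_jp")
record = store(raw, "euc_jp", "ja")
```

Keeping the original charset with the record is what makes the fallback possible when the display charset cannot represent the stored characters.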
Command Line Interface
Strings can be parameters to the command on the command line. They can be in files that are taken as input or typed
in directly. They can also be returned as part of executing the command. Delimiters in this case are usually whitespace, though if
the strings are contained within a file, they could be delimited with some other designated character. String data can be
restricted to an encoding of Unicode or the default charset of the locale for the terminal window. For more information, see Section 4.3.1.
Character Interface
String data can be input into a character interface. The data is probably in the default charset of the current
locale, and needs to be handled accordingly (see Section 4.3.1). If the string is limited to a
specific byte length, special processing might be necessary to ensure that only entire character values of multibyte characters
are read into the string buffer. Output strings can need more room for display in other languages, and any display-length truncation must be
done at character boundaries. Sorted output is displayed in the logical order for the locale.
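One way to fit a string into a byte-limited buffer without splitting a multibyte character is sketched below. The function name `truncate_to_bytes` is hypothetical, and the approach assumes the charset's decoder can discard an incomplete trailing character, which holds for UTF-8. It works on code points, so it does not keep combining sequences together.

```python
def truncate_to_bytes(text, max_bytes, charset="utf-8"):
    """Hypothetical helper: encode text and keep only whole characters within max_bytes."""
    encoded = text.encode(charset)
    if len(encoded) <= max_bytes:
        return encoded
    # Cut at the byte limit, then drop any partial character left at the end
    # by decoding with errors ignored and encoding the result again.
    return encoded[:max_bytes].decode(charset, errors="ignore").encode(charset)


# Each character of this sample is three bytes in UTF-8.
four_byte_buffer = truncate_to_bytes("日本語", 4)
```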
Graphical Interface
Similar to the character interface, graphical interfaces take strings as input data. Usually, graphical interfaces
have more control over the charset of the input data. The length, delimiter and sort issues are the same as in character
interfaces.
Application Protocols
Protocols are used to transport strings from application to application. They either allow for specification of the
charset of the string data within the protocol, or require the data to be in a specific charset. Delimiters are also defined in
the protocol and should accommodate all allowed string data.
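As an illustration of carrying the charset inside a protocol, the sketch below uses an invented frame layout: a one-byte label length, an ASCII charset label, a four-byte payload length, and the payload. The layout and the names `pack_string` and `unpack_string` are assumptions, not part of any real protocol. A length prefix is used instead of a delimiter so that payloads containing null bytes, such as UTF-16 text, pass through intact.

```python
import struct


def pack_string(text, charset):
    """Build a frame: label length, charset label, payload length, payload."""
    label = charset.encode("ascii")
    payload = text.encode(charset)
    return (struct.pack("!B", len(label)) + label
            + struct.pack("!I", len(payload)) + payload)


def unpack_string(frame):
    """Read a frame produced by pack_string and return (text, charset)."""
    (label_len,) = struct.unpack_from("!B", frame, 0)
    charset = frame[1:1 + label_len].decode("ascii")
    (payload_len,) = struct.unpack_from("!I", frame, 1 + label_len)
    start = 1 + label_len + 4
    return frame[start:start + payload_len].decode(charset), charset


frame = pack_string("Hi", "utf-16-le")
```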
Storage and Interchange
Strings can be stored in their entirety or parsed into relevant pieces and stored. Either they are converted and
stored in a single specified charset (Unicode encoding) with language information, or they are stored along with their charset
identifier. The storage format specifies a delimiter. File formats accommodate string data in a similar way, either forcing a
Unicode encoding or taking the data in the locale encoding. Most file formats do not include charset information; this is managed
external to the file.
Application Programming Interfaces (APIs)
APIs specify string delimiters or a length. They can handle strings in all different charsets or restrict them to a
specific one. If the API converts from a charset into a Unicode encoding, it should follow one of the normalization forms in the
Unicode Technical Report #15. Canonicalization can also be
performed by the API and is part of the specification. Searching is conducted by the API on the normalized and in some cases
canonicalized strings.
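A search function that works on normalized and optionally canonicalized strings could be sketched as follows. The names `canonical_key` and `contains`, the parameters `form` and `ignore_case`, and the use of case folding as the only canonicalization are assumptions; an actual API would document its own choices.

```python
import unicodedata


def canonical_key(text, form="NFC", ignore_case=True):
    """Normalize, optionally case-fold, and normalize again after folding."""
    text = unicodedata.normalize(form, text)
    if ignore_case:
        text = unicodedata.normalize(form, text.casefold())
    return text


def contains(haystack, needle, form="NFC", ignore_case=True):
    """Hypothetical search: substring test on the canonical keys of both strings."""
    return (canonical_key(needle, form, ignore_case)
            in canonical_key(haystack, form, ignore_case))


# The needle uses the decomposed, lower-case form of the word in the haystack.
found = contains("Die T\u00dcR ist offen", "tu\u0308r")
found_case_sensitive = contains("Die T\u00dcR ist offen", "tu\u0308r", ignore_case=False)
```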
Requirements for Compliance
For all interfaces, providers must not truncate strings in the middle of a multibyte character. They must normalize
string data that is to be compared.
For all display interfaces, consumers must display, search, and sort results in the locale-specific order of the user
or, if that is not available, in the order of the language or locale of the string data itself.
Command Line Interface
Providers must specify whether a Unicode encoding or locale charsets are accepted. If locale charsets are accepted,
then all supported locale charsets must be accepted. Providers must specify the string delimiters used.
Consumers must supply string data in the charsets accepted by the provider. They must use the proper delimiters and
adhere to length limits set by the provider. If needed, they must include charset information.
Character Interface
Providers must accept string data in the locale charset.
Consumers must have sufficient space for full string display or a meaningful string truncation.
Graphical Interface
Providers must specify whether a Unicode encoding or locale charsets are accepted. If locale charsets are accepted,
then all supported locale charsets must be accepted.
Consumers must ensure that displayed strings are not truncated, or if necessary, are truncated in a meaningful
position.
Application Protocols
Providers must accommodate string data in supported locale charsets with charset identification, or specify a Unicode
encoding, and where relevant, the language of the string data.
Consumers must supply provider protocols with strings encoded in the appropriate charset and charset descriptions or
language data, where allowed in the protocol.
Storage and Interchange
Providers must be able to store strings either in all supported charsets, or in a specified Unicode encoding. They
must provide a mechanism for associating charset or language information in the case of storage.
Consumers must parse strings on appropriate boundaries, characters, words, or phrases, whichever is relevant. They
must include charset or language information as necessary for proper processing and retrieval. They must externally manage charset
information when it is not internal to the interchange format.
Application Programming Interfaces
Providers must specify whether strings are to be in platform supported charsets or in a Unicode encoding. They must
have parameters for charset or language information, where relevant. If converting from a charset into Unicode, they must follow
one of the normalization forms in the Unicode Technical Report #15 and specify which one. If search functions are provided, they must specify whether fuzzy matching is
performed and should allow for some configuration of fuzzy matching. If canonicalization is performed, the exact method must be
specified.
Consumers must provide charset or language information as necessary. Where relevant, they should specify the type of
fuzzy matching in a search.