 
Architecture, Design and Testing

Sun Software Product Internationalization Taxonomy

 
 

4.2.4.1 Lexical and Grammatical


Description

The following table describes terms that are used in this section.
Table 4-8. Terms and Definitions
Term          Description
word          Basic sub-structure of a sentence which has meaning.
lexical       Relates to the words and vocabulary of a language.
morpheme      Unit of meaning which can be a word or part of a word.
morphology    Study of the structure and content of words.
grammatical   Relates to the way words are put together in a language.
syntactic     Relates to the way words are put together to form phrases or sentences.
semantic      Meaning of a sentence or word, based on the situation and the people, not just grammar.
When used in the same context, "lexical" and "grammatical" refer to language processing on a level higher than the character level. This section examines words and sentences as input and output. The following issues relate to lexical and grammatical processing:
Formatted Sentences
Modern applications often produce message strings, such as:
name + " will be on the " + time + " bus to " + city
where name, time, and city are calculated somewhere else and put into a predefined sentence structure. This is a grammatical or syntactic issue. In other languages, the order of these five items might be rearranged. The nature of the time variable might be different for each locale. To display sentences of this type for different locales, strings must use formatted output.
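As a minimal sketch of formatted output, the sentence can be stored as one template per locale with named placeholders, so that a translator can reorder the pieces without code changes. The template table, the locale keys, and the function name bus_message are illustrative assumptions, and the German template is only a sample translation.

```python
# Hypothetical per-locale templates. In a real product these would live in
# message catalogs rather than in source code.
TEMPLATES = {
    "en": "{name} will be on the {time} bus to {city}",
    "de": "Um {time} fährt {name} mit dem Bus nach {city}",
}

def bus_message(locale, name, time, city):
    """Fill the locale's template; placeholders may appear in any order."""
    return TEMPLATES[locale].format(name=name, time=time, city=city)

print(bus_message("en", "Ana", "9:15", "Lyon"))
print(bus_message("de", "Ana", "9:15", "Lyon"))
```

In this sketch the time value is passed in as an already formatted string. Because the nature of the time variable can differ per locale, a fuller version would also format the time with locale-specific rules before substitution.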
Formatted Words
Applications can process words themselves. For example, to form the plural of English words, the application can add the appropriate suffix. Other languages have different structures. For example, Arabic has singular, dual, and plural forms, and Chinese has no plurals. Words in different languages must be manipulated in different ways.
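One way to avoid hard-coding the English suffix rule is to select a complete message form by plural category. The sketch below is an assumption-laden illustration: the category names, the FORMS table, and both function names are hypothetical, and the Arabic rule is deliberately reduced to the singular, dual, and plural distinction described above (real Arabic plural rules have more categories).

```python
def plural_category(lang, n):
    """Return a coarse plural category for the count n (simplified rules)."""
    if lang == "zh":
        return "other"                      # Chinese nouns do not inflect for number
    if lang == "ar":
        return {1: "one", 2: "two"}.get(n, "other")  # singular, dual, plural only
    if lang == "en":
        return "one" if n == 1 else "other"
    raise ValueError(f"no plural rule for {lang!r}")

# Hypothetical message forms, one complete string per category.
FORMS = {
    "en": {"one": "{n} file", "other": "{n} files"},
    "zh": {"other": "{n} 个文件"},
}

def count_message(lang, n):
    """Pick the whole message by category instead of appending a suffix."""
    return FORMS[lang][plural_category(lang, n)].format(n=n)

print(count_message("en", 1))   # 1 file
print(count_message("en", 3))   # 3 files
print(count_message("zh", 3))   # 3 个文件
```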
Spell Checkers
Users who write documents in two or more languages require spell checkers for each of those languages. Users might need to access multiple dictionaries at the same time, for example, Spanish and English. The dictionary should not depend on any particular encoding. Spell checkers need to take into consideration lexical structures like suffixes or prefixes. In English, for example, a spell checker might remove the prefix un- from a word before looking it up. For information on issues that affect other languages, see "Word Breaking and Hyphenation." The spell checker must be generic enough to handle these issues.
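The look-up described above can be sketched as follows. The word lists, the PREFIXES table, and the function name is_known are all hypothetical, and the dictionaries are tiny placeholders. The sketch works on Unicode strings rather than encoded bytes, so the dictionaries do not depend on any particular encoding.

```python
# Tiny illustrative word lists keyed by language; real dictionaries are far larger.
DICTIONARIES = {
    "en": {"happy", "house", "do"},
    "es": {"casa", "feliz"},
}

# Hypothetical per-language prefixes that may be stripped before look-up.
PREFIXES = {"en": ("un",), "es": ()}

def is_known(word, languages):
    """Return True if the word is found in any of the active dictionaries."""
    w = word.casefold()
    for lang in languages:
        words = DICTIONARIES[lang]
        if w in words:
            return True
        # Retry with each known prefix removed, for example un- in English.
        for prefix in PREFIXES[lang]:
            if w.startswith(prefix) and w[len(prefix):] in words:
                return True
    return False

print(is_known("unhappy", ["en", "es"]))  # True
print(is_known("casa", ["en", "es"]))     # True
print(is_known("casa", ["en"]))           # False
```

Naive prefix stripping accepts any dictionary word with un- in front of it, so a production checker would need per-word rules about which affixes are allowed.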
Word and Character Count
Some languages do not mark word boundaries with spaces, and characters can be compound or multi-byte. Chinese, for example, presents word count problems, because text length is usually measured in characters rather than words. Counting words requires a dictionary look-up mechanism.
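The following sketch shows why a single "count" is ambiguous. It reports code points, code points after composition, UTF-8 bytes, and space-delimited words for the same text. The function name text_counts and the sample strings are illustrative assumptions.

```python
import unicodedata

def text_counts(text):
    """Count the same text in several ways that are easy to confuse."""
    composed = unicodedata.normalize("NFC", text)  # merge base + combining marks
    return {
        "code_points": len(text),
        "composed_code_points": len(composed),
        "utf8_bytes": len(text.encode("utf-8")),
        "space_delimited_words": len(text.split()),
    }

# Four Chinese characters, three bytes each in UTF-8, and no spaces at all.
print(text_counts("我爱北京"))

# "café" written with a separate combining accent: a compound character.
print(text_counts("cafe\u0301"))
```

Splitting on white space reports one word for the Chinese sample, which is why a dictionary-based approach is needed for word counts in such languages.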
Word Breaking and Hyphenation
In English, word boundaries are marked by white space. This is not true in other languages, such as Thai. While breaking along word boundaries can be accomplished using dictionary look-up or white space delimiters, breaking in the middle of a word for hyphenation is more difficult. For some languages, it might not be a serious problem if a word is broken in the wrong place. In complex text layout (CTL) languages, however, it can change the way the word is rendered. Consider, for example, bi-directional languages and multi-language texts. Users might not always work in a single language. In Chinese, hyphenation is not a problem, because words can be broken along any character boundary.
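Breaking along word boundaries by dictionary look-up can be sketched with a greedy longest-match scan. This is one simple technique chosen for illustration, not necessarily what any particular product uses. The function name segment and the sample dictionaries are assumptions.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation.

    At each position, take the longest dictionary word that matches.
    A character that starts no dictionary word becomes a one-character token.
    """
    longest = max(map(len, dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

print(segment("我爱北京", {"我", "爱", "北京"}))
# ['我', '爱', '北京']

print(segment("thecatsat", {"the", "cat", "cats", "sat", "at"}))
# ['the', 'cats', 'at']
```

The second call shows the weakness of the greedy strategy: the intended reading "the cat sat" is missed because "cats" is matched first. Segmenters used in practice usually add frequency data or backtracking to reduce such errors.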
Justification and Orientation
Some languages, like Arabic and Hebrew, are bi-directional. Words are read from right to left, but numbers are read from left to right. Embedded text in a left-to-right language is also read from left to right. Consider how text is justified in each direction. Chinese and Japanese are often written vertically, in which case parentheses and punctuation are re-oriented. Embedded text in other languages is also rotated in vertical writing.
Grammar Checkers
Grammar varies widely from language to language. In some languages, every word in a sentence has a contextual suffix; in others, suffixes are never appended. When designing and coding a grammar checker, use the greatest common denominator. If you have word class files, make them generic so that grammar checkers for other languages can be used. Have a parser that acts as a base engine with a plug-in module for each supported language.
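The base engine with per-language plug-ins can be sketched as a registry of rule functions. Everything here is hypothetical: the class GrammarEngine, the Rule type, and the toy English rule exist only to show the shape of the design, and the rule looks at spelling rather than pronunciation.

```python
from typing import Callable, Dict, List

# A rule receives the tokens of one sentence and returns problem descriptions.
Rule = Callable[[List[str]], List[str]]

class GrammarEngine:
    """Language-neutral base engine; language knowledge lives in plug-ins."""

    def __init__(self) -> None:
        self._plugins: Dict[str, List[Rule]] = {}

    def register(self, lang: str, rule: Rule) -> None:
        """Attach a rule to a language without changing the engine."""
        self._plugins.setdefault(lang, []).append(rule)

    def check(self, lang: str, tokens: List[str]) -> List[str]:
        """Run every rule registered for the language and collect the results."""
        problems: List[str] = []
        for rule in self._plugins.get(lang, []):
            problems.extend(rule(tokens))
        return problems

VOWELS = set("aeiou")

def english_a_an(tokens: List[str]) -> List[str]:
    """Toy English rule: flag "a" before a word that starts with a vowel letter."""
    return [
        f'use "an" before "{nxt}"'
        for cur, nxt in zip(tokens, tokens[1:])
        if cur.lower() == "a" and nxt[:1].lower() in VOWELS
    ]

engine = GrammarEngine()
engine.register("en", english_a_an)
print(engine.check("en", ["I", "ate", "a", "apple"]))
```

Because the engine only iterates over registered rules, support for another language is added by registering new rules under a new language key.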
Complex Text Layout (CTL)
CTL is a lexical issue. Text input can be based on sound, character shape, or even character position. From the input, a character output can be generated in the form of single characters, groups of characters, or words. As more input is received, the output can change based on the new context. For more information, see section 4.3.3.3.
Input Method Framework
In the past, input method frameworks were considered a character issue. For Chinese, Japanese, and Korean, input methods took several characters as input and produced a single character output. Now they are, in reality, a lexical issue; they can take character input and produce multi-character or word output. They are responsible for managing CTL. For more information, see section 4.3.3.2.
Bi-directional Text
Bi-directional processing, sometimes known as bi-di, is also a lexical issue. The system must know what kind of morpheme it is dealing with, so that it can justify and orient it in the right direction. In the case of a Hebrew string, which includes a price and product name in Latin script, the text begins on the right side of the screen for the Hebrew text and continues right to left. Once the Latin text is encountered, the direction switches from left to right. For rendering, this means displaying the first character of the Latin text, then moving it to the left. The next character is then displayed to the right of it. This continues until more Hebrew text is encountered. So, using uppercase for Hebrew and lowercase and numbers for Latin, the text "THIS IS HEBREW this is latin $10.95 MORE HEBREW." goes through the following stages:
...
WERBEH SI SIHT
t WERBEH SI SIHT
th WERBEH SI SIHT
...
.this is latin $10 WERBEH SI SIHT
this is latin $10.9 WERBEH SI SIHT
...
.WERBEH EROM this is latin $10.95 WERBEH SI SIHT
For the data stream, however, characters are in logical order:
Data comes in from this direction ===>
.WERBEH EROM 59.01$ nital si siht WERBEH SI SIHT
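The stages above can be reproduced with a deliberately simplified reordering sketch. It follows the convention of the example (uppercase stands for Hebrew, lowercase and digits for Latin) and is not an implementation of the full Unicode Bidirectional Algorithm. The function names and the neutral-resolution rule are assumptions made for illustration.

```python
def direction(ch):
    """Toy classification that follows the convention of the example."""
    if ch.isupper():
        return "R"                       # stands in for Hebrew
    if ch.islower() or ch.isdigit():
        return "L"                       # Latin letters and digits
    return None                          # spaces and punctuation are neutral

def visual_order(logical, base="R"):
    """Convert logical (data stream) order to left-to-right display order."""
    dirs = [direction(ch) for ch in logical]
    n, i = len(dirs), 0
    # Resolve each neutral run: it takes the direction of its neighbours when
    # they agree, otherwise the base direction. Text edges count as base.
    while i < n:
        if dirs[i] is None:
            j = i
            while j < n and dirs[j] is None:
                j += 1
            before = dirs[i - 1] if i > 0 else base
            after = dirs[j] if j < n else base
            dirs[i:j] = [before if before == after else base] * (j - i)
            i = j
        else:
            i += 1
    # Group characters into runs that share one direction.
    runs = []
    for ch, d in zip(logical, dirs):
        if runs and runs[-1][0] == d:
            runs[-1][1].append(ch)
        else:
            runs.append((d, [ch]))
    # Right-to-left runs are mirrored; with an R base the run order flips too.
    pieces = ["".join(chars if d == "L" else reversed(chars)) for d, chars in runs]
    if base == "R":
        pieces.reverse()
    return "".join(pieces)

text = "THIS IS HEBREW this is latin $10.95 MORE HEBREW."
print(visual_order(text))
# .WERBEH EROM this is latin $10.95 WERBEH SI SIHT
```

Calling visual_order on successive prefixes of the logical string yields the intermediate stages, including the moment when the period after $10 is displayed on the left until the following digit arrives.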

Command Line Interface

The command line accepts lexical and syntactic data as input and returns lexical and syntactic data.

Character Interface

A character interface, like the command line, takes character input and produces character output. For lexical issues, the boundary between character and word is blurred. This can involve input method editors for languages such as Chinese, Japanese, and Korean, and complex text layout (CTL) for languages such as Tamil and Thai. CTL is part of the morphological level as well as the character level.

Graphical Interface

For graphical interfaces, lexical and grammatical data can be layered onto graphical objects, adding a layer of complexity to lexical and grammatical handling.

Application Protocols

Protocols can include strings of words as part of the protocol stream or identify sentence data.

Storage and Interchange

Storage and interchange formats can encapsulate word or sentence data.

Application Programming Interfaces (APIs)

APIs can take lexical or syntactic data as parameters to calls and return lexical or syntactic data.

Requirements for Compliance

In general, providers must supply functions that can accommodate the lexical and syntactic manipulation needs of the consumer. Consumers must use provider functions and manage the word and sentence structure so as to accommodate data in any of the provider locales. This means that the consumer must supply the provider functions with required information for correctly processing the data.

Command Line Interface

Providers must supply parsing functions for reading in and returning lexical and syntactic data to the command line in any language they support.
Consumers must use provider-supplied lexical and syntactic functions, making sure to accommodate the lexical and syntactic structures of other languages.

Character Interface

Providers must supply input method (word forming) functions for reading in and returning word and sentence data to the character interface in any character set they support.
Consumers must use provider-supplied input method (word forming) functions, making sure to accommodate multi-byte characters for input and output, as well as single-byte characters.

Graphical Interface

Providers must supply lexical functions for managing character data with various elements of the graphical user interface (GUI), such as buttons, drop-down lists, and title bars. These functions must accommodate all supported character sets.
Consumers must use provider functions for creating the GUI, supplying language data for proper handling.

Application Protocols

Providers must construct the protocol so as to accommodate the necessary lexical data in a specified format.
Consumers must implement the protocol with all related character information, including charset, language, and locale.

Storage and Interchange

Providers must allow for storage of any lexical and syntactic data, supplying formats that contain relevant information for proper retrieval.
Consumers must include all relevant information in the storage and interchange formats so that lexical and syntactic data in any charset can be properly retrieved.
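As an illustration of including the relevant information with stored data, the sketch below writes a small header that records the charset and language next to the encoded text, so that the text can be decoded correctly on retrieval. The record layout (a JSON header line followed by the payload) and the function names pack and unpack are assumptions for this example, not a format defined by this taxonomy.

```python
import json

def pack(text, charset, language):
    """Serialize text together with the metadata needed to decode it later."""
    payload = text.encode(charset)
    header = json.dumps(
        {"charset": charset, "language": language, "length": len(payload)}
    ).encode("ascii")
    # The JSON header contains no newline, so the first newline ends the header.
    return header + b"\n" + payload

def unpack(record):
    """Recover the text and its language by using the stored charset."""
    header, _, payload = record.partition(b"\n")
    meta = json.loads(header.decode("ascii"))
    text = payload[: meta["length"]].decode(meta["charset"])
    return text, meta["language"]

record = pack("こんにちは", "shift_jis", "ja")
print(unpack(record))
```

Without the recorded charset, the same payload bytes could not be reliably told apart from text in another encoding.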

Application Programming Interfaces

Providers must supply interfaces that accommodate any lexical and syntactic data, where relevant.
Consumers must supply relevant lexical or semantic data descriptions to the API functions to properly process lexical or syntactic data.