4.2.4.1 Lexical and Grammatical
Description
The following table describes terms that are used in this section.
Table 4-8. Terms and Definitions
| Term | Description |
| word | Basic sub-structure of a sentence which has meaning. |
| lexical | Relates to the words and vocabulary of a language. |
| morpheme | Unit of meaning which can be a word or part of a word. |
| morphology | Study of the structure and content of words. |
| grammatical | Relates to the way words are put together in a language. |
| syntactic | Relates to the way words are put together to form phrases or sentences. |
| semantic | Relates to the meaning of a sentence or word, as determined by the situation and the speakers, not by grammar alone. |
When used in the same context, "lexical" and "grammatical" refer to language processing on a level higher than the
character level. This section examines words and sentences as input and output. The following issues relate to lexical and
grammatical processing:
Formatted Sentences
Modern applications often produce message strings, such as:
name + " will be on the " + time + " bus to " + city
where name, time, and city are computed elsewhere and inserted into a predefined sentence structure. This is a
grammatical or syntactic issue. In other languages, the order of these five items might be rearranged, and the format of the
time variable might differ for each locale. To display sentences of this type for different locales, build the strings with
formatted output rather than concatenation.
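As a minimal sketch of formatted output, the following Python fragment keeps one template per locale and fills in named
placeholders, so each translation can reorder the variables freely. The TEMPLATES table, the locale keys, the German wording,
and the bus_message function are assumptions made for illustration; they do not come from any real message catalog.

```python
# Hypothetical per-locale templates. Named placeholders let each
# translation place the variables wherever its grammar requires.
TEMPLATES = {
    "en": "{name} will be on the {time} bus to {city}.",
    "de": "{name} fährt nach {city} mit dem Bus um {time}.",
}


def bus_message(locale, name, time, city):
    """Fill the template for the locale, falling back to English."""
    template = TEMPLATES.get(locale, TEMPLATES["en"])
    # The caller passes a time string already formatted for the locale.
    return template.format(name=name, time=time, city=city)


print(bus_message("en", "Ana", "9:15", "Lima"))
print(bus_message("de", "Ana", "9.15 Uhr", "Lima"))
```

Note that the German template places the city before the time, which concatenation in a fixed order could not express.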
Formatted Words
Applications can process words themselves. For example, to form the plural of English words, the application can add the
appropriate suffix. Other languages have different structures. For example, Arabic has singular, dual, and plural forms, and
Chinese has no plurals. Words in different languages must be manipulated in different ways.
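The difference in number handling can be sketched as a per-language rule that selects a word form. The rules below are
deliberately simplified to the singular, dual, and plural cases named in the text (real Arabic number agreement has more
cases), and the FORMS table and function names are hypothetical.

```python
# Simplified, illustrative forms of the noun "book" in three languages.
FORMS = {
    "en": {"singular": "book", "plural": "books"},
    "ar": {"singular": "كتاب", "dual": "كتابان", "plural": "كتب"},
    "zh": {"invariant": "书"},
}


def number_category(lang, n):
    """Return a simplified grammatical-number category for n items."""
    if lang == "en":
        return "singular" if n == 1 else "plural"
    if lang == "ar":
        # Simplification: only the three forms mentioned in the text.
        if n == 1:
            return "singular"
        return "dual" if n == 2 else "plural"
    if lang == "zh":
        return "invariant"  # the noun is not inflected for number
    raise ValueError(f"unsupported language: {lang}")


def noun_form(lang, n):
    """Pick the form of the noun that agrees with the count n."""
    return FORMS[lang][number_category(lang, n)]
```

An application that only appends a suffix would handle the English row and nothing else, which is why the rule is looked up
per language.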
Spell Checkers
Users who write documents in two or more languages require spell checkers for each of those languages. Users might
need to access multiple dictionaries at the same time, for example, Spanish and English. The dictionary should not depend on any
particular encoding. Spell checkers need to take into consideration lexical structures such as suffixes or prefixes. In English,
for example, a spell checker might remove the prefix un- from a word before looking it up. For information on issues that
affect other languages, see "Word Breaking and Hyphenation." The spell checker must be generic enough to handle these
issues.
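A minimal sketch of prefix stripping with several active dictionaries might look like the following. The word lists, the
PREFIXES table, and the function names are invented for illustration, and real spell checkers use far richer affix rules.
The functions operate on decoded text, so the lookup itself does not depend on any particular encoding.

```python
# Hypothetical tiny dictionaries and prefix lists, keyed by language.
DICTIONARIES = {
    "en": {"happy", "do", "write"},
    "es": {"feliz", "hacer"},
}
PREFIXES = {"en": ("un", "re")}


def is_known(word, lang):
    """Look the word up directly, then again with a known prefix removed."""
    word = word.lower()
    lexicon = DICTIONARIES[lang]
    if word in lexicon:
        return True
    for prefix in PREFIXES.get(lang, ()):
        if word.startswith(prefix) and word[len(prefix):] in lexicon:
            return True
    return False


def check(word, langs):
    """Accept the word if any of the active dictionaries knows it."""
    return any(is_known(word, lang) for lang in langs)
```

For example, check("unhappy", ["en"]) succeeds through prefix removal, while check("feliz", ["en", "es"]) succeeds
because the Spanish dictionary is active at the same time.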
Word and Character Count
Some languages do not mark word boundaries with spaces, and characters can be compound or multi-byte. Chinese, for example,
presents word count problems, because counts are usually made in characters rather than words. Counting words requires a
dictionary look-up mechanism.
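One common form of dictionary look-up is greedy longest-match segmentation, sketched below. This is an assumption about how
such a mechanism could work, not a method prescribed by this section; the three-entry LEXICON is hypothetical, and production
segmenters use much larger dictionaries and statistical models.

```python
# Hypothetical three-word dictionary for the example sentence.
LEXICON = {"我", "喜欢", "北京"}


def segment(text, lexicon):
    """Split text into words by always taking the longest dictionary match."""
    max_len = max(len(entry) for entry in lexicon)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted as a fallback word.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


sentence = "我喜欢北京"
print(len(sentence), "characters,", len(segment(sentence, LEXICON)), "words")
```

The same sentence yields a character count of 5 and a word count of 3, which is the discrepancy described above.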
Word Breaking and Hyphenation
In English, word boundaries are marked by white space. This is not true in other languages, such as Thai. While breaking
along word boundaries can be accomplished using dictionary look-up or white space delimiters, breaking in the middle of a word for
hyphenation is more difficult. For some languages, it might not be a serious problem if a word is broken in the wrong place. In
complex text layout (CTL) languages, however, it can change the way the word is rendered. Consider, for example, bi-directional
languages and multi-language texts. Users might not always work in a single language. In Chinese, hyphenation is not a problem,
because words can be broken along any character boundary.
Justification and Orientation
Some languages, such as Arabic and Hebrew, are bi-directional: numbers are read from left to right, and words are read
from right to left. Embedded text in a left-to-right language is read from left to right. Consider how text is justified in each
case. Chinese and Japanese are often written vertically, with parentheses and punctuation re-oriented. Embedded text in other
languages is also rotated in vertical writing.
Grammar Checkers
Grammar varies widely from language to language. In some languages, every word in a sentence has a contextual suffix; in
others, suffixes are never appended. When designing and coding a grammar checker, use the greatest common denominator. If you have
word class files, make them generic so that grammar checkers for other languages can use them. Build a parser that acts as a base
engine, with a plug-in module for each supported language.
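The base-engine-plus-plug-in design can be sketched as a small registry that dispatches to a per-language module. The class
name, the register and check methods, and the single English rule are hypothetical and only show the shape of the
architecture. Tokenization is left to each plug-in because splitting on white space is itself language-specific.

```python
class GrammarEngine:
    """Language-neutral base engine that delegates to per-language plug-ins."""

    def __init__(self):
        self._modules = {}

    def register(self, lang, module):
        """Install a plug-in: a callable that maps a sentence to a list of issues."""
        self._modules[lang] = module

    def check(self, lang, sentence):
        if lang not in self._modules:
            raise ValueError(f"no grammar module for language: {lang}")
        return self._modules[lang](sentence)


def english_module(sentence):
    """Toy English plug-in: flag 'a' used before a word starting with a vowel."""
    tokens = sentence.split()  # white-space tokenization is English-specific
    issues = []
    for prev, nxt in zip(tokens, tokens[1:]):
        if prev.lower() == "a" and nxt[0].lower() in "aeiou":
            issues.append(f"use 'an' before '{nxt}'")
    return issues


engine = GrammarEngine()
engine.register("en", english_module)
print(engine.check("en", "I ate a apple"))
```

Support for another language is added by registering another module; the engine itself does not change.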
Complex Text Layout (CTL)
CTL is a lexical issue. Text input can be based on sound, character shapes, or even character position. From
the input, a character output can be generated in the form of single characters, groups of characters, or words. As more input is
received, the output can change based on the new context. For more information, see section 4.3.3.3.
Input Method Framework
In the past, input method frameworks were considered a character issue. For Chinese, Japanese, and Korean, input methods
took several characters as input and produced a single character output. Now they are, in reality, a lexical issue; they can take
character input and produce multi-character or word output. They are responsible for managing CTL. For more
information, see section 4.3.3.2.
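The way an input method revises its output as more keystrokes arrive can be sketched with a toy conversion table. The
CANDIDATES mapping and the compose function are hypothetical; real input methods offer ranked candidate lists and user
selection rather than a single fixed result.

```python
# Hypothetical conversion table from Latin keystrokes to Chinese output.
CANDIDATES = {"ni": "你", "hao": "好", "nihao": "你好"}


def compose(keystrokes):
    """Convert the keystrokes typed so far, preferring the longest match."""
    output = []
    i = 0
    while i < len(keystrokes):
        for j in range(len(keystrokes), i, -1):
            if keystrokes[i:j] in CANDIDATES:
                output.append(CANDIDATES[keystrokes[i:j]])
                i = j
                break
        else:
            # No conversion yet: show the raw keystroke as pending input.
            output.append(keystrokes[i])
            i += 1
    return "".join(output)


# The displayed text changes as each new keystroke supplies more context.
for typed in ("n", "ni", "nih", "nihao"):
    print(typed, "->", compose(typed))
```

In this sketch, several input characters produce a single character ("ni") and, with more input, a whole word ("nihao").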
Bi-directional Text
Bi-directional processing, sometimes known as bi-di, is also a lexical issue. The system must know what kind of
morpheme it is dealing with, so that it can justify and orient it in the right direction. Consider a Hebrew string that
includes a price and product name in Latin script. The text begins on the right side of the screen with the Hebrew text and
continues right to left. Once the Latin text is encountered, the direction switches to left to right. For rendering, this means
displaying the first character of the Latin text, then shifting it to the left so that the next character can be displayed to its
right. This continues until more Hebrew text is encountered. So, using uppercase for Hebrew and lowercase and numbers for Latin, the
text "THIS IS HEBREW this is latin $10.95 MORE HEBREW." goes through the following stages:
...
WERBEH SI SIHT
t WERBEH SI SIHT
th WERBEH SI SIHT
...
.this is latin $10 WERBEH SI SIHT
this is latin $10.9 WERBEH SI SIHT
...
.WERBEH EROM this is latin $10.95 WERBEH SI SIHT
For the data stream, however, characters are in logical order:
Data comes in from this direction ===>
.WERBEH EROM 59.01$ nital si siht WERBEH SI SIHT
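The relationship between logical order and display order in this example can be reproduced with a heavily simplified
sketch. It follows the convention used above (uppercase stands for Hebrew, lowercase letters, digits, and "$" stand for
Latin) and is not the full Unicode bidirectional algorithm; the to_visual function and the LTR_RUN pattern are hypothetical.

```python
import re

# A left-to-right run starts and ends with a Latin-side character and may
# contain spaces and periods in between (simplified neutral handling).
LTR_RUN = re.compile(r"[a-z0-9$](?:[a-z0-9$ .]*[a-z0-9$])?")


def to_visual(logical):
    """Map logical-order text to display order for a right-to-left line."""
    # Step 1: lay out the whole line right to left.
    reversed_line = logical[::-1]
    # Step 2: restore each embedded left-to-right run to its reading order.
    return LTR_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)


text = "THIS IS HEBREW this is latin $10.95 MORE HEBREW."
for end in (14, 16, 33, len(text)):
    print(to_visual(text[:end]))
```

Under these assumptions, the intermediate prefixes print the same stages shown above, including the period that moves from
the left edge into the price once the next digit arrives.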
Command Line Interface
The command line accepts lexical and syntactic data as input and returns lexical and syntactic data.
Character Interface
A character interface, like the command line, takes character input and produces character output. For lexical issues,
the boundary between character and word is blurred. This can involve input method editors for languages such as Chinese, Japanese,
and Korean, and complex text layout (CTL) for languages such as Tamil and Thai. CTL is part of the morphological level as well as
the character level.
Graphical Interface
For graphical interfaces, lexical and grammatical data can be layered onto graphical objects, adding a layer of
complexity to lexical and grammatical handling.
Application Protocols
Protocols can include strings of words as part of the protocol stream or identify sentence data.
Storage and Interchange
Storage and interchange formats can encapsulate word or sentence data.
Application Programming Interfaces (APIs)
APIs can take lexical or syntactic data as parameters to calls and return lexical or syntactic data.
Requirements for Compliance
In general, providers must supply functions that can accommodate the lexical and syntactic manipulation needs of the
consumer. Consumers must use provider functions and manage the word and sentence structure so as to accommodate data in any of the
provider locales. This means that the consumer must supply the provider functions with required information for correctly processing
the data.
Command Line Interface
Providers must supply parsing functions for reading in and returning lexical and syntactic data to the command line in
any language they support.
Consumers must use provider-supplied lexical and syntactic functions, making sure to accommodate the lexical and
syntactic structures of other languages.
Character Interface
Providers must supply input method (word forming) functions for reading in and returning word and sentence data to the
character interface in any character set they support.
Consumers must use provider-supplied input method (word forming) functions, making sure to accommodate multi-byte
characters as well as single-byte characters for input and output.
Graphical Interface
Providers must supply lexical functions for managing character data with various elements of the graphical user
interface (GUI), such as buttons, drop-down lists, and title bars. These functions must accommodate all supported character sets.
Consumers must use provider functions for creating the GUI, supplying language data for proper handling.
Application Protocols
Providers must construct the protocol so as to accommodate the necessary lexical data in a specified format.
Consumers must implement the protocol with all related character information, including charset, language, and
locale.
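As one hedged illustration of a protocol that carries this information, a MIME message can label its text with a charset
parameter and a Content-Language header. The sketch below uses Python's standard email package; the build_message helper and
the sample values are assumptions, and other protocols carry the same information in their own fields.

```python
from email.message import EmailMessage


def build_message(text, charset, language):
    """Attach charset and language information to the text being sent."""
    msg = EmailMessage()
    # Records the charset in the Content-Type header of the message.
    msg.set_content(text, charset=charset)
    # A language tag such as "he-IL" identifies language and region.
    msg["Content-Language"] = language
    return msg


msg = build_message("שלום", "utf-8", "he-IL")
print(msg["Content-Type"])
print(msg["Content-Language"])
```

A receiver can then read the charset and the language tag before it decodes or renders the text.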
Storage and Interchange
Providers must allow for storage of any lexical and syntactic data, supplying formats that contain relevant information
for proper retrieval.
Consumers must include all relevant information in the storage and interchange formats so that lexical and syntactic
data in any charset can be properly retrieved.
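A storage format that meets this requirement keeps the charset and language next to the data. The following sketch uses a
JSON record with hypothetical field names (charset, language, data); it shows one possible layout rather than a
prescribed format.

```python
import json


def store(text, language, charset="utf-8"):
    """Serialize text together with the metadata needed to read it back."""
    record = {
        "charset": charset,
        "language": language,
        # Hex keeps the encoded bytes intact inside the JSON document.
        "data": text.encode(charset).hex(),
    }
    return json.dumps(record)


def retrieve(serialized):
    """Decode the stored bytes using the charset recorded with them."""
    record = json.loads(serialized)
    text = bytes.fromhex(record["data"]).decode(record["charset"])
    return text, record["language"]


saved = store("日本語", "ja", charset="shift_jis")
print(retrieve(saved))
```

Because the charset travels with the bytes, the reader does not have to guess the encoding when it retrieves the text.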
Application Programming Interfaces
Providers must supply interfaces that accommodate any lexical and syntactic data, where relevant.
Consumers must supply relevant lexical or semantic data descriptions to the API functions to properly process lexical or
syntactic data.