Introduction

The scripts of South and Southeast Asia have many structural similarities: most are phonetic, most are written from left to right, most use spaces or marks between phrases, and so on. Most of these scripts are derived from the ancient Brahmi script.

In India there are 15 officially recognized languages: Hindi, Marathi, Sanskrit, Punjabi, Gujarati, Oriya, Bengali, Assamese, Telugu, Kannada, Malayalam, Tamil, Urdu, Sindhi, and Kashmiri. Of these, Urdu, Sindhi, and Kashmiri are usually written in Perso-Arabic scripts, although they are sometimes written in Devanagari. Apart from the Perso-Arabic scripts, the remaining ten scripts have evolved from the ancient Brahmi script and share a common phonetic structure, which allows a common character set among them.

Hindi, named the official language of India's central government in 1949, is written in the Devanagari script. Devanagari is also used for writing Marathi and Sanskrit, and is the official script of Nepal.

Unicode (ISO/IEC 10646) covers most of the recognized scripts in India today. However, the standard requires further elucidation before it will be completely effective. The purpose of this document is to clarify important implementation issues pertaining to Indic scripts.

Commonalities in the Indian Language

The major scripts of India, including Devanagari, are encoded according to a common principle: comparable characters are in the same order and the same relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for scripts. This standard, which provides a common code and keyboard for Indian scripts, was introduced in 1983 by the Indian Department of Education. ISCII, whose layout reflects the common structure of the Brahmi-derived scripts, was adopted by the Bureau of Indian Standards (BIS) in 1991.
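The "same relative location" property can be illustrated with the Unicode code charts, where each of the nine Brahmi-derived Indian script blocks occupies 128 code points in a parallel layout. The following is a minimal sketch in Python, not a real transliteration scheme: the function name `shift_script` and the table `BLOCK_START` are hypothetical names introduced here, and the approach ignores the linguistic adjustments a production transliterator would need.

```python
import unicodedata

# Start of each 128-code-point Unicode block (from the Unicode code charts).
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def shift_script(text, source, target):
    """Move each character of `source` script to the same slot in `target`.

    Characters outside the source block, or whose slot is unassigned in the
    target block, are left unchanged.
    """
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            candidate = chr(cp - src + dst)
            # unicodedata.name() returns the default (None) for unassigned slots.
            out.append(candidate if unicodedata.name(candidate, None) else ch)
        else:
            out.append(ch)
    return "".join(out)


# KA, MA, LA in Devanagari map to KA, MA, LA in Telugu.
telugu = shift_script("\u0915\u092E\u0932", "devanagari", "telugu")
```

Not every slot is filled in every script. Tamil, for example, has no letter at the position of Devanagari KHA, so the sketch leaves that character untouched rather than producing an unassigned code point.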
Unlike Unicode, ISCII is an 8-bit encoding that uses escape sequences to announce the particular Indic script represented by a coded character sequence. While retaining the ASCII character set in the lower half, ISCII provides the Indian script character set in the upper 96 characters. The Indian script keyboard overlay is designed for the standard English QWERTY overlay and ensures that English text can co-exist with Indian scripts. This approach also makes it possible to use Indian scripts with existing English hardware and software, so long as 8-bit character codes are allowed. For a more thorough description of these character encoding standards,
see the report from the Center for Development of Advanced Computing at
There are several advantages in having a common code and keyboard for all the Indian scripts. Software that allows ISCII codes to be used in Indian scripts is more commercially viable. Furthermore, immediate transliteration between different Indian scripts is possible just by changing the display modes. Simultaneous availability of multiple Indian languages in the computer medium will accelerate their development and facilitate national integration. The ISCII standard can be found in Bureau of Indian Standards document No. IS:13194-1991.

Indic Script Implementation Support: Challenges and Considerations

Overall Support Challenges

Language Diversity

Ten major Indian languages and writing scripts are supported by Unicode: Hindi, Marathi, Gujarati, Tamil, Telugu, Kannada, Punjabi, Bengali, Oriya, and Malayalam. In addition, there are numerous dialects or minor languages that share scripts with these major languages, but differ in character combination as well as in display and sorting.

Lack of Presentation Standards

The lack of presentation standards creates considerable difficulty in pre-processing scripts. As with most complex scripts, reordering and contextual shaping depend considerably on the availability of standardized local-language glyph sets (as with Arabic and Thai presentation) and character clustering rules. Unfortunately, such standards are not available for Indic scripts: they either do not exist or are still evolving. Even the KGP (Karnataka Ganaka Parishad) standard is applicable only to the Kannada and Malayalam scripts. In other scripts, such as Tamil, there are competing standards which need to be supported.

A detailed description of the KGP standards, including instructions on downloading software, is available at: http://bangaloreit.com/html/education/Nudi.html

Unicode Standard Issues

The current Unicode standard for Indian languages is based on the Indian Standard Code for Information Interchange (ISCII-1988).
Thus, in any given Indic script, the consonant and vowel letter codes of Unicode are based on ISCII-1988 and correspond to it directly.

There are some differences between Unicode and ISCII. Unicode is a multilingual encoding that requires no escape sequences or switching between scripts. In contexts that require format controls (such as the ISCII invisible (INV) operator, isolated vowel matras, or explicit virama), the spelling mechanisms of Unicode and ISCII differ slightly but are easily converted from one to the other without loss of information. Some of the other similarities and differences between Unicode and ISCII are described in the Indian Languages FAQ section of the unicode.org web site at http://www.unicode.org/faq/indic_old.html. Topics include:
Other Implementation Considerations

The following sections provide examples of some of the inconsistencies in character support between ISCII and Unicode.

Use of ZWJ and ZWNJ

Format controls (such as the ISCII "INV" operator, isolated vowel matras, or explicit virama) and spelling mechanisms of Unicode and ISCII differ slightly, but are easily converted without data loss (see http://www.unicode.org/faq/indic_old.html). For example, the explicit Halant uses:
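On the Unicode side, the role of the two joiners can be sketched with Devanagari KA and SSA. This is a minimal Python illustration of the sequences defined by the Unicode Standard, not a reproduction of the original list. The helper names `strip_joiners` and `describe` are hypothetical, and the ISCII equivalents are not shown.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
SSA = "\u0937"     # DEVANAGARI LETTER SSA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

# Default spelling: the renderer may form the conjunct ligature.
conjunct = KA + VIRAMA + SSA
# ZWNJ after the virama requests a visible (explicit) halant.
explicit_virama = KA + VIRAMA + ZWNJ + SSA
# ZWJ after the virama requests the half form of the first consonant.
half_form = KA + VIRAMA + ZWJ + SSA


def strip_joiners(text):
    """Remove ZWJ/ZWNJ so that variant spellings compare equal."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


def describe(text):
    """Return the Unicode character names of a string, one per code point."""
    return [unicodedata.name(ch) for ch in text]
```

All three spellings carry the same letters and differ only in the requested presentation, which is why a search routine may want to compare strings after `strip_joiners`.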
Nukta Consonants

Nukta consonants exist in Unicode, but not in ISCII. Most Indian language developers consider them an unnecessary addition to an already complex script.

The nukta consonants are as follows:
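As an illustration for Devanagari only, the precomposed nukta letters U+0958 through U+095F can be listed from Python's Unicode database together with their canonical decompositions. This is a sketch: the function name `precomposed_nukta_letters` is hypothetical, and the original list may also have covered other scripts.

```python
import unicodedata

NUKTA = "\u093C"  # DEVANAGARI SIGN NUKTA


def precomposed_nukta_letters(start=0x0958, end=0x095F):
    """Return (character, name, base + nukta sequence) for each letter."""
    rows = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        # decomposition() returns e.g. "0915 093C": base consonant + nukta.
        parts = unicodedata.decomposition(ch).split()
        sequence = "".join(chr(int(p, 16)) for p in parts)
        rows.append((ch, unicodedata.name(ch), sequence))
    return rows


for ch, name, sequence in precomposed_nukta_letters():
    codes = " + ".join(f"U+{ord(c):04X}" for c in sequence)
    print(f"U+{ord(ch):04X} {name} = {codes}")
```

These eight letters are excluded from canonical composition, so normalizing them (even to NFC) yields the two-character base + nukta spelling. Software therefore has to treat both spellings as the same letter.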
Decomposed Vowels

As with nukta consonants, decomposed vowels add another level of complexity to a script. For example, a two-part vowel in the Unicode standard, such as the Tamil 'vowel sign O', can also be written as 'vowel sign E' + 'vowel sign AA'. Though the resultant output looks identical, it requires additional logic in collation, search and replace, and so on. Also, as mentioned earlier, scripts are shared across different languages. These additional characteristics and considerations add to the complexity of processing Indic scripts.

Processing and Presentation Considerations

Canonical Representation

Because Indic scripts use a large number of combining characters, the development of a unique data representation standard will help simplify searching and sorting operations.

In addition to the differences described in previous sections, the inclusion of the topics described below will improve the implementation of Indic scripts considerably.

Rendering Examples

Currently, only Hindi and (to some extent) Tamil rendering examples are provided in a standardized fashion. However, South Indian languages such as Telugu and Kannada have additional issues related to 'Vattus' which need to be illustrated.

Sorting

Unicode code-point order is admittedly not intended to provide culturally acceptable sorting. However, sorting is frequently a source of confusion. Providing a default collating order for each script would be helpful in clarifying this development issue.

Text Processing Issues

There are many factors to consider in selecting codes to represent letters and other written shapes in the writing system for a language. It is not surprising, then, to see more than one script in use for some of the Indian languages. This is possible because the script reflects the sounds of the individual aksharas, and thus the same phonetic information may be written in different scripts as long as there is a well-defined way to write the aksharas in each script.
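The effect of the two-part Tamil vowel on searching, and the benefit of a single canonical representation, can be sketched with Unicode normalization. This is a minimal Python illustration that uses NFC as the canonical form, which is one possible choice and an assumption here. The helper names `canonical` and `contains` are hypothetical.

```python
import unicodedata

TAMIL_KA = "\u0B95"  # TAMIL LETTER KA
SIGN_E = "\u0BC6"    # TAMIL VOWEL SIGN E
SIGN_AA = "\u0BBE"   # TAMIL VOWEL SIGN AA
SIGN_O = "\u0BCA"    # TAMIL VOWEL SIGN O (canonically E + AA)

composed = TAMIL_KA + SIGN_O             # "ko" with the single vowel sign
decomposed = TAMIL_KA + SIGN_E + SIGN_AA  # "ko" spelled with two signs


def canonical(text):
    """Bring text to one canonical spelling (NFC chosen for this sketch)."""
    return unicodedata.normalize("NFC", text)


def contains(haystack, needle):
    """Substring search that ignores composed/decomposed spelling."""
    return canonical(needle) in canonical(haystack)


# A raw comparison treats the two spellings as different strings,
# while the normalized comparison treats them as equal.
raw_equal = composed == decomposed
normalized_equal = canonical(composed) == canonical(decomposed)
```

Normalizing both the stored text and the search key before comparison keeps the extra logic in one place instead of spreading it through collation and search-and-replace code.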
The 11 or so scripts in use in India do carry phonetic information in a fairly uniform manner across the languages. Also, both Unicode and ISCII have limitations for standardized screen presentation and printing of Indic scripts.

Other text presentation issues include:
Indic Script Support in the Solaris Operating System

This section outlines the approach adopted for Indic script support in Solaris for select implementation areas.

Implementing a standard requires examples of its correct use with respect to canonical representation, searching and sorting, and data interoperability. Such examples reduce the ad hoc implementations common today, namely the use of non-standard fonts and non-standard encodings ("x-user-defined"). Supporting a new language, region, or script in an application or platform typically involves adding or enhancing the following categories of components:
Input Methods

A variety of intelligent input methods is necessary to ease data entry in Indic scripts. The complexities of Indic presentations should be hidden from the user. There are a number of possible approaches for input methods.
Output/Display

The presentation of Indian language scripts requires contextual processing for display and editing. This output technology is called Complex Text Layout (CTL). The CTL properties of Indian language scripts include:
Because of these problems, there are numerous technical considerations for Indic script rendering. The recommended implementation approach is to use a combination of CTL APIs and intelligent font technology. Some current APIs, similar to ATSUI, are Uniscribe, ICU + FT2, and Pango + FT2.

For Indic script support in Solaris, Sun chose to follow a two-phased approach. In the first phase, Solaris 9 Indic text output builds on the CTL non-intelligent font support currently in Solaris CDE/Motif. (CTL is an implementation of the X/Open Portable Layout Services API.) A font encoding that covers eight Indian languages using Unicode and the Private Use Area (PUA) is in development.

The following illustrations provide examples of Indic script processing. CTL properties of Indian language scripts include:
Text Operations

Indian languages have their own semantics for dealing with text editing. Some text processing features which need to adapt to the new semantics are caret handling and selection:
In this example:
Locale Data

The locale data for Indic scripts is handled in the same way as other Solaris locale data; the POSIX implementation is standard for these scripts. The implementation of locale data is detailed in the International Language Environments Guide, available on docs.sun.com.

Data Exchange

Solaris 9 supports Unicode conversions to and from ISCII-1991 and PC-ISCII with the following exceptions:
Chillikaksharams used in Malayalam. These are currently missing from Unicode.

Conclusion

Sun's Commitment to Indian Language Support

Sun is one of the first companies to use the PLS APIs, developed by X/Open (The Open Group) more than eight years ago, for its CTL script implementation in Solaris. Support was initially provided for Thai, Arabic, and Hebrew, and recently for Indic scripts in Solaris 9.

More recently, with broader international support for open source initiatives (Pango, ICU, and so on), new APIs have been developed for CTL, including OpenType technology. Sun is actively participating in some of these open source initiatives and is leading the Indian language support for Mozilla.

Other Major Areas of Development

During the past eight years, Sun has developed a number of initiatives that provide Indic script support in multiple environments.

Hindi support was added to CDE/Motif in the latest release of Solaris 9, which also includes another seven Indian scripts. Currently only Sun supports an Indian script (that is, a CTL script) in CDE/Motif. Sun is also in the process of transitioning from CDE to GNOME as the default desktop. Other areas include the addition of Sun's Universal Multiscript Layout Engine (UMLE/LE) to the PLS APIs, and Sun's development and open sourcing of a rich text and typography framework (STSF Framework) that supports complex text layout scripts such as Indic scripts.