Challenges in Supporting Indic Scripts

 

Introduction 

The scripts of South and Southeast Asia have many structural similarities: most are phonetic, most are written from left to right, most use spaces or marks between phrases, and so on. Most of these scripts are derived from the ancient Brahmi script.

In India there are 15 officially recognized languages: Hindi, Marathi, Sanskrit, Punjabi, Gujarati, Oriya, Bengali, Assamese, Telugu, Kannada, Malayalam, Tamil, Urdu, Sindhi, and Kashmiri.

Of these, Urdu, Sindhi, and Kashmiri are usually written in Perso-Arabic scripts, although they are sometimes written in Devanagari. The remaining languages are written in ten scripts that evolved from the ancient Brahmi script and share a common phonetic structure, which makes a common character set for these scripts possible.

Hindi, named the official language used by India's central government in 1949, is written in the Devanagari script. Devanagari is also used for writing Marathi and Sanskrit, and for Nepali, the official language of Nepal.

Unicode (ISO/IEC 10646) covers most of the recognized scripts in India today. However, the standard requires further elucidation before it will be completely effective. The purpose of this document is to clarify important implementation issues pertaining to Indic scripts.


Commonalities Among the Indian Scripts

The major scripts of India, including Devanagari, are encoded in Unicode according to a common plan: comparable characters are in the same order and the same relative location in each script's code range. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian Script Code for Information Interchange (ISCII), the Indian national standard encoding. ISCII, which provides a common code and keyboard for Indian scripts, was introduced in 1983 by the Indian Department of Electronics and was adopted by the Bureau of Indian Standards (BIS) in 1991. A common code is possible because the scripts it covers all derive from the Brahmi script. Unlike Unicode, ISCII is an 8-bit encoding that uses escape sequences to announce the particular Indic script represented by a coded character sequence.
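The parallel layout can be illustrated with a small sketch. It assumes the Unicode block starting points listed in the table below and simply shifts each character by the distance between two blocks. The function name `transliterate` is hypothetical, and the approach is only a rough approximation: a character with no assigned counterpart in the target script is left unchanged, which is one reason the layout facilitates transliteration only "to some degree."

```python
import unicodedata

# First code point of each 128-position Unicode block for the major Indic scripts.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def transliterate(text, source, target):
    """Shift characters from the source block to the same offset in the target block.

    Illustration only: characters outside the source block, or whose
    shifted position is unassigned in the target block, are kept as they are.
    """
    src = BLOCK_START[source]
    dst = BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            candidate = chr(cp - src + dst)
            # unicodedata.name returns the default ("") for unassigned code points
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    word = "\u0915\u092E\u0932"  # Devanagari KA, MA, LA
    print(transliterate(word, "devanagari", "bengali"))
```

Correct transliteration also has to account for sounds and conventions that differ between scripts, which a fixed offset cannot capture.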

While retaining the ASCII character set in the lower half, ISCII provides the Indian script character set in the upper 96 characters. The Indian script keyboard overlay is designed for the standard English QWERTY overlay and ensures that English text can co-exist with Indian scripts. This approach also makes it possible to use Indian scripts with existing English hardware and software, so long as 8-bit character codes are allowed.

For a more thorough description of these character encoding standards, see the report from the Center for Development of Advanced Computing at
http://www.cicc.or.jp/english/hyoujyunka/mlit4/7-3India/India.htm.

There are several advantages to having a common code and keyboard for all the Indian scripts. Software written for ISCII codes works with any of the Indian scripts, which makes it more commercially viable. Furthermore, text can be transliterated between Indian scripts immediately, just by changing the display mode. Simultaneous availability of multiple Indian languages on computers will accelerate their development and facilitate national integration.

The ISCII standard is published by the Bureau of Indian Standards as document IS 13194:1991.


Indic Script Implementation Support: Challenges and Considerations 

Overall Support Challenges

Language Diversity
Unicode supports the scripts of ten major Indian languages: Hindi, Marathi, Gujarati, Tamil, Telugu, Kannada, Punjabi, Bengali, Oriya, and Malayalam. In addition, numerous dialects and minor languages share scripts with these major languages but differ in character combinations, display, and sorting.
Lack of Presentation Standards
The lack of presentation standards creates considerable difficulty in pre-processing scripts. As with most complex scripts, reordering and contextual shaping depend considerably on the availability of standardized local-language glyph sets (as with Arabic and Thai presentation) and character clustering rules. Unfortunately, such standards are not available for Indic scripts: they either do not exist or are still evolving. Even the KGP (Karnataka Ganaka Parishad) standard is applicable only to Kannada and Malayalam scripts. For other scripts, such as Tamil, there are competing standards that need to be supported.

A detailed description of the KGP standards, including instructions on downloading software, is available at:

http://bangaloreit.com/html/education/Nudi.html

Unicode Standard Issues
The current Unicode standard for Indian languages is based on the Indian Script Code for Information Interchange (ISCII-1988). Thus, in any given Indic script, the consonant and vowel letter codes of Unicode are based on ISCII-1988 and correspond to it directly.

There are some differences between Unicode and ISCII. Unicode is a multilingual encoding that requires no escape sequences or switching between scripts. In contexts that require format controls (such as the ISCII invisible (INV) operator, isolated vowel matras, or explicit virama), the spelling mechanisms of Unicode and ISCII differ slightly but can be converted from one to the other without loss of information.

Some of the other similarities and differences between Unicode and ISCII are described in the Indian Languages FAQ section of the unicode.org web site at http://www.unicode.org/faq/indic_old.html. Topics include:

  • Differences between fonts 
  • How do the Indic scripts work in Unicode? 
  • Does Unicode cover Vedic accents? 
  • Invisible letters (INV)
To read more about ISCII and Unicode development and issues with different Indic scripts, see the article at http://acharya.iitm.ac.in/multi_sys/uni_iscii.html.


Other Implementation Considerations 

The following sections provide examples of some of the inconsistencies in character support between ISCII and Unicode.
Use of ZWJ and ZWNJ
Format controls (such as the ISCII "INV" operator, isolated vowel matras, or explicit virama) are spelled slightly differently in Unicode and ISCII, but the two spellings are easily converted without data loss (see http://www.unicode.org/faq/indic_old.html). For example, the explicit Halant (a visible virama) and the soft Halant (a half-form consonant) are spelled as follows:

Explicit Halant
ISCII: Halant + Halant
Unicode: Halant + ZWNJ
Soft Halant
ISCII: Halant + Nukta
Unicode: Halant + ZWJ
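As a minimal sketch of the Unicode side of this mapping, the following builds the three spellings of a Devanagari consonant pair described in the Unicode Standard: the plain conjunct, the visible (explicit) virama written with ZWNJ, and the half-form request written with ZWJ. The helper names are hypothetical and the example covers Devanagari only.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def conjunct(first, second):
    """Default spelling: the renderer may form a conjunct ligature."""
    return first + VIRAMA + second


def explicit_halant(first, second):
    """Virama + ZWNJ: asks the renderer to show the virama visibly."""
    return first + VIRAMA + ZWNJ + second


def half_form(first, second):
    """Virama + ZWJ: asks the renderer for the half form of the first consonant."""
    return first + VIRAMA + ZWJ + second


def strip_joiners(text):
    """Remove the two format controls, for example before searching or comparing."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")


if __name__ == "__main__":
    KA, SSA = "\u0915", "\u0937"
    for spelling in (conjunct(KA, SSA), explicit_halant(KA, SSA), half_form(KA, SSA)):
        print([f"U+{ord(ch):04X}" for ch in spelling])
```

The three strings differ only in presentation, so search and comparison code may want to ignore the joiners, as `strip_joiners` does here.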
Nukta Consonants
Precomposed nukta consonants exist in Unicode, but not in ISCII, where they are written as a consonant followed by the nukta. Most Indian language developers consider them an unnecessary addition to an already complex script.

The nukta consonants are as follows:

[Figure: table of nukta consonants]

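To see how one precomposed nukta consonant relates to its two-character spelling, Python's standard `unicodedata` module can be used, as sketched below. DEVANAGARI LETTER QA (U+0958) is chosen as a representative example. The helper `code_points` is hypothetical.

```python
import unicodedata


def code_points(text):
    """Format a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


QA = "\u0958"  # DEVANAGARI LETTER QA, a precomposed nukta consonant

# Canonical decomposition: KA (U+0915) followed by NUKTA (U+093C).
decomposed = unicodedata.normalize("NFD", QA)

# U+0958 is excluded from canonical composition, so NFC does not
# recombine the pair into the precomposed letter.
recomposed = unicodedata.normalize("NFC", decomposed)

print(code_points(QA))          # ['U+0958']
print(code_points(decomposed))  # ['U+0915', 'U+093C']
print(code_points(recomposed))  # ['U+0915', 'U+093C']
```

Because normalization always yields the consonant plus nukta spelling, software that normalizes its input never has to handle the precomposed letters separately.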
Decomposed Vowels
As with nukta consonants, decomposed vowels add another level of complexity to a script. For example, the Unicode two-part vowel Tamil 'vowel sign O' can also be written as 'vowel sign E' followed by 'vowel sign AA'. Though the resulting output looks identical, the alternative spelling requires additional logic in collation, search and replace, and so on. Also, as mentioned earlier, scripts are shared across different languages. These additional characteristics and considerations add to the complexity of processing Indic scripts.
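The extra logic can be kept small by normalizing text before comparing it. The sketch below assumes that both spellings of a Tamil syllable should be treated as the same text and uses Unicode normalization form NFC from Python's standard library. The function names `canonically_equal` and `find` are hypothetical.

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def find(haystack, needle):
    """Search on normalized text; the index refers to the normalized haystack."""
    return unicodedata.normalize("NFC", haystack).find(
        unicodedata.normalize("NFC", needle)
    )


if __name__ == "__main__":
    KA = "\u0B95"                 # TAMIL LETTER KA
    ko_single = KA + "\u0BCA"     # KA + VOWEL SIGN O
    ko_two_part = KA + "\u0BC6\u0BBE"  # KA + VOWEL SIGN E + VOWEL SIGN AA

    print(ko_single == ko_two_part)                  # False
    print(canonically_equal(ko_single, ko_two_part)) # True
    print(find("a" + ko_two_part, ko_single))        # 1
```

A plain string comparison treats the two spellings as different even though they render identically, which is the source of the search and collation problems described above.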


Processing and Presentation Considerations

Canonical Representation
Because Indic scripts use a large number of combining characters, a single canonical data representation would simplify searching and sorting operations.

In addition to resolving the differences described in previous sections, addressing the topics described below would considerably improve implementations of Indic scripts.

Rendering Examples
Currently only Hindi and Tamil rendering examples (to some extent) are provided in a standardized fashion. However, South Indian languages such as Telugu and Kannada have additional issues related to 'Vattus' which need to be illustrated.
Sorting
Unicode code-point order is admittedly not intended to provide a culturally acceptable sort order. However, sorting is frequently a source of confusion. Providing a default collating order for each script would help clarify this development issue.
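The following sketch shows one way code-point order can surprise users. It sorts three Devanagari letters first by raw code point and then by a key that decomposes each string (NFD), so that a precomposed nukta consonant sorts next to its base consonant. This key is an assumption made for illustration and is not a culturally correct collation; a production system would use a tailored collation such as one based on the Unicode Collation Algorithm.

```python
import unicodedata


def decomposed_key(text):
    """Sort key that compares strings by their canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", text)


KA = "\u0915"   # DEVANAGARI LETTER KA
KHA = "\u0916"  # DEVANAGARI LETTER KHA
QA = "\u0958"   # DEVANAGARI LETTER QA (KA + NUKTA, precomposed)

letters = [QA, KHA, KA]

# Raw code-point order places QA after every basic consonant.
by_code_point = sorted(letters)

# With the decomposed key, QA sorts directly after KA and before KHA.
by_decomposition = sorted(letters, key=decomposed_key)

print([f"U+{ord(ch):04X}" for ch in by_code_point])     # ['U+0915', 'U+0916', 'U+0958']
print([f"U+{ord(ch):04X}" for ch in by_decomposition])  # ['U+0915', 'U+0958', 'U+0916']
```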
Text Processing Issues
There are many factors to consider in selecting codes to represent letters and other written shapes in the writing system for a language. It is not surprising, then, to see more than one script in use for some of the Indian languages. This is possible because the script reflects the sounds of the individual aksharas, and thus the same phonetic information may be written in different scripts as long as there is a well-defined way to write the aksharas in each script. The 11 or so scripts in use in India do carry phonetic information in a fairly uniform manner across the languages. Also, both Unicode and ISCII have limitations for standardized screen presentation and printing of Indic scripts. 

Other text presentation issues include:

  • Variable length (multibyte) representations 
  • Sorting order 
  • Codes for aksharas as opposed to shapes
  • Data preparation versus linguistic processing 
  • Transliteration across scripts
These issues are discussed in detail in a report at: http://acharya.iitm.ac.in/multi_sys/uni_iscii.html.


Indic Script Support in the Solaris Operating System

This section outlines the approach adopted for Indic script support in Solaris for select implementation areas.

Implementing a standard well requires examples of its correct use with respect to canonical representation, searching and sorting, and data interoperability. Such examples reduce the ad hoc implementations common today, namely the use of non-standard fonts and non-standard encodings ("x-user-defined").

Supporting a new language, region, or script in an application or platform typically involves adding or enhancing the following categories of components:

  • Input methods 
  • Output (display and printing) 
  • Language- and region-specific components: character classifications, calendaring, sort orders/collation, transliteration, and indexing 
  • Data exchange and interoperability


Input Methods 

A variety of intelligent input methods is necessary to ease data entry in Indic scripts. The complexities of Indic presentation should be hidden from the user. Possible approaches to input methods include:
  • Multiple keyboard overlays 
  • Transliteration-based input 
  • Dictionary-based input 
  • Voice recognition
The Indic input method in Solaris 9 currently uses Compose key sequences. The input method supports 'INSCRIPT,' a popular native Indian keyboard overlay standard. Future enhancements could be made to handle multiple overlays (INSCRIPT + One or more overlays) that are phonetic- and transliteration-based. 


Output/Display 

The presentation of Indian language scripts requires contextual processing for display and editing. This output technology is called Complex Text Layout (CTL). The CTL properties of Indian language scripts include:
  • Context sensitivity 
  • Combining characters 
  • Script reordering
In addition, supporting Indian languages poses other special challenges, such as the lack of font standardization and the lack of a uniform typographic framework across Windows and UNIX (including UNIX variants), the two main platforms on which Indian desktop applications run.

Because of these problems, there are numerous technical considerations for Indic script rendering. The recommended implementation approach is to use a combination of CTL APIs and intelligent font technology. Current APIs of this kind, similar to ATSUI, include Uniscribe, ICU + FT2, and Pango + FT2. For Indic script support in Solaris, Sun chose to follow a two-phased approach.

In the first phase, Solaris 9 Indic text output builds on the CTL non-intelligent font support currently in Solaris CDE/Motif. (CTL is an implementation of the X/Open Portable Layout Services API.) A font encoding that covers eight Indian languages using Unicode and the Private Use Area (PUA) is in development.

[Figures: examples of Indic script processing that illustrate context sensitivity, combining characters, and script reordering (for example, Matra reordering before consonant cluster)]
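Matra reordering can be sketched in a few lines. The example below assumes a single Devanagari cluster stored in logical (typed) order and moves DEVANAGARI VOWEL SIGN I, which is written to the left of the consonants it follows, to the front of the cluster. The function name `visual_order` and the one-entry matra set are hypothetical simplifications; a real layout engine works on glyphs and applies many more rules.

```python
# Vowel signs that are displayed before the consonant cluster they follow.
# Only DEVANAGARI VOWEL SIGN I is listed in this simplified sketch.
PRE_BASE_MATRAS = {"\u093F"}


def visual_order(cluster):
    """Return the characters of one cluster in left-to-right display order."""
    pre_base = [ch for ch in cluster if ch in PRE_BASE_MATRAS]
    rest = [ch for ch in cluster if ch not in PRE_BASE_MATRAS]
    return "".join(pre_base + rest)


if __name__ == "__main__":
    # Logical order: SA, VIRAMA, THA, VOWEL SIGN I
    logical = "\u0938\u094D\u0925\u093F"
    displayed = visual_order(logical)
    print([f"U+{ord(ch):04X}" for ch in displayed])
```

The stored text stays in logical order; only the display order changes, which is why editing operations must map between the two.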



Text Operations 

Indian languages have their own semantics for text editing. Text processing features that need to adapt to these semantics include caret handling, text breaking, and selection:
  • Caret/Cursor placement and movement
      • <Left> and <Right> arrow keys skip over clusters
      • <Delete>/Del key deletes the entire cluster
      • <BKSP>/Backspace deletes the cluster character by character
  • Line, word, and sentence breaking
      • Character breaking by cluster
      • Line break by 'virama'
      • Word break by space
  • Selection
      • Arrows and mouse need to select entire clusters
      • Left mouse click needs to snap to the nearest cluster boundary
  • Cut, copy, and paste
A sample text edit operation is shown below:

[Figure: sample text edit operation on a display unit made up of three codepoints]


 
 
 

In this example:

  • Left and right arrows traverse the entire display unit (all three codepoints)
  • Backspace deletes character-by-character (you must press Backspace three times to delete the display unit shown on the right-hand side)
  • Pressing Delete deletes the entire cluster (all three characters forming the display unit)
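The editing behavior described above can be sketched as follows. The cluster rule is a deliberate simplification for Devanagari (combining marks attach to the preceding character, and a letter that follows a virama joins the same cluster), and all function names are hypothetical. Caret positions are counted in code points.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def clusters(text):
    """Split text into display clusters using a simplified Devanagari rule."""
    result = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc")
        after_virama = bool(result) and result[-1].endswith(VIRAMA) and category == "Lo"
        if result and (is_mark or after_virama):
            result[-1] += ch
        else:
            result.append(ch)
    return result


def move_right(text, caret):
    """Right arrow: jump to the end of the cluster that contains the caret."""
    position = 0
    for cluster in clusters(text):
        position += len(cluster)
        if position > caret:
            return position
    return caret


def delete_forward(text, caret):
    """Delete key: remove the whole cluster that starts at the caret."""
    position = 0
    for cluster in clusters(text):
        if position == caret:
            return text[:position] + text[position + len(cluster):]
        position += len(cluster)
    return text


def backspace(text, caret):
    """Backspace key: remove a single code point before the caret."""
    if caret == 0:
        return text, 0
    return text[:caret - 1] + text[caret:], caret - 1


if __name__ == "__main__":
    text = "\u0915\u094D\u0937\u0924"  # KA, VIRAMA, SSA (one cluster) + TA
    print(len(clusters(text)))         # 2
    print(move_right(text, 0))         # 3
    print(len(delete_forward(text, 0)))  # 1
```

With this text, one press of Delete at the start removes all three code points of the first cluster, while Backspace at caret position 3 must be pressed three times to remove the same cluster.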


Locale Data 

The locale data for Indic scripts follows the same format as other Solaris locale data. The standard POSIX locale implementation is used for these scripts.

The implementation of locale data is detailed in the International Language Environments Guide, available on docs.sun.com.


Data Exchange 

Solaris 9 supports Unicode conversions to and from ISCII-1991 and PC-ISCII, with the following exceptions:
  • No distinction is made between Bengali and Assamese.
  • ISCII font and style attributes are not converted.
  • Chillaksharams used in Malayalam are not converted. These are currently missing from the Unicode code charts.


Conclusion 

Sun's Commitment to Indian Language Support

Sun is one of the first companies to use the Portable Layout Services (PLS) APIs, developed by The Open Group (X/Open) more than eight years ago, for its CTL script implementation in Solaris. Support was initially provided for Thai, Arabic, and Hebrew, and recently for Indic scripts in Solaris 9.

More recently, with broader international support for open source initiatives (Pango, ICU, and so on), new APIs have been developed for CTL, including OpenType technology.

Sun is actively participating in some of these open source initiatives and is leading the Indian language support for Mozilla.


Other Major Areas of Development  

During the past eight years, Sun has developed a number of initiatives that provide Indic script support in multiple environments.

Hindi support was added to CDE/Motif in the latest release of Solaris 9, which also includes another seven Indian scripts. Currently, Sun is the only vendor that supports an Indian script (that is, a CTL script) in CDE/Motif. Sun is also in the process of transitioning from CDE to GNOME as the default desktop. Other areas include the addition of Sun's Universal Multiscript Layout Engine (UMLE/LE) to the PLS APIs, and Sun's development and open sourcing of a rich text and typography framework (the STSF Framework) that supports complex text layout scripts such as Indic scripts.
 
