 
Software Globalization
 

Contents

  • Introduction
  • The Challenge
  • The Solution
  • StarOffice Internationalization Framework Architecture
  • Conclusion
  • Further Information
  • References


    Introduction

    StarOffice from Sun is a cross-platform office productivity suite, which includes the following applications:

    • StarOffice Writer - A word processing application.

    • StarOffice Calc - A spreadsheet application that enables you to analyze data and perform calculations.

    • StarOffice Impress - A tool for creating presentations.

    • StarOffice Draw - A vector-oriented drawing tool for creating and editing images.

    All of these tools are built on a robust graphics framework and can be customized using the StarOffice Basic language, which enables developers to create a wide range of office applications.




    The Challenge

    StarOffice developers at Sun faced the following challenges:
    • Supporting global users in multilingual locales

      In today's global marketplace, applications like StarOffice must be able to run on a variety of platforms while enabling users to input, process, and display data in several languages. These applications should support not only Western and Eastern European locales, but also Chinese, Japanese, and Korean (CJK) locales and complex text layout (CTL) locales, such as Arabic, Thai, and Hebrew.


    • Replacing the single-byte framework with a multilingual Unicode-based framework

      StarOffice 5.2 is a single-byte application that only supports Western and Eastern European languages. The internationalization framework in StarOffice 5.2 is based on an International class with tables for specific locales. Currency symbols, date separators, and decimal and digit-grouping separators are obtained by calls to methods like GetNumDecimalSep(). The International class also provides methods for character classification and case handling, for example, ToUpper(), ToLower(), and the case-insensitive StringCompare(). Since locale data is linked into the binary at compilation, the product must be recompiled for any locale data modification.


    • Inconsistent support across multiple platforms

      StarOffice runs on the following platforms:

      • Sun's SPARC systems running Solaris
      • Intel systems running Solaris
      • Linux
      • Microsoft Windows 95, Windows 98, Windows 2000, Windows Me, and Windows NT


      The internationalization support on these platforms, however, is inconsistent and StarOffice cannot rely on it directly. Furthermore, StarOffice requires far more complex internationalization APIs than those available in the platform internationalization framework.




    The Solution

    This section describes how the new StarOffice internationalization framework addresses these challenges.

    How to Support Global Users

    To provide support for global users in multilingual language environments, StarOffice developers have built a universal internationalization framework. The StarOffice internationalization framework provides a rich set of APIs to internationalize StarOffice applications using the Universal Network Objects (UNO) component model. UNO is an interface-based object model, like COM or CORBA, that is used to integrate all StarOffice components. UNO is designed to be as efficient as COM while offering additional features.

    You can now modify a locale behavior or add a new locale without modifying or recompiling the source code. The StarOffice internationalization framework is platform-independent and universally accessible to any CORBA or COM components irrespective of their programming language through the UNO remote bridges for CORBA and OLE. The framework is also reusable outside StarOffice, for example, in GNOME.

    How to Replace the Single-Byte Framework with a Multilingual Unicode-Based Framework

    The decision was made to integrate Unicode character handling in the upcoming version of StarOffice to support all European, Asian, and BiDi languages, and to base the new internationalization framework on UNO. In just four weeks, the StarOffice development team succeeded in changing about 7 million lines of C++ code to Unicode. This was possible largely because character representation in StarOffice 5.2 was handled using a C++ String class and was platform-independent. Another advantage was that more than 80% of the code was system-independent.

    The Unicode conversion process involved the following stages:

    • Creating a new class UniString

      Before creating the new class UniString, the class String was renamed to ByteString. The class UniString implementation includes conversions from and to the class ByteString for several character encodings. The class UniString was designed to have the same methods and functionality as the class ByteString. StarOffice developers added Stream methods to write ByteString from UniString and to read ByteString into UniString.

    • Using a macro to switch String to UniString

      #define String UniString

      With this macro in place, all StarOffice code that refers to String uses UniString. Since the UniString class has all the methods of ByteString, the String class is replaced seamlessly. All file read/write code was changed to explicitly read and write ByteString. Stream operators << and >> were implemented for ByteString but were not implemented for UniString, in order to force developers to use the right methods.
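    The two stages above can be sketched in a few lines. This is a minimal illustration only, not the StarOffice source: the class bodies, the Encoding enum, ToByteString(), and LengthOf() are hypothetical, and only one legacy encoding (Latin-1) is handled.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical, much-simplified stand-ins for the real StarOffice classes.
enum class Encoding { Latin1 };  // the only legacy encoding in this sketch

class ByteString {
public:
    ByteString() = default;
    explicit ByteString(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
    std::size_t Len() const { return bytes_.size(); }
private:
    std::string bytes_;
};

class UniString {
public:
    UniString() = default;
    explicit UniString(std::u16string units) : units_(std::move(units)) {}

    // Conversion from a legacy byte string. Latin-1 maps 1:1 onto U+0000..U+00FF.
    UniString(const ByteString& src, Encoding) {
        for (unsigned char c : src.bytes()) {
            units_.push_back(static_cast<char16_t>(c));
        }
    }

    // Conversion back to bytes. Characters outside Latin-1 become '?'.
    ByteString ToByteString(Encoding) const {
        std::string out;
        for (char16_t u : units_) {
            out.push_back(u <= 0xFF ? static_cast<char>(static_cast<unsigned char>(u)) : '?');
        }
        return ByteString(out);
    }

    // Same method name as ByteString, so existing callers keep compiling.
    std::size_t Len() const { return units_.size(); }
    const std::u16string& units() const { return units_; }
private:
    std::u16string units_;
};

// The switch: code written against "String" now compiles against UniString.
#define String UniString

// Legacy-style code, textually unchanged, now operating on Unicode strings.
inline std::size_t LengthOf(const String& s) { return s.Len(); }
```

    Because UniString has no implicit conversion from ByteString in this sketch, every place that crosses the byte/Unicode boundary must name an encoding, which mirrors the intent of leaving the stream operators unimplemented for UniString.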


    The resource file system in StarOffice 5.2 was already enabled to read UTF-8 strings and display Unicode; however, developers implemented additional Unicode file and clipboard I/O for the upcoming version of StarOffice.

    Unicode Conversion Problems

    To change to Unicode, you need a good base Unicode String class. This class must be able to insert, search, compare, and replace any ASCII characters because you do not want to change all strings to Unicode. For example, in RTF or HTML files, you only want to convert specific content. You do not want to convert tokens, which are ASCII characters, to Unicode characters.
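    As an illustration of this requirement, the following sketch searches a Unicode string for an ASCII token without first converting the token. The function name findAscii is hypothetical, and std::u16string stands in for the UniString class.

```cpp
#include <cstddef>
#include <string>

// Returns the index of the first occurrence of an ASCII token inside a
// UTF-16 string at or after 'from', or std::u16string::npos if absent.
// Each ASCII byte is widened on the fly, so callers can keep markup
// tokens (RTF, HTML) as plain char literals.
std::size_t findAscii(const std::u16string& text,
                      const std::string& token,
                      std::size_t from = 0) {
    for (std::size_t i = from; i + token.size() <= text.size(); ++i) {
        std::size_t j = 0;
        while (j < token.size() &&
               text[i + j] == static_cast<char16_t>(static_cast<unsigned char>(token[j]))) {
            ++j;
        }
        if (j == token.size()) {
            return i;
        }
    }
    return std::u16string::npos;
}
```

    For example, an HTML parser could locate "</b>" in a buffer that holds Japanese text between the tags while the tag literals stay in ASCII.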

    New code converters were introduced to convert from Unicode to the legacy code set and vice versa. Another requirement was the ability to load and save data in files, such as configuration files, database files, and other file formats. For example, you must be able to save data in its own binary format and in other third-party formats, such as text, RTF, and WinWord.

    An Extensible and Pluggable Internationalization Framework

    The StarOffice architecture is based on a layered approach to allow easy porting to different platforms. There are four well-defined layers:

    • System Abstraction Layer - This layer encapsulates all system specific APIs and provides a consistent object-oriented API to access system resources.
    • Infrastructure Layer - This layer provides a platform independent environment for building applications, components, and services.
    • Framework Layer - This layer provides the environment for each application and all shared functionality, such as dialog boxes, file access, and configuration management.
    • Application Layer - This layer includes all StarOffice applications. The way these applications interact is based on the lower layers.


    Figure 1: StarOffice Layered Architecture



    The upcoming version of StarOffice will enable developers to create customized applications using modules in the framework layer. Although these modules are developed using C++, they are all UNO components. UNO components can be used by modules in other languages and can run on different hosts. The new internationalization framework has several UNO components. This means that it is accessible to any component in any language. StarOffice internationalization requirements include:

    Unicode 3.0 Support

    To enable multilingual document processing on all platforms, StarOffice uses Unicode to represent characters. The internationalization framework must provide a character classification mechanism to support Unicode 3.0. The character classification API must be able to handle multiple code points per character.

    Encapsulation

    In the future, the StarOffice development team plans to support up to 76 locales. All locale-sensitive behavior must be encapsulated in the internationalization framework APIs to support additional locales. For example, users might want to search a document for a particular string. The search might include an option to perform a case-insensitive search; however, case-insensitive searches are irrelevant in the case of Japanese documents. For Japanese, it makes more sense to perform a search without distinguishing between katakana and hiragana characters. These options are locale-sensitive and must be encapsulated within the internationalization framework.
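    The search example can be sketched as follows. This is an assumption-laden illustration, not the framework API: the names LooseMatch, looseMatchFor, fold, and looseEquals are hypothetical, case folding covers ASCII only, and kana folding relies on the fixed 0x60 offset between the katakana range U+30A1..U+30F6 and the hiragana range U+3041..U+3096.

```cpp
#include <cstddef>
#include <string>

// Hypothetical "loose match" modes hidden behind one application-level option.
enum class LooseMatch { IgnoreCase, IgnoreKanaType };

// The framework, not the application, decides what "loose" means per language.
LooseMatch looseMatchFor(const std::string& language) {
    return language == "ja" ? LooseMatch::IgnoreKanaType : LooseMatch::IgnoreCase;
}

char16_t fold(char16_t c, LooseMatch mode) {
    if (mode == LooseMatch::IgnoreKanaType) {
        // Map katakana onto the corresponding hiragana code point.
        return (c >= 0x30A1 && c <= 0x30F6) ? static_cast<char16_t>(c - 0x60) : c;
    }
    // ASCII-only case folding, enough for the sketch.
    return (c >= u'A' && c <= u'Z') ? static_cast<char16_t>(c + 0x20) : c;
}

bool looseEquals(const std::u16string& a, const std::u16string& b,
                 const std::string& language) {
    if (a.size() != b.size()) {
        return false;
    }
    const LooseMatch mode = looseMatchFor(language);
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (fold(a[i], mode) != fold(b[i], mode)) {
            return false;
        }
    }
    return true;
}
```

    The caller passes only a language and never names the folding rule, which is the kind of encapsulation the requirement describes.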

    Pluggable Locale Support

    Since StarOffice supports many locales, the locale support is prone to error. The new internationalization framework must make it easier to add or modify locale behavior. If a customer finds a bug in the behavior of a specific locale, the internationalization framework must enable you to remove the error-prone module and replace it with a new one without affecting the StarOffice binary. By developing the internationalization framework using the UNO component model, locale behavior can be easily modified in the UNO repository.

    Collation

    Users can choose more than one collation algorithm to sort data. This means that collation APIs must provide an interface to query the collation algorithms for the locale and enable users to select the collation algorithm that they want to use. Collation can be used by end-users to sort data, as well as internally by the application to sort file names and font names. The collation rule that an application uses to sort and display font or file names does not have to be very strict. For example, in the Japanese locale, the application can ignore the difference between half-width and full-width characters. The options are locale-specific and cannot be specified in the application. The collation API must provide abstract and yet easy-to-use options that map onto locale-sensitive options.
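    A rough sketch of such an abstract option follows. It is not a real collation algorithm: the Collator struct and foldWidth function are hypothetical, comparison is by code unit value, and only the full-width ASCII variants (U+FF01..U+FF5E) are folded to their half-width counterparts.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Fold full-width ASCII variants onto ordinary ASCII (offset 0xFEE0).
char16_t foldWidth(char16_t c) {
    return (c >= 0xFF01 && c <= 0xFF5E) ? static_cast<char16_t>(c - 0xFEE0) : c;
}

// Hypothetical collator with one abstract, locale-mapped option.
struct Collator {
    bool ignoreWidth = false;

    // Returns a negative value, zero, or a positive value, like strcmp.
    int compare(const std::u16string& a, const std::u16string& b) const {
        const std::size_t n = std::min(a.size(), b.size());
        for (std::size_t i = 0; i < n; ++i) {
            const char16_t x = ignoreWidth ? foldWidth(a[i]) : a[i];
            const char16_t y = ignoreWidth ? foldWidth(b[i]) : b[i];
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        if (a.size() == b.size()) {
            return 0;
        }
        return a.size() < b.size() ? -1 : 1;
    }
};
```

    An application sorting font names would enable the loose option, while a user-initiated sort could leave it off and select a stricter algorithm.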

    Number Formatter

    In StarOffice 5.2, the number formatter makes extensive use of locale data and number format codes provided by the internationalization framework. For the upcoming version of StarOffice, new keyword symbols, parsing methods, and string output methods have been developed to enable the number formatter to make use of the new calendar API, and to use different calendars in the same format code. One particular goal was not only to create a calendar format that behaves the same way as that in Japanese Microsoft Excel, for example, but also to display any combination of calendar systems for a locale, as long as the locale data provides information about them.

    Calendar

    The calendar API provides an interface for performing date arithmetic based on various calendars. Even though most of the locales support the Gregorian calendar by default, many locales support additional calendars. For example, the Japanese locale supports the Emperor Era calendar as well as the Gregorian calendar; hence, the calendar API should have an interface to query the available calendars for any locale.
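    The query interface might look like the sketch below. The function names, the calendar identifiers, and the table contents are hypothetical placeholders for data that would come from the locale data files. The era conversion is simplified to whole years and assumes a date inside the Heisei era, which began in 1989.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical lookup: which calendars does a locale offer?
// Locales without an entry fall back to the Gregorian calendar only.
std::vector<std::string> availableCalendars(const std::string& locale) {
    static const std::map<std::string, std::vector<std::string>> table = {
        {"ja_JP", {"gregorian", "japanese_era"}},
    };
    const auto it = table.find(locale);
    if (it != table.end()) {
        return it->second;
    }
    return {"gregorian"};
}

// Simplified date arithmetic between two calendars: Heisei 1 is 1989,
// so the era year is the Gregorian year minus 1988. Valid only for
// dates that fall within the Heisei era.
int heiseiYear(int gregorianYear) {
    return gregorianYear - 1988;
}
```

    With this table, a Japanese-locale user would be offered both calendars, and the year 2001 would be displayed as Heisei 13 when the era calendar is selected.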

    Break Iterator

    The internationalization framework must provide APIs to iterate a string by character, word, line, and sentence. Iterating characters is essential for two reasons:

    • Cursor movement - The UniString class has an array of code points. Since a character can take more than one code point, cursor movement cannot be done by incrementing or decrementing the index.

    • CTL languages, such as Arabic, Thai, and Hebrew - In CTL languages, multiple characters combine to form a display cell. Cursor movement must jump a display cell instead of a single character.
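    Both points can be illustrated with a small sketch of forward cursor movement over a UTF-16 string. The function nextCell is hypothetical, and isCombining recognizes only a small subset of combining marks (U+0300..U+036F plus the Thai above-base and below-base vowels and tone marks); a real break iterator would use full Unicode character data and locale-specific rules.

```cpp
#include <cstddef>
#include <string>

bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Deliberately tiny subset of combining marks, for illustration only.
bool isCombining(char16_t c) {
    return (c >= 0x0300 && c <= 0x036F) ||   // combining diacritical marks
           c == 0x0E31 ||                    // Thai mai han-akat
           (c >= 0x0E34 && c <= 0x0E3A) ||   // Thai vowels above/below
           (c >= 0x0E47 && c <= 0x0E4E);     // Thai tone marks and signs
}

// Returns the index just past the display cell that starts at 'pos'.
std::size_t nextCell(const std::u16string& s, std::size_t pos) {
    if (pos >= s.size()) {
        return s.size();
    }
    std::size_t next = pos + 1;
    // A surrogate pair is one character stored in two array elements.
    if (isHighSurrogate(s[pos]) && next < s.size() && isLowSurrogate(s[next])) {
        ++next;
    }
    // Trailing combining marks belong to the same display cell.
    while (next < s.size() && isCombining(s[next])) {
        ++next;
    }
    return next;
}
```

    For a Thai consonant followed by a vowel sign and a tone mark, one cursor step advances three array positions; for a character outside the Basic Multilingual Plane, it advances two.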

    Line breaking must be highly configurable in desktop publishing applications. The line breaking algorithm must be able to find a line break with or without a hyphenator. The line breaking API must also be able to parse special characters that are illegal if they occur at the end or beginning of a line. The character, word, and line breaking algorithms are locale-sensitive and must be pluggable.




    The New StarOffice Internationalization Framework Architecture

    The new StarOffice internationalization framework includes the following major components:

    • Locale data
    • Character classification
    • Collation
    • Break iterator
    • Transliteration
    • Find/Replace

    Each component of the framework is a UNO component. The following figure shows the interaction between various components:



    Figure 2: StarOffice Internationalization Framework Architecture

    Since all components are locale-sensitive, each component is registered under a unique service name. StarOffice defines the naming convention for each component. For example, the service name convention for a break iterator object is as follows:

    com.sun.staroffice.i18n.impl.<locale_name>.breakiterator.

    If you run StarOffice in the Thai locale, it loads the following service:

    com.sun.staroffice.i18n.impl.th_TH.breakiterator.

    Developers can register their Thai break iterator module against the service name and StarOffice will automatically load it at runtime. By following the naming convention, any locale-sensitive component can be plugged into the StarOffice binary repository dynamically. Hence, StarOffice locale behavior can be enhanced without recompiling.

    Even though every locale-sensitive component can be registered using a unique service name, it is not practical to register all components for all locales. For example, the break iterator for the locales of other Spanish-speaking regions is the same as that for Spain (es_ES). The modules referred to as stubs in Figure 2 provide fallback functionality; that is, a service that is guaranteed to be available. StarOffice modules use the stub modules to parse the locale information. A stub module attempts to locate a locale-sensitive module using the service naming convention. If such a service is unavailable, the stub module tries once more without the country name. If that attempt also fails, it loads a default module. For example, to locate a break iterator for the French Canadian locale, it attempts to locate a break iterator service for fr_CA. If the break iterator for fr_CA is unavailable, the stub module attempts to locate a break iterator for fr. If that is not available either, it falls back to the default break iterator.
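    The fallback sequence can be sketched as a three-step lookup. This is an illustration under stated assumptions: the registry is modeled as a plain set of service names rather than the UNO registry, the functions serviceName and resolveService are hypothetical, and "default" is an assumed placeholder for the name of the default module.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Builds a service name following the convention described in the text.
std::string serviceName(const std::string& locale, const std::string& component) {
    return "com.sun.staroffice.i18n.impl." + locale + "." + component;
}

// Stub-style lookup: full locale first, then language only, then default.
std::string resolveService(const std::set<std::string>& registry,
                           const std::string& locale,
                           const std::string& component) {
    const std::string full = serviceName(locale, component);
    if (registry.count(full) > 0) {
        return full;
    }
    // Second attempt: drop the country part, for example fr_CA becomes fr.
    const std::size_t sep = locale.find('_');
    if (sep != std::string::npos) {
        const std::string languageOnly = serviceName(locale.substr(0, sep), component);
        if (registry.count(languageOnly) > 0) {
            return languageOnly;
        }
    }
    // Last resort: the default module, assumed to be always available.
    return serviceName("default", component);
}
```

    With only fr and th_TH break iterators registered, a request for fr_CA resolves to the fr service, th_TH resolves to itself, and es_MX resolves to the default.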

    Adding a New Locale

    1. Identify any locale-sensitive modules that can be reused, for example, calendar and collation. If reusable modules exist, note down the service names of these modules.
    2. Check if you need any special break iterator and character classification modules. If these modules require modifications to the default one provided by StarOffice, then develop them.
    3. Create locale data in XML format. The XML file requires data that is specific to the locale, for example, currency symbols, format codes, collators, and calendars.
    4. Run the XML parser to generate the C++ files.
    5. Provide the C functions component_writeInfo() and component_getFactory() to register the locale data object, break iterator object, and character classification object.
    6. Compile the C++ files to generate shared objects or DLLs.
    7. Identify the StarOffice binary repository; the binary repository file is usually applicat.rdb.
    8. Run the regcomp utility to register the DLL with applicat.rdb.
    9. Create a new XML file for the locale data.
    10. Create a locale-specific collator, transliteration, and UNO object.
    11. Reference the UNO object service names in the locale data XML file and register them in the UNO registry.
    12. Convert the locale data XML file into a C++ file and then convert the C++ file into a DLL.



    Conclusion

    StarOffice is not just an office productivity suite; it is a completely object-oriented platform for developing any cross-platform desktop application. The new StarOffice internationalization framework is Unicode-based and offers a rich set of APIs, which meet the requirements of existing applications and are also generic enough to be used for any application developed on this platform. The APIs encapsulate all localization behavior inside the internationalization framework. This means that localization developers can add new locales or enhance existing locale behavior to meet regional market requirements without modifying the StarOffice binary. The internationalization framework is accessible to CORBA/UNO components, which makes the StarOffice internationalization framework universal.




    Further Information

    For a demo of StarOffice multilingual features, why not visit our booth at the
    19th International Unicode Conference in San Jose, California, September 10-14.




    References

    OpenOffice Home Page

    OpenOffice Localization and Internationalization Project

    UNO Home Page

    Introduction to UNO

    Internationalization API



