Abstract
Providing current and correct locale data has historically been the responsibility of each platform owner, with varying degrees of success. The result is the long-standing problem of inconsistencies and errors in locale data. This problem is now being addressed by an open-source project, the Common Locale Data Repository (CLDR). Begun in 2002 by industry leaders such as Sun and IBM, who have a large stake in a cross-platform solution, the CLDR project will provide the most widely and internationally accepted locale data values across platforms. This article describes CLDR and identifies the problems it aims to solve. It also discusses the project's status and future plans for wider industry acceptance.

Introduction
Do today's platforms contain correct locale data? Probably not, if "correct" means the most widely and internationally accepted current values. Because there is no standard set of locale data across platforms, each platform owner maintains its own locale data. Delivering correct locale data to end users is not a trivial task for platform owners.

The need to reconcile cross-platform locale data differences and to increase efficiency in locale data maintenance prompted Sun's globalization engineers to become involved in the Common XML Locale Repository (CXLR) project, started by the OpenI18n Free Standards Group. As the project developed, its name was changed to the Common Locale Data Repository (CLDR) project.

Many platform vendors today face the difficult task of increasing cross-platform interoperability without sacrificing global support. Ensuring operability across heterogeneous platforms, systems, and environments is now an important consideration for globalization engineers delivering locale support. Some specific problems facing platform vendors include handling linguistic and cultural differences, deciding locale definitions, and reconciling differences in locale data.

Handling Linguistic and Cultural Differences
Globalization has fueled the need for greater language support in today's platforms. Linguistic and cultural differences throughout the world present challenges to the globalization engineers who deliver support for them, especially when reconciling differences in cross-platform locale data. Traditional locale data covers a broad set of linguistic and cultural elements, including date and time formats, number and currency formats, and collation order.

Deciding Locale Definitions
Languages and countries have varying requirements for processing or presenting data. The term "locale" is generally understood to mean a set of user preferences addressing linguistic and cultural requirements. Locales are used by many vendors in many platform environments. Although locales are widely used today, their implementations vary considerably from one vendor to another and even among products by the same vendor. The proliferation of different locale mechanisms creates difficulties when trying to reconcile locale data differences. Examples of platforms with their own locale data include OpenOffice.org, ICU, and POSIX and POSIX-like operating systems such as Solaris, Linux, and AIX.

Also, what a locale means, what it should contain, and whether a standard for a locale currently exists are not widely agreed upon in the industry. Certain standards are available that describe the general structure of a locale, but even systems based on open standards (such as X/Open) can vary in their locale support. Differences in the number of locales delivered, the data contained within the locales, and the values of the locale data are common. Locales from the same vendor often share the same locale name or id, yet contain different locale support.

Platform vendors often define their own locale definitions or tailor pre-existing definitions to suit their technologies. Some vendors even allow users to define their own locales, often resulting in conflicts with pre-existing standard locales shipped by the vendor.

Aside from user-defined locales, variants can exist within a particular locale. These variants often identify a more specific version of a locale, offering additional customization. For example, zh_TW.BIG5@radical and sv_SE.ISO8859@euro are variants offering tailored collation and currency support. Platform vendors are free to define and name variants as they see fit.
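
To make the structure of such names concrete, the following Python sketch splits a locale name into its parts. It assumes the common POSIX-style layout language[_territory][.codeset][@modifier]; the names LocaleName and parse_locale_name are hypothetical and introduced only for illustration. Since vendors name variants freely, real platforms may accept names this pattern rejects (for example "C" or "POSIX").

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]  # the "variant" part, e.g. "euro" or "radical"


# Assumed layout: language[_territory][.codeset][@modifier]
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z]{2}))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_locale_name(name: str) -> LocaleName:
    """Split a POSIX-style locale name into its components."""
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized locale name: {name!r}")
    # Optional groups that did not take part in the match come back as None.
    return LocaleName(**match.groupdict())


if __name__ == "__main__":
    for example in ("zh_TW.BIG5@radical", "sv_SE.ISO8859@euro"):
        print(example, "->", parse_locale_name(example))
```

Parsing the name only recovers what the vendor chose to encode in it. It says nothing about which data the locale actually contains, which is the underlying problem.
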
No standard definition exists that prescribes which combinations of names can exist or how the technologies that use them should behave.

Reconciling Differences in Locale Data
Clearly every locale must have locale data, but just as there are differences between the locale mechanisms for various platforms, there are also differences between the locale data itself. As standards change, locale offerings and locale data are adapted to reflect these changes. Not every platform changes its locale data at the same time, which can create compatibility and versioning problems. In addition, fallback mechanisms exist for cases where locale support cannot be found, but these mechanisms can vary between platforms, also causing conflicts.

The origin of a platform's locale data is not always easy to determine. Standards exist for various countries, but many countries are not represented by a relevant standards body. In these cases platform vendors must decide which locale data to use with the help of linguistic and cultural experts. As a result, differences in locale data can emerge between platforms. In addition, there is no way to determine whether standards have been used, or which ones. A locale name or locale id is often the only clue as to the origin of the locale data.

Locating and retrieving locale data from the many organizations that define and maintain internationalization standards and specifications can be time consuming due to their decentralized nature. Because platform locale data and user-defined locale data are not always clearly noted or created with agreed-upon standards, they must be defined by platform vendors themselves.

Locale Management at Sun
Sun's globalization engineers recognize the importance of the locale data issues previously discussed. Locale mechanisms with subtle differences in locale data increase the amount of effort required to produce global products.
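
One such subtle difference is the fallback behavior mentioned earlier. The Python sketch below shows one possible fallback policy, not the policy of any particular platform: it drops the modifier first, then the territory, and finally falls back to a catch-all entry. The function names fallback_chain and lookup, the choice to ignore the codeset, and the catch-all name "root" are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def fallback_chain(name: str, root: str = "root") -> List[str]:
    """Return candidate locale names, most specific first (assumed policy)."""
    base, _, modifier = name.partition("@")
    base = base.split(".", 1)[0]  # assumption: the codeset does not select data
    parts = base.split("_")
    chain = []
    if modifier:
        chain.append(f"{base}@{modifier}")
    # Drop trailing components one at a time: sv_SE, then sv.
    for end in range(len(parts), 0, -1):
        chain.append("_".join(parts[:end]))
    chain.append(root)
    return chain


def lookup(data: Dict[str, Dict[str, str]], name: str, key: str) -> Optional[str]:
    """Return the first value found for key along the fallback chain."""
    for candidate in fallback_chain(name):
        values = data.get(candidate, {})
        if key in values:
            return values[key]
    return None


if __name__ == "__main__":
    print(fallback_chain("sv_SE.ISO8859@euro"))
```

A platform that instead tried the territory-less name before dropping the modifier would resolve the same request to different data, which is the kind of conflict described above.
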
Much of the locale functionality is identical across platforms, but different project infrastructures often mean that each platform must be treated separately from a locale data maintenance perspective.

A cross-platform office productivity suite such as StarOffice (based on OpenOffice.org) cannot use the internationalization APIs provided by the underlying platform, since the platform-specific APIs can be inconsistent or insufficient to support desktop applications. The StarOffice internationalization framework addresses this problem by providing a rich set of APIs to internationalize applications. The framework uses XML locale data files that are parsed to generate C++ code, which in turn is compiled into a UNO object representing a locale. These XML files represent a locale data baseline requiring maintenance.

Java inherited a significant portion of its locale data as a result of the internationalization classes provided in the JDK 1.1 release. While these classes offered a better level of cross-platform internationalization support, they also introduced a separate locale data baseline that needed to be maintained and reconciled.

Given this complex picture, engineers at Sun eagerly began the Locale Data Repository project, which has evolved to use CLDR and the Locale Data Markup Language from OpenI18n.org. Over several months, Sun's globalization engineers developed locale data for all of Sun's projects and platforms in a single repository of XML files, along with a set of tools to generate platform-specific locale data source files automatically. The locale data for each platform is stored in a generic format, Locale Data Markup Language (LDML). LDML is the XML format used in CLDR. This forms the core storage and interchange format of the project.

Status of the CLDR Project
In June 2003, the OpenI18n group announced the release of the Locale Data Markup Language specification, version 1.0.
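
As a rough illustration of what an XML interchange format makes possible, the Python sketch below reads a small LDML-style fragment and emits a POSIX-style LC_NUMERIC section. It is not Sun's tooling. The fragment is deliberately simplified, its element names are not guaranteed to match any particular LDML version, and the names SAMPLE and to_posix_lc_numeric are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified LDML-style fragment; the element layout is an assumption
# made for this example, not a copy of a real CLDR file.
SAMPLE = """\
<ldml>
  <identity>
    <language type="de"/>
    <territory type="DE"/>
  </identity>
  <numbers>
    <symbols>
      <decimal>,</decimal>
      <group>.</group>
    </symbols>
  </numbers>
</ldml>
"""


def to_posix_lc_numeric(ldml_text: str) -> str:
    """Generate a POSIX-style LC_NUMERIC section from an LDML-style document."""
    root = ET.fromstring(ldml_text)
    language = root.find("identity/language").get("type")
    territory = root.find("identity/territory").get("type")
    decimal = root.findtext("numbers/symbols/decimal")
    group = root.findtext("numbers/symbols/group")
    lines = [
        f"# generated for {language}_{territory}",
        "LC_NUMERIC",
        f'decimal_point "{decimal}"',
        f'thousands_sep "{group}"',
        "END LC_NUMERIC",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_posix_lc_numeric(SAMPLE))
```

A second generator reading the same fragment could target another platform's format, so that a value is corrected once in the XML and then regenerated everywhere.
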
The purpose of the project is to devise a general XML format for the exchange of locale data for use in application and system development, and to gather, store, and make available data generated in that format.

In January 2004, CLDR version 1.0 was released. The data in version 1.0 can be referenced, but enhancements and fixes will come in future versions. This release begins the normalization process, in which differences in locale data are ironed out. For the next version, people are asked to review the data for the languages and countries they are familiar with and to file bug reports as appropriate. For a link to the version 1.0 download, see the References section.

CLDR will likely be transferred to the Unicode Consortium in spring 2004.

Summary
Now that Unicode has established itself as a global standard for character encoding, developers are focusing attention on inconsistencies in locale data across the industry and have identified a promising solution. CLDR, an industry-wide locale data repository, represents a big step toward the goal of shipping correct locale data on all of today's platforms. A single repository that is visible to and vetted by international and national standards bodies, platform vendors, and end users will provide the singular focus required to create a set of correct locale data that is widely accepted and adopted by the industry.

CLDR supplies locale data for a wide variety of information types (such as dates, times, numbers, and currencies). Future development might include supplying data for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and support for other services.

References
CLDR page on OpenI18n.org:
CLDR status and download:
To join the Linux Application Development Environment (LADE) Workgroup, see:
Paper on CLDR to be delivered at IUC 25: