
XLIFF: An Aid To Localization

 


Introduction

Translators today can expect to receive documents for translation in any one of several formats:

  • HTML
  • Docbook
  • Microsoft Word (many possible versions)
  • XML (many possible DTDs!)
  • FrameMaker
  • Software resource bundles (many different formats such as .properties, .po, .msg, .java, etc.)
  • etc.

From a translator's point of view, this is quite a difficult mix to deal with. You would need to maintain several editing tools, be proficient in many file formats (knowing the syntax and grammar of each type), and that's before you've even started to translate the content.

Localization engineers face a similar problem: it's difficult to write tools for each file format. For example, if your boss asks you to calculate the number of new words for translation between the last delivery and the current one, you need either a tool capable of dealing with all formats or a separate tool for each format.

Normally during localization, files are processed by tools such as translation memories and machine translation systems. Translation memory systems, known as TM systems, work by looking up segments in a database containing a large number of previously translated segments and their translations. (Segments are pieces of source files, usually sentences, that can be translated reasonably independently.) The database might contain segments that match the input segment exactly or segments that are similar to the segment presented for translation. These translations are then provided to the translator as suggested translations for each segment.
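The lookup described above can be sketched in a few lines of Python. This is only an illustration, not how any particular TM system works: the in-memory dictionary, the French translations, the 0.7 threshold, and the function name tm_lookup are all assumptions, and the similarity score comes from Python's difflib (a character-based ratio) rather than from a real TM matching algorithm.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory translation memory: stored source -> stored translation.
TM = {
    "This is a sentence.": "Ceci est une phrase.",
    "This is a short sentence.": "Ceci est une phrase courte.",
}

def tm_lookup(segment, memory, threshold=0.7):
    """Return (score, stored_source, translation) tuples, best match first."""
    matches = []
    for stored_source, translation in memory.items():
        # ratio() is 1.0 for identical strings and lower for similar ones.
        score = SequenceMatcher(None, segment, stored_source).ratio()
        if score >= threshold:
            matches.append((score, stored_source, translation))
    return sorted(matches, reverse=True)

suggestions = tm_lookup("This is a sentence.", TM)
for score, stored_source, translation in suggestions:
    print(f"{score:.0%}  {stored_source}  ->  {translation}")
```

The first suggestion is the exact match; the second is a fuzzy match that a translator would still need to edit.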

Machine translation systems, known as MT systems, are another type of translation technology. Instead of using a large database of existing translations, a machine translation system uses a set of language-specific linguistic rules that describe how to translate sentences into the target language.

The translations these systems produce might undergo some post editing, and any remaining untranslated text is given to human translators to complete. Translations are then reviewed, and sometimes commented on, corrected, or retranslated. Source formats tend not to have support for these localization processes.

XLIFF, which stands for XML Localization Interchange File Format, is a format for exchanging localization data. XLIFF could be used to exchange data between companies, such as a software publisher and a localization vendor, or between localization tools, such as TM systems and MT systems.

What is XLIFF?

XLIFF is an XML-based format that enables translators to concentrate on the text to be translated. Likewise, since it's a standard, manipulating XLIFF files makes localization engineering easier: once you have converters written for your source file formats, you can simply write new tools to deal with XLIFF and not worry about the original file format. It also supports a full localization process by providing tags and attributes for review comments, the translation status of individual strings, and metrics such as word counts of the source sentences.
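To make the engineering benefit concrete, here is a rough sketch of the "new words since the last delivery" tool mentioned in the introduction, written once against XLIFF instead of once per source format. The sample files, the function names, and the whitespace-based word count are assumptions for illustration; the sketch also assumes that source elements contain plain text with no inline markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical XLIFF files from the previous and the current delivery.
PREVIOUS = '''<xliff version="1.0">
  <file original="index.html" source-language="en" datatype="html">
    <body>
      <trans-unit id="n1"><source>This is a sentence.</source></trans-unit>
    </body>
  </file>
</xliff>'''

CURRENT = '''<xliff version="1.0">
  <file original="index.html" source-language="en" datatype="html">
    <body>
      <trans-unit id="n1"><source>This is a sentence.</source></trans-unit>
      <trans-unit id="n2"><source>This is a new sentence.</source></trans-unit>
    </body>
  </file>
</xliff>'''

def source_segments(xliff_text):
    """Return the text of every source element that sits directly in a trans-unit."""
    root = ET.fromstring(xliff_text)
    return [unit.findtext("source") for unit in root.iter("trans-unit")]

def new_word_count(previous, current):
    """Count the words in segments that did not appear in the previous delivery."""
    seen = set(source_segments(previous))
    return sum(len(s.split()) for s in source_segments(current) if s not in seen)

new_words = new_word_count(PREVIOUS, CURRENT)
```

The same function works regardless of whether the original files were HTML, DocBook, or resource bundles, because only the converters need to know about those formats.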

The XLIFF format grew out of a collaboration between a number of companies, including Sun Microsystems, but was soon brought under the management of an OASIS Technical Committee. In April 2002, the first Committee Specification for XLIFF was published. This is available at http://www.oasis-open.org/committees/xliff/documents/xliff-specification.htm.

The XLIFF format aims to:

  • Separate localizable text from formatting.
  • Enable multiple tools to work on source strings and add to the data about the string.
  • Store information that is helpful in supporting a localization process.

The XLIFF File

In its most basic form, the XLIFF file consists of one or more file elements. Each of these contains a header and a body section. The header contains project data, such as contact information, project phases, pointers to reference material, and information on the skeleton file (explained below). The body section contains trans-unit elements--the main elements in an XLIFF file.
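The structure just described can be seen in a small sample. The sketch below embeds a minimal XLIFF document in a Python string and walks it with the standard library's ElementTree. The file names and segment text are made up for illustration, and the element and attribute names follow our reading of the XLIFF 1.0 specification, so check them against the specification before relying on them.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical XLIFF document: one file element with a header and a body.
XLIFF_DOC = '''<xliff version="1.0">
  <file original="index.html" source-language="en" datatype="html">
    <header>
      <skl><external-file href="index.html.skl"/></skl>
    </header>
    <body>
      <trans-unit id="n1">
        <source>This is a sentence.</source>
      </trans-unit>
      <trans-unit id="n2">
        <source>This is another sentence.</source>
      </trans-unit>
    </body>
  </file>
</xliff>'''

root = ET.fromstring(XLIFF_DOC)

# For each file element, record the original file name, the skeleton
# reference from the header, and the number of trans-units in the body.
summary = []
for file_el in root.findall("file"):
    skeleton = file_el.find("header/skl/external-file").get("href")
    units = file_el.findall("body/trans-unit")
    summary.append((file_el.get("original"), skeleton, len(units)))
```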

The trans-unit elements store localizable text and its translations. These elements represent segments (usually sentences in the source file that can be translated reasonably independently). The trans-unit elements contain source, target, alt-trans, and a handful of other elements. The example below shows how they would be used.

Example 1. Example of a trans-unit Element

...
 
    <trans-unit id="n1">
      <source>This is a sentence.</source>
      <target xml:lang="fr">Translation of "This is a sentence."</target>
      <alt-trans match-quality="100%" tool="TM_System">
        <source>This is a sentence.</source>
        <target xml:lang="fr">TM match for "This is a sentence."</target>
      </alt-trans>
      <alt-trans match-quality="70%" tool="TM_System">
        <source>This is a short sentence.</source>
        <target xml:lang="fr">Fuzzy TM match for "This is a sentence."</target>
      </alt-trans>
    </trans-unit>

...

This example shows a pseudo-translated segment. The trans-unit element contains an id attribute used to determine where the segment goes in the original document. The trans-unit element has a source and a target element as children. The source element represents the source text (the text to be translated) in the original document. The target element represents the currently accepted translation of the source after linguistic review has taken place.

The example also shows the alt-trans elements. These represent translation alternatives for the source segment in the trans-unit element. A translation alternative is a translation found in a translation memory, a translation generated by a machine translation system, or a translation suggested by a translator or reviewer. These elements contain source and target elements. In this example, target elements are the suggested translations of the trans-unit source. The source element represents the text that was matched against, from a TM system, for example.

The alt-trans element contains attributes such as match-quality and tool. These provide information about the alternative translations, such as which tool produced them, or in the case of match-quality, a measure of the quality of the translation. The algorithm for generating the match-quality value in a given alt-trans element is specific to the tool that generated it. However, for a translation memory system, it is typically the percentage of words in the source element that match the source from its database.
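As an illustration of how a tool might consume these elements, the following sketch parses the trans-unit from Example 1 and lists its alternatives, best match first. The function name alternatives is our own, and the sketch assumes that match-quality is always written as a percentage, as it is in the example; real tools may write it differently.

```python
import xml.etree.ElementTree as ET

# The trans-unit from Example 1.
TRANS_UNIT = '''<trans-unit id="n1">
  <source>This is a sentence.</source>
  <target xml:lang="fr">Translation of "This is a sentence."</target>
  <alt-trans match-quality="100%" tool="TM_System">
    <source>This is a sentence.</source>
    <target xml:lang="fr">TM match for "This is a sentence."</target>
  </alt-trans>
  <alt-trans match-quality="70%" tool="TM_System">
    <source>This is a short sentence.</source>
    <target xml:lang="fr">Fuzzy TM match for "This is a sentence."</target>
  </alt-trans>
</trans-unit>'''

def alternatives(unit):
    """Return (quality, tool, matched_source, suggestion) tuples, best first."""
    results = []
    for alt in unit.findall("alt-trans"):
        # Assumes values such as "100%"; strip the sign and compare numerically.
        quality = int(alt.get("match-quality", "0%").rstrip("%"))
        results.append((quality, alt.get("tool"),
                        alt.findtext("source"), alt.findtext("target")))
    return sorted(results, key=lambda r: r[0], reverse=True)

unit = ET.fromstring(TRANS_UNIT)
for quality, tool, matched, suggestion in alternatives(unit):
    print(f"{quality}% ({tool}): {suggestion}")
```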

XLIFF: The Benefits

Format handling

One of the main problems in localizing files is the complexity of the various file formats. The formatting in the source files can be divided into two types: inline formatting and structural formatting. Inline formatting is part of the flow of the text; for example, <b> tags in HTML. Structural formatting can be considered outside the flow of the text. An example of this is the <body> tag in HTML. XLIFF has mechanisms to handle both of these cases.

When a file is converted to XLIFF, the structural formatting is extracted and stored in a skeleton file. The skeleton file indicates where the text from any given trans-unit should be placed. The format of this file is not defined by the XLIFF specification, so conversion tools can use any format they choose. The conversion tools should be able to recover the original source file, given the skeleton file and the XLIFF file.

None of the structural formatting appears in the XLIFF file, so this form of complexity is hidden from translators and tool writers.
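Because the skeleton format is left to the conversion tool, the following is just one possible approach, sketched with regular expressions. The %%id%% placeholder syntax, the function names, and the sample HTML are assumptions; a real converter would use a proper HTML parser and would also deal with inline markup, which this sketch ignores.

```python
import re

# Text that sits between two tags and contains at least one non-space character.
TEXT_RUN = re.compile(r"(?<=>)([^<>]*[^<>\s][^<>]*)(?=<)")

def extract(html):
    """Split a document into a skeleton and a dict of trans-unit id -> text."""
    units = {}

    def stash(match):
        unit_id = f"n{len(units) + 1}"
        units[unit_id] = match.group(1)
        return f"%%{unit_id}%%"  # placeholder marks where the text belongs

    skeleton = TEXT_RUN.sub(stash, html)
    return skeleton, units

def merge(skeleton, translations):
    """Rebuild a document by putting text back into the skeleton."""
    return re.sub(r"%%(\w+)%%", lambda m: translations[m.group(1)], skeleton)

HTML = "<html><body><p>This is a sentence.</p><p>Another one.</p></body></html>"
skeleton, units = extract(HTML)
```

Merging the skeleton with the untranslated units reproduces the original file, and merging it with translated text produces the localized file in the original format.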

Inline formatting cannot be removed completely, as the translated text might require the same treatment. Translators need to know where inline markup appears in source documents, and they need to be able to insert equivalent markup in the translations they generate.

XLIFF has two mechanisms to handle inline formatting. The first mechanism removes the formatting, stores it in the skeleton file, and puts in a placeholder tag to represent where the formatting would go in the text. Two tags are used for this purpose: g and x. The g tag is used for formatting that occurs in pairs, for example <b> and </b>. The x tag is used for nonpaired formatting, such as <img> tags in HTML.

The second mechanism leaves the formatting in place, but nests it inside either the bpt, ept, or it tags. The bpt and ept tags are used to mark up paired formatting, with bpt surrounding the beginning formatting, and ept surrounding the ending formatting. The it element surrounds nonpaired formatting. In these cases, the characters "<" and ">" need to be converted to their respective entities: &lt; and &gt;. For example, the tag <img src="foo.gif"> would be marked up as <it id="a1" pos="1">&lt;img src="foo.gif"&gt;</it>.
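The second mechanism can be sketched for the paired case as follows. This is a simplified illustration under several assumptions: every tag in the segment is treated as one half of a properly nested pair, ids are assigned by a simple counter, and the ept element reuses the id of its matching bpt. Nonpaired tags, which would need the it element, are left out of the sketch.

```python
import re
from xml.sax.saxutils import escape

# Splits a segment into text and tags; the capture group keeps the tags in the result.
TAG = re.compile(r"(</?[A-Za-z][^<>]*>)")

def wrap_paired_tags(segment):
    """Wrap opening tags in bpt and closing tags in ept, escaping the markup."""
    out = []
    open_ids = []  # ids of tags that are still open, innermost last
    next_id = 0
    for i, part in enumerate(TAG.split(segment)):
        if i % 2 == 0:
            out.append(escape(part))  # plain text between tags
        elif part.startswith("</"):
            out.append(f'<ept id="{open_ids.pop()}">{escape(part)}</ept>')
        else:
            next_id += 1
            open_ids.append(next_id)
            out.append(f'<bpt id="{next_id}">{escape(part)}</bpt>')
    return "".join(out)

marked_up = wrap_paired_tags("This is <b>bold</b> text.")
```

Here marked_up holds the content for a source element, with <b> and </b> stored as &lt;b&gt; and &lt;/b&gt; inside the bpt and ept elements.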

These approaches mainly benefit the tools developer. Instead of needing to identify the various types of inline formatting, the developer can leave this to the XLIFF conversion tools and deal with the XLIFF inline formatting markers.

Support for Localization Processes

The XLIFF file allows for a list of phases to be stored in the header of each file element. The phase elements contain attributes to store information on when the phase took place, what tool was used, who the contact person for the phase was, the name of the phase, the name of the process carried out during the phase (for example, translation or review), and a job ID for the entire process. Note elements can also be included so that users may leave comments on a phase. Each target element can contain a phase-name attribute, which links it to a given phase. This allows users to specify which phase introduced a particular translation, which can be useful for review purposes.
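The link between a target and its phase can be followed with a few lines of code. In the sketch below, the phase names, dates, contact names, and job ID are invented, and the attribute names (phase-name, process-name, tool, date, contact-name, job-id) reflect our reading of the XLIFF 1.0 specification, so treat them as assumptions to verify.

```python
import xml.etree.ElementTree as ET

# Hypothetical file element with two phases and two translated segments.
DOC = '''<file original="index.html" source-language="en" datatype="html">
  <header>
    <phase-group>
      <phase phase-name="trans1" process-name="translation" tool="TM_System"
             date="2002-04-15T10:00:00Z" contact-name="A. Translator" job-id="job42"/>
      <phase phase-name="rev1" process-name="review" tool="XLIFF_Editor"
             date="2002-04-20T10:00:00Z" contact-name="B. Reviewer" job-id="job42"/>
    </phase-group>
  </header>
  <body>
    <trans-unit id="n1">
      <source>This is a sentence.</source>
      <target xml:lang="fr" phase-name="rev1">Reviewed translation</target>
    </trans-unit>
    <trans-unit id="n2">
      <source>This is another sentence.</source>
      <target xml:lang="fr" phase-name="trans1">First-pass translation</target>
    </trans-unit>
  </body>
</file>'''

file_el = ET.fromstring(DOC)

# Index the phases by name, then look up the phase behind each target.
phases = {p.get("phase-name"): p for p in file_el.iter("phase")}
origin = {}
for unit in file_el.iter("trans-unit"):
    phase = phases[unit.find("target").get("phase-name")]
    origin[unit.get("id")] = (phase.get("process-name"), phase.get("contact-name"))
```

A reviewer could use a mapping like origin to see at a glance which translations have already been through review and who to contact about them.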

Decoupling Translation Tools

XLIFF allows many separate tools to work on files. While working on source file formats, it is not easy to have multiple tools work on the same files. For example, suppose you had a machine translation tool and a translation memory tool that you wanted to use to localize some files. You would have to integrate the two tools in some way to get them both to work on the file. They could not run sequentially, as the output from either tool would be a translated file. In addition, during review it would be difficult to determine where any given translation came from; that is, whether it came from the MT tool, the TM tool, or a human translator.

If the tools read XLIFF, however, they could be run sequentially. Each tool could write its translation suggestions into an alt-trans element inside the trans-unit, and the best one could be selected when converting back to the original format. In addition, alt-trans elements have attributes for indicating which tool generated the translation, which aids the review process.
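The sequential workflow might look like the sketch below. The two tool functions are stand-ins that write fixed suggestions rather than real TM or MT output, and the tool names, match-quality values, and helper names are all hypothetical. Comparing match-quality across different tools is also a simplification, since each tool computes the value in its own way.

```python
import xml.etree.ElementTree as ET

def quality(alt):
    """Read match-quality as a number, assuming values such as "70%"."""
    return int(alt.get("match-quality").rstrip("%"))

def add_suggestion(unit, tool, match_quality, translation):
    """Append an alt-trans element recording which tool made the suggestion."""
    alt = ET.SubElement(unit, "alt-trans",
                        {"match-quality": f"{match_quality}%", "tool": tool})
    ET.SubElement(alt, "target").text = translation

def tm_tool(unit):
    # Stand-in for a translation memory lookup.
    add_suggestion(unit, "TM_System", 70, "Fuzzy TM match")

def mt_tool(unit):
    # Stand-in for a machine translation system.
    add_suggestion(unit, "MT_System", 50, "MT output")

def accept_best(unit):
    """Copy the highest-quality suggestion into the target element."""
    best = max(unit.findall("alt-trans"), key=quality)
    target = ET.Element("target")
    target.text = best.findtext("target")
    unit.insert(1, target)  # place the target right after the source
    return best.get("tool")

unit = ET.fromstring(
    '<trans-unit id="n1"><source>This is a sentence.</source></trans-unit>')
for tool in (tm_tool, mt_tool):  # the tools run one after the other
    tool(unit)
chosen_tool = accept_best(unit)
```

Neither tool needs to know about the other; each one only reads the source and adds its own alt-trans element.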

Examples

The following screenshot shows an XLIFF editing application. The side-by-side panes represent the source and target elements from trans-units. The text in the left pane is from the source and the text on the right is from the target. The pane at the bottom shows the contents of the alt-trans elements for the highlighted source and target.

As mentioned earlier, XLIFF separates inline formatting from block level formatting. Inline formatting is marked up inside ept and bpt tags. As a result, despite the editor not having any direct knowledge of HTML (the original file format provided for translation), it is able to mark up different inline formatting tags for the translator. These are highlighted for the segment being translated. For example, in the screenshot the segment at the top of the window is highlighted as the one being translated. The <B> and </B> tags are highlighted in red and displayed inline with the text. Structural formatting tags such as <body> would never appear in this application but instead would be stored in the skeleton file.

You can also see that it suggests translation alternatives for one of the segments. These come directly from the alt-trans elements in the XLIFF file, each suggestion having a different match quality.


A screenshot of an XLIFF editor


Note the status bar at the bottom of the window. It currently says that the segment is untranslated. Other possible states could be "translated" or "verified." The editor marks segments for which exact matches were found in the translation memory tool as "100," so the translator could skip these segments altogether. However, they are usually included to give the translator an idea of the flow of the text and how other strings in the file have been translated.

The following screenshot shows a sample of the XLIFF content that was displayed above:


XLIFF source

Note the use of the count-group element to record the number of words to be translated in the source element. In the preceding example, the count-group shows that the source sentence contains 20 words.

Summary

In summary, XLIFF aids localization in a number of ways.

  • XLIFF removes the complexities of localizing different types of source files.
  • XLIFF provides a common platform for localization tools vendors to write to, thus increasing the number of tools available.
  • XLIFF highlights the parts of a file that are important to the localization process.
  • XLIFF provides support to the localization process, through its commenting features, support for phases, and metrics.


Authors: John Corrigan and Tim Foster are software engineers working on translation technologies at Sun Microsystems.

Also by Tim Foster: Translation Technology at Sun.
