Support for the Wubizixing Input Method in the Solaris(TM) Operating System

 

Wubizixing (abbreviated "Wubi") is the latest of several Chinese input methods supported by the Solaris platform. This article provides a brief background of the Wubi input method, general information about Wubi, and details on the main features of the latest version included in the Solaris OS.

Introduction

In 1983, after five years of research on Chinese characters and their computerized input, Professor Wang Yongmin released his five-stroke input method. Since then, Wubi has been used by more than 90% of home users in China and has become a popular method for entering and transcribing Chinese text worldwide.

Since its inception, Wubi has kept pace with the character set standards established by the government of mainland China. An early release, Version-86 Wubi, supported the GB2312-80 character set. After the GB18030 standard was established, Wubi was updated again to support it, along with GBK.

Wubi and Shape-Based Input Methods

As with several other Asian languages, the size of the Chinese character set makes it impractical to design a keyboard with a key for each character. Instead, users rely on front-end utilities called input methods to compose characters on a standard keyboard before sending them to an application.

Input methods are system applications that convert keyboard input into one of the thousands of characters supported by the system. Chinese input methods fall into two main groups: pronunciation-based and shape-based.

In mainland China, Wubi is the primary shape-based input method used to create and transcribe text. Wubi is based on the structure, or shape, of characters rather than on their pronunciation. The main concept behind Wubi is that characters can be built by combining roots.

Wubi allots some 200 radicals, or roots, to five sections corresponding to five types of character strokes in the Chinese writing system:

  1. Lateral
  2. Vertical
  3. Left sweep
  4. Dot/Right sweep
  5. Bend

In other words, the Wubi method divides the set of roots and the keyboard into five main categories according to the shape of the first stroke (see Figure 1). Wubi literally means "five strokes," in reference to these five categories. Each of the five categories is further divided into five levels. The resulting 25 root categories are assigned to the 25 keys A-Y on the keyboard (see Figure 2).

The user needs no more than four keystrokes to enter any character in the code chart, and the 600 most frequently used characters require only one or two keystrokes. The user must know which radicals are assigned to each key, but once the layout is memorized, the user can type quickly and accurately. In fact, skilled Wubi users can enter text at a rate of 160 characters per minute, a much higher rate than with any other Chinese input method.

Figure 1: Wubi Keyboard Layout

Graphic showing Wubi keyboard.

Each of the five stroke types is represented by a region of the keyboard: Region 1 (lateral) is represented by keys 11-15, region 2 (vertical) by keys 21-25, region 3 (left sweep) by keys 31-35, region 4 (dot/right sweep) by keys 41-45, and region 5 (bend) by keys 51-55. In general, the roots are logically placed on the keyboard to facilitate memorization. However, roots that do not fit neatly into the five patterns are often placed on the fifth key of each region. These exception keys are the most difficult to memorize.
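The region and key numbering described above can be sketched in a few lines of Python. The stroke types per region come from this article; the letters assigned to each numbered key are an assumption taken from the commonly published Wubi 86 layout, and the function names are hypothetical.

```python
# Stroke type for each keyboard region, as listed in this article.
STROKE_TYPES = {
    1: "lateral",
    2: "vertical",
    3: "left sweep",
    4: "dot/right sweep",
    5: "bend",
}

# Assumption: letters per region in the commonly published Wubi 86 layout,
# ordered from position 1 to position 5 within each region.
REGION_KEYS = {1: "GFDSA", 2: "HJKLM", 3: "TREWQ", 4: "YUIOP", 5: "NBVCX"}


def key_for_position(number: int) -> str:
    """Return the letter key for a two-digit key number such as 11 or 43."""
    region, position = divmod(number, 10)
    if region not in REGION_KEYS or not 1 <= position <= 5:
        raise ValueError(f"not a valid key number: {number}")
    return REGION_KEYS[region][position - 1]


def stroke_type_for_key(letter: str) -> str:
    """Return the stroke type of the region that a letter key belongs to."""
    if len(letter) == 1:
        for region, letters in REGION_KEYS.items():
            if letter.upper() in letters:
                return STROKE_TYPES[region]
    raise ValueError(f"{letter!r} is not one of the 25 root keys")


if __name__ == "__main__":
    for number in (11, 25, 43, 55):
        key = key_for_position(number)
        print(number, key, stroke_type_for_key(key))
```

Under this layout the 25 letters A-Y each appear exactly once, and Z is left unassigned.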

Figure 2: Root Categories in Wubi

Graphic of table showing 25 root categories in Wubi.

Features of the Wubi Input Method

One of the main advantages of Wubi and other shape-based input methods is a very low repetition rate, a feature not found in PinYin-based input systems. This means that a Wubi key sequence usually represents only one or two Chinese characters. Since a single Wubi code seldom represents more than one character, users rarely need to pick from a candidate list and can type text more quickly.
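To make the idea of a repetition rate concrete, here is a minimal Python sketch. It assumes one possible definition (the share of characters whose code is also used by at least one other character); the article does not state the exact formula. The table entries are placeholders, not real Wubi codes.

```python
from collections import Counter


def repetition_rate(code_table: dict[str, str]) -> float:
    """Fraction of characters whose input code is shared with another character.

    code_table maps each character to its input code. This definition is an
    assumption made for illustration only.
    """
    if not code_table:
        return 0.0
    code_counts = Counter(code_table.values())
    repeated = sum(1 for code in code_table.values() if code_counts[code] > 1)
    return repeated / len(code_table)


if __name__ == "__main__":
    # Placeholder characters and codes, chosen only to exercise the function.
    toy_table = {
        "char-1": "aaaa",
        "char-2": "aaab",
        "char-3": "aaab",  # shares its code with char-2
        "char-4": "abcd",
    }
    print(f"{repetition_rate(toy_table):.0%}")
```

In the toy table, two of the four characters share a code, so the rate is one half; a real shape-based table would be expected to score far lower.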

Since Wubi is a graphemic encoding system that supports the GB18030-2000 character set standard, almost all Chinese, Kanji, and Hanja characters can be encoded with it.

Each of the following features included in this release is described below.

  • GB18030-2000 character set support
  • Easy character set switching
  • New radical mechanism for Simplified and Traditional Chinese
  • Three-level progressive identification code
  • Phrase input and professional word galleries
  • Help key
  • Fault tolerance code
  • Word-phrase association
  • Properties settings

GB18030-2000 Character Set Support

The GB18030-2000 character set is a national encoding standard issued by the Chinese government in 2000, in which the encoding length is one, two, or four bytes. Of the 27,533 characters in GB18030-2000, 6,763 are standard Simplified Chinese characters, 13,053 are Traditional Chinese (Big5) characters, and 3,000 are characters used in Hong Kong.
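The one-, two-, and four-byte encoding lengths can be observed with Python's built-in `gb18030` codec. This is only an illustration: the codec is Python's implementation of the standard, not part of the Solaris input method, and the sample characters are arbitrary choices.

```python
def gb18030_length(char: str) -> int:
    """Return the number of bytes Python's gb18030 codec uses for one character."""
    return len(char.encode("gb18030"))


if __name__ == "__main__":
    samples = {
        "A": "ASCII letter",
        "\u4e2d": "common Chinese character (U+4E2D)",
        "\U00020000": "supplementary-plane ideograph (U+20000)",
    }
    for char, description in samples.items():
        print(f"{description}: {gb18030_length(char)} byte(s)")
```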

The Wubi input method supports the GB18030-2000 character set, and makes it easy to work with the smaller character sets contained in GB18030-2000, as described in "Easy Character Set Switching," below.

Easy Character Set Switching

Solaris Wangma Wubi divides the GB18030-2000 character set into smaller sets of commonly used Chinese characters:

  • GB2312 (6,763 characters)
  • GBK (21,003 characters)
  • GB18030-2000 (27,533 characters)

While typing text, users can quickly switch between character sets by using keyboard shortcuts. For example:

  • To use the GB2312 character set, press Ctrl + Shift + 1
  • To use the GBK character set, press Ctrl + Shift + 2
  • To use the GB18030-2000 character set, press Ctrl + Shift + 3

Because GB18030-2000 is a relatively new standard, support in Wubi for the GB2312 and GBK character sets ensures backward-compatibility with earlier standards. Users might prefer to work in the smaller GB2312 or GBK character sets because of improved performance and lower repetition rates.
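The effect of switching to a smaller character set can be sketched as a filter over candidate characters. The shortcut digits come from the list above; everything else is an assumption for illustration, including the use of Python's `gb2312`, `gbk`, and `gb18030` codecs as stand-ins for the three sets and the hypothetical function names.

```python
# Ctrl + Shift + <digit> selects a character set (digits from this article).
# The codec names are Python's and only approximate the sets Wubi uses.
CHARSET_SHORTCUTS = {"1": "gb2312", "2": "gbk", "3": "gb18030"}


def in_charset(char: str, codec: str) -> bool:
    """Return True if the character can be encoded in the given codec."""
    try:
        char.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def filter_candidates(candidates: list[str], shortcut_digit: str) -> list[str]:
    """Keep only the candidates that belong to the selected character set."""
    codec = CHARSET_SHORTCUTS[shortcut_digit]
    return [char for char in candidates if in_charset(char, codec)]


if __name__ == "__main__":
    # A Simplified character, a Traditional character, and a rare ideograph.
    candidates = ["\u4e2d", "\u96fb", "\U00020000"]
    for digit in "123":
        kept = filter_candidates(candidates, digit)
        print(f"Ctrl+Shift+{digit} ({CHARSET_SHORTCUTS[digit]}): {len(kept)} candidate(s)")
```

A smaller set yields fewer candidates for the same code, which is one way to read the article's point about lower repetition rates.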

New Radical Mechanism for Simplified and Traditional Chinese

The new radical (or root) mechanism is a patented technology invented by professor Wang Yongmin, who invented Wubi. It was developed from the old radical system (version 86) and has evolved into a new encoding system compatible with both Simplified and Traditional Chinese. Users of Wubi version 86 can work with three times more characters with the same encoding and typing rules without additional training.

Three-Level Progressive Identification Code

One of the main features of Wubi is the last-stroke grapheme identification code, which distinguishes characters from others with similar shapes. The identification codes are assigned according to the shape and the last radical of the character. The goal of identification codes is to help users master Wubi, and there are three levels that users can choose:

  • In level A, for advanced users, characters of all three graphemic types have identification codes (when the character has fewer than four codes).
  • In level B, for intermediate users, only the left-right shaped Chinese characters have identification codes.
  • In level C, for beginning users, identification codes are not used.
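The three levels can be sketched as a rule that decides whether an identification key is appended to a short code. This sketch rests on several assumptions not stated in the article: that the identification key is chosen by last-stroke type (region) and graphemic type (position), as in commonly published Wubi 86 descriptions; that the three graphemic types are left-right, top-bottom, and mixed; and that the key layout is the usual Wubi 86 one. The function names and sample input are hypothetical.

```python
# Assumption: Wubi 86 key layout, regions 1-5, positions 1-5 within a region.
REGION_KEYS = {1: "gfdsa", 2: "hjklm", 3: "trewq", 4: "yuiop", 5: "nbvcx"}

# Assumption: the three graphemic types, numbered as key positions 1-3.
LEFT_RIGHT, TOP_BOTTOM, MIXED = 1, 2, 3


def identification_key(last_stroke: int, structure: int) -> str:
    """Pick the key at region = last-stroke type, position = graphemic type."""
    if last_stroke not in REGION_KEYS or structure not in (LEFT_RIGHT, TOP_BOTTOM, MIXED):
        raise ValueError("last_stroke must be 1-5 and structure must be 1-3")
    return REGION_KEYS[last_stroke][structure - 1]


def code_with_identification(root_keys: str, last_stroke: int,
                             structure: int, level: str) -> str:
    """Append an identification key according to level A, B, or C."""
    if level not in ("A", "B", "C"):
        raise ValueError(f"unknown level: {level!r}")
    if len(root_keys) >= 4 or level == "C":
        return root_keys  # full-length codes and level C never get one
    if level == "B" and structure != LEFT_RIGHT:
        return root_keys  # level B covers left-right characters only
    return root_keys + identification_key(last_stroke, structure)


if __name__ == "__main__":
    # Sample input: two root keys, last stroke of type 4, left-right shape.
    for level in "ABC":
        print(level, code_with_identification("ic", 4, LEFT_RIGHT, level))
```

Under these assumptions the same two root keys yield a three-key code in levels A and B and stay unchanged in level C.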

Phrase Input and Optional Professional Word Galleries

Wubi supports phrase input, which means that not only individual characters, but entire phrases can be assigned a Wubi code. In addition to 90,000 basic phrases, there are 11 professional word galleries, similar to glossaries, for each of the following industries:

  • Traffic and transportation
  • Computer and household electronics
  • Economy and finance
  • Agriculture and machines
  • Medicine and health
  • Mining and metallurgy
  • Foreign trade and travel
  • Military affairs and national defense
  • Law and aesthetics
  • Names of places
  • Idioms

Each gallery contains between 3,000 and 20,000 entries. Users can select word galleries in the Properties dialog box. For a screenshot of this dialog box, see Figure 5.

Encoding Help Feature

The Solaris Wubi input method supports encoding hints. While users type, the character encoding appears in the Select Repetition Code Window, which helps users master the encoding methods and codes of Chinese characters. In addition, the Z key (either uppercase or lowercase) can be used at any time as a wildcard. (Z is the only letter key that has no root category assigned to it in Wubi.) Pressing the Z key queries the system for matching input codes, making it easier for users to learn and use Wubi.
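A wildcard lookup of this kind can be sketched as a pattern match over a code table. The sketch assumes that each Z stands for exactly one unknown key in the range A-Y; the article does not specify the matching rule. The table below uses placeholder codes and labels, not real Wubi data, and `wildcard_lookup` is a hypothetical name.

```python
import re


def wildcard_lookup(pattern: str, code_table: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return the entries whose code matches the pattern, where z is a wildcard.

    Assumption: each z (or Z) matches exactly one of the keys a-y.
    """
    regex_source = "".join(
        "[a-y]" if key == "z" else re.escape(key) for key in pattern.lower()
    )
    regex = re.compile(regex_source)
    return {
        code: chars for code, chars in code_table.items() if regex.fullmatch(code)
    }


if __name__ == "__main__":
    # Placeholder table: code -> candidate characters (labels, not real data).
    toy_table = {
        "ab": ["char-1"],
        "abc": ["char-2"],
        "abd": ["char-3", "char-4"],
        "xbd": ["char-5"],
    }
    for pattern in ("abZ", "zbd"):
        print(pattern, sorted(wildcard_lookup(pattern, toy_table)))
```

Listing each matching code next to its candidates is what lets a user discover the key they could not remember.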

Fault Tolerance Code

This feature, accessed in the Properties dialog box, increases the chances that the system will provide the correct character even if the user makes typing errors.

Word-Phrase Association

This feature is another productivity aid: the system provides a list of the characters that are most likely to follow the character just chosen. Instead of typing another code, the user can choose the correct character from this list. This feature is also accessed in the Properties dialog box.
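One simple way to picture word-phrase association is an index from a character to the continuations found in a phrase list. This is a sketch under stated assumptions, not the Solaris implementation: the phrase list is a tiny sample, ranking follows list order, and the function names are hypothetical.

```python
from collections import defaultdict


def build_association_index(phrases: list[str]) -> dict[str, list[str]]:
    """Map each leading character to the text that follows it in known phrases."""
    index: dict[str, list[str]] = defaultdict(list)
    for phrase in phrases:
        if len(phrase) < 2:
            continue  # a single character has no continuation
        first, rest = phrase[0], phrase[1:]
        if rest not in index[first]:
            index[first].append(rest)
    return dict(index)


def suggest(chosen_char: str, index: dict[str, list[str]]) -> list[str]:
    """Return likely continuations for the character the user just chose."""
    return index.get(chosen_char, [])


if __name__ == "__main__":
    # Assumption: phrases are listed from most to least likely.
    sample_phrases = ["中国", "中文", "中心", "国家"]
    index = build_association_index(sample_phrases)
    print(suggest("中", index))
    print(suggest("国", index))
```

The user would then pick a continuation by number instead of typing its code.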

Properties Settings

Users can control the following settings in the Properties dialog box:

  • Character sets: GB2312, GBK, or GB18030
  • Professional word galleries
  • Identification code mode
  • Display the Wubi code for a candidate
  • Display the candidates after each keystroke
  • Association of characters with phrases
  • Fault tolerance code
  • Display characters and phrases with the same code
  • Display the key prompt in the Preedit area

The following graphics show the settings in the Properties dialog box.

Figure 3: Properties Dialog Box

Screenshot of Properties dialog.

Figure 4: Character Set Properties

Screenshot of Properties dialog, showing available Wubi character sets.

Figure 5: Professional Word Galleries Properties

Screenshot of Properties dialog, showing professional word galleries.

Figure 6: Identification Code Properties

Screenshot of Properties dialog, showing the three identification code modes.

Summary

Since its inception, the Wubi input method has continued to evolve to meet users' needs while remaining compatible with earlier versions. This allows users already familiar with Wubi to use the latest version without special training.

Because of its low repetition rate, Wubi is the most efficient and widely used shape-based Chinese input method. The repetition rate is below 2% for the 6,763 Simplified Chinese characters in the GB2312-1980 character set. Even in the large GB18030-2000 character set, the repetition rate is lower in Wubi than in other input methods. If users switch to a smaller character set, the number of repeated codes decreases further and input efficiency increases.

References

For a tutorial on the Wubi input method, see http://www.people.fas.harvard.edu/~wicentow/wubixing.html.

Conceptual graphics used in this article by permission of The Book & The Computer Online Journal.

May 2003
