
Unicode Support in Perl 5.6.1

 

Introduction

Perl's support for Unicode continues to grow. The current stable version of Perl is 5.6.1. In this release Unicode support is "incomplete, and continues to be highly experimental" [1]. Improved Unicode support [2] is expected with the next stable release, version 5.8 (version 5.7 is a development cycle). However, if you're really keen, you could dip into one of the development branches, where more advanced but more experimental Unicode support can be found.

This article details what support is available for Unicode in version 5.6.1.

The sample Perl scripts provided all output UTF-8 encoded HTML. Unicode support in the most recent browsers is good (and considerably better than in previous generations), so upgrade to Netscape 6.2.3 [3] or MSIE 6. Viewing UTF-8 encoded web pages may require some configuration of the browser. If you see question marks or squares in the examples below, then your browser should be upgraded or reconfigured. See our web FAQ for details if you have any issues, or view the screenshots provided.

Perl & Solaris

With the introduction of Solaris™ 8, Perl 5.00503 (or higher) is supplied with the operating system. It is installed at /usr/perl5/5.00503. Should you wish to install 5.6.1, it is advisable not to replace the default version, since doing so might upset existing programs that expect the older version [4]. Solaris™ 9 comes with 5.6.1 as the default. 5.6.1's Unicode support is contained in /usr/perl5/5.6.1/lib/unicode/.

The UTF-8 Pragma

The utf8 pragma, first introduced in version 5.005, tells the Perl parser to allow UTF-8 in the current lexical scope. no utf8 tells Perl to switch back to treating the source text as literal bytes.
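
As a rough illustration of that lexical scoping, the sketch below measures the same literal inside and outside a `use utf8` block. It assumes the script itself is saved as UTF-8 and is run on a Perl where character semantics have settled (5.8 or later); the variable names are illustrative only.

```perl
use strict;
use warnings;

# Outside "use utf8" the parser treats the literal as raw bytes.
my $as_bytes = "☺";

my $as_chars;
{
    use utf8;            # UTF-8 source text is allowed in this block only
    $as_chars = "☺";     # parsed as the single character U+263A
}

print "without utf8: ", length($as_bytes), "\n";
print "with utf8:    ", length($as_chars), "\n";
```

Under those assumptions the first length is 3 (the bytes of the UTF-8 encoding) and the second is 1.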

The utf8 pragma is only required in certain circumstances; data is stored as UTF-8 internally by default. In the future this pragma will be a no-op, and will only exist to facilitate compatibility with older versions. If you wish, byte semantics can always be forced with the bytes pragma. When it is in effect, the encoding is ignored and each string is treated as a series of bytes.

For example, by default Perl will encode $character = chr(9786); in UTF-8. Passing a number above 255 to chr() results in the character being stored as a sequence of two or more bytes. The length of the single character held in $character is 1, in a character context that is. If, however, the length is calculated within the scope of the bytes pragma, then the length is not 1 but 3, the number of bytes in the character's UTF-8 encoding. source | output.
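
The comparison just described can be sketched as follows. This is a minimal example rather than the article's own sample script; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $character = chr(9786);            # U+263A WHITE SMILING FACE

# Character semantics: one character.
my $char_len = length($character);

# Byte semantics, limited to this block by the lexically scoped pragma.
my $byte_len;
{
    use bytes;
    $byte_len = length($character);
}

print "characters: $char_len, bytes: $byte_len\n";
```

Because `use bytes` is lexically scoped, wrapping it in a bare block keeps the rest of the script on character semantics.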

UTF-8 Data in Source Code

There are a number of ways to insert Unicode characters into your source code, none of which require the utf8 pragma.

  • By using the character's full name: source|output.

      use charnames ':full';
      print "\N{WHITE SMILING FACE}";

  • By using specific hexadecimal codepoints: source|output.

      my $ctr = "\x{263A}";

  • By using the chr() function to generate a single character: source|output.

      my $ctr = chr(0x263A); #chr(9786) is the same.

  • To generate a string you can use the pack function: source|output.

      my $str1 = pack("U*", 0x004F, 0x00E0, 0x264F, 0x0407);

  • Or you could read in from a file containing UTF-8 text.
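
To show that the first four approaches in the list produce the same character, here is a small sketch that builds U+263A each way and reports its code point and length. The variable names are illustrative, and the script is written against a current Perl rather than tested on 5.6.1.

```perl
use strict;
use warnings;
use charnames ':full';

my $by_name = "\N{WHITE SMILING FACE}";   # full character name
my $by_hex  = "\x{263A}";                 # hexadecimal code point
my $by_chr  = chr(0x263A);                # chr() with a number
my $by_pack = pack("U*", 0x263A);         # pack() with the U template

for my $s ($by_name, $by_hex, $by_chr, $by_pack) {
    printf "U+%04X, length %d\n", ord($s), length($s);
}
```

None of these forms needs `use utf8`, because the source file itself contains only ASCII.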

    Identifiers

    Variable names can use any Unicode character, as long as its "General Category" (from the Unicode Character Database) is of type letter or number.

    However this is one of the situations where the utf8 pragma must be explicitly used. Obviously it may be inconvenient for others reading/editing your source if you decide to use such characters. source|output. [When viewing the source set the encoding to "Unicode" in the browser.]
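
A minimal sketch of such an identifier is shown below. It assumes the script is saved as UTF-8; the variable names $größe and $summe are invented for this example.

```perl
use strict;
use warnings;
use utf8;    # required here: the identifier below contains a non-ASCII letter

my $größe = 42;             # "ö" and "ß" both have a letter General Category
my $summe = $größe + 1;

print "summe = $summe\n";
```

Removing the `use utf8` line should make the parser reject the variable name, which is the situation the text above refers to.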

    String Operations

    Increasingly, Perl works with characters rather than bytes. In future versions of Perl, character semantics will be the default, and the utf8 pragma will be redundant. Most operators that deal with string positions or lengths default to character semantics, for example: chop(), rindex(), index(), write(), pos(), substr(), sprintf(), length(), ord(), scalar reverse(). The following examples demonstrate the difference when some of these functions are forced into a byte context. source|output.
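
As a hedged sketch of that difference (not the article's own sample), the code below applies length() and ord() to the same string under character semantics and again under the bytes pragma. The string and variable names are chosen purely for illustration.

```perl
use strict;
use warnings;

my $str = chr(0x263A) . "abc";    # smiley followed by three ASCII letters

# Character semantics (the default).
my $char_len   = length($str);
my $char_first = ord($str);

# Byte semantics, confined to this block.
my ($byte_len, $byte_first);
{
    use bytes;
    $byte_len   = length($str);
    $byte_first = ord($str);
}

printf "characters: length %d, first U+%04X\n", $char_len, $char_first;
printf "bytes:      length %d, first 0x%02X\n", $byte_len, $byte_first;
```

In character terms the string has 4 characters starting with U+263A; in byte terms it has 6 bytes starting with 0xE2, the lead byte of the smiley's UTF-8 encoding.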

    Regular Expressions

    With character semantics enforced via the utf8 pragma, regular expressions match on characters instead of bytes.

    To match a single specific Unicode character use \x{ABCD}, where ABCD is a hexadecimal number. Don't forget the curly braces. m/\x{263A}/ matches the Unicode WHITE SMILING FACE character. The number inside the braces is always read as hexadecimal: 9786 is the decimal value of hex 263A, but m/\x{9786}/ would match U+9786, a different character.
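
The following sketch shows a \x{...} match in practice. Since the braces always hold a hexadecimal number, it also checks that the decimal digits 9786 do not select the same character. The names used are illustrative.

```perl
use strict;
use warnings;

my $text = "smile: " . chr(0x263A);

# \x{263A} is a hexadecimal code point, so this matches the smiley.
my $matches_hex = ($text =~ /\x{263A}/) ? 1 : 0;

# \x{9786} is also read as hexadecimal (U+9786), so it does not match.
my $matches_decimal_digits = ($text =~ /\x{9786}/) ? 1 : 0;

printf "hex 263A is decimal %d\n", 0x263A;
print  "\\x{263A} matched: $matches_hex\n";
print  "\\x{9786} matched: $matches_decimal_digits\n";
```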

    Under utf8 control, character classes are interpreted differently. \w can be used to match an ideograph (when utf8 is in effect), but will obviously match a lot more characters besides. \C matches a single C char (octet) even under utf8 control.

    m/\w/

     

    If you only want to match, say, ideographs, then Perl provides the new \p{} (matches property) and \P{} (doesn't match property) constructs. In this instance we would use "InCJKUnifiedIdeographs" in the curly braces (i.e. is the character in the CJK Unified Ideographs block of the Unicode Character Database?). Similarly there are also InBasicLatin, InHebrew, InThai, InKhmer, etc. Look in /usr/perl5/5.6.1/lib/unicode/In/ for a full listing of the predefined "In" character classes. Single-letter properties work without braces, so to match Mark characters you can use \p{M} or \pM.

    #to match an ideograph
    m/\p{InCJKUnifiedIdeographs}/
    #to NOT match an ideograph
    m/\P{InCJKUnifiedIdeographs}/
    #To NOT match a Cyrillic character
    m/\P{InCyrillic}/
    #etc.....
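
A small runnable sketch of these block matches follows. The sample characters (U+4E2D, U+0416 and "A") are chosen for illustration, and the script is written against a current Perl, which still accepts these "In" names.

```perl
use strict;
use warnings;

my $ideograph = chr(0x4E2D);    # a CJK unified ideograph
my $cyrillic  = chr(0x0416);    # CYRILLIC CAPITAL LETTER ZHE
my $latin     = "A";

# \p{...} succeeds when the character is in the named block.
my $ideograph_in_cjk = ($ideograph =~ /\p{InCJKUnifiedIdeographs}/) ? 1 : 0;
my $cyrillic_in_cyr  = ($cyrillic  =~ /\p{InCyrillic}/)             ? 1 : 0;

# \P{...} succeeds when the character is NOT in the named block.
my $latin_not_cjk    = ($latin     =~ /\P{InCJKUnifiedIdeographs}/) ? 1 : 0;
my $latin_not_cyr    = ($latin     =~ /\P{InCyrillic}/)             ? 1 : 0;

print "ideograph in CJK block:   $ideograph_in_cjk\n";
print "zhe in Cyrillic block:    $cyrillic_in_cyr\n";
print "A outside CJK block:      $latin_not_cjk\n";
print "A outside Cyrillic block: $latin_not_cyr\n";
```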


     

     

     

    You can also ask questions like "Does the character have the General Category value: Lu?". Similar to above you would use the match m/\p{IsLu}/. Look in /usr/perl5/5.6.1/lib/unicode/Is/ for a full listing of the predefined "Is" character classes.

    #to match a character with a General Category of "Lu"
    m/\p{IsLu}/
    #to match a character without a General Category of "Lu"
    m/\P{IsLu}/
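
As an illustration of the "Is" classes, this sketch counts the uppercase letters (General Category Lu) in a short mixed-script string. The string is made up for the example.

```perl
use strict;
use warnings;

# "Perl " followed by Cyrillic capital ZHE and small zhe.
my $str = "Perl " . chr(0x0416) . chr(0x0436);

# Count characters whose General Category is Lu, and those whose is not.
my $upper_count = () = $str =~ /\p{IsLu}/g;
my $other_count = () = $str =~ /\P{IsLu}/g;

print "uppercase letters: $upper_count\n";
print "everything else:   $other_count\n";
```

Here "P" and the Cyrillic capital are counted as Lu; the remaining five characters, including the space, are not.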

     

     

     

    Perl uses the tables in /usr/perl5/5.6.1/lib/unicode/Is/ and /usr/perl5/5.6.1/lib/unicode/In/ for the property matching. These tables are generated by /usr/perl5/5.6.1/lib/unicode/mktables.PL, which itself uses key Unicode files like UnicodeData.txt [5] and Blocks.txt.

    \X matches a base character followed by any number of combining characters. That is, it matches any combining character sequence. For example, the Angstrom character can be composed of LATIN CAPITAL LETTER A (U+0041) followed by the combining character COMBINING RING ABOVE (U+030A).

    #To match a combining character sequence
    m/\X/
    #Note that the above is the same as
    m/(?:\PM\pM*)/
    #which matches a non-mark character followed by zero or more mark characters.
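
The sketch below splits a string into combining character sequences with \X and with the longer \PM\pM* form, using the Angstrom composition mentioned above. Names are illustrative. Current Perls define \X in terms of extended grapheme clusters, which gives the same answer for this simple input.

```perl
use strict;
use warnings;

# "x", then A + COMBINING RING ABOVE, then "y": four characters in total.
my $angstrom = "A" . chr(0x030A);
my $text     = "x" . $angstrom . "y";

my @by_x      = $text =~ /(\X)/g;           # combining character sequences
my @by_marks  = $text =~ /(\PM\pM*)/g;      # non-mark plus trailing marks

printf "characters: %d\n", length($text);
printf "sequences via \\X: %d\n", scalar @by_x;
printf "sequences via \\PM\\pM*: %d\n", scalar @by_marks;
```

The string holds 4 characters but only 3 sequences, because the letter A and its combining ring are kept together.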

     

     

     

     

    Various examples of regular expressions are provided: source|output.

    Note: Regular expressions are far from perfect in 5.6.1. More complex regular expressions simply don't work. Release 5.8.0 is expected to address this.

     

    The Future

    5.6.1 supports Unicode 3.0.1. The next stable release, 5.8.0, which is due very soon, will support Unicode 3.2.0. There will also be I/O support for converting to Perl's encoding on input or from Perl's encoding on output. You will also be able to specify the encoding of your Perl source by using the encoding pragma. The utf8 pragma will only be required when you want UTF-8 in your source, as you might for identifiers or some regular expressions.
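
In Perl 5.8 and later, this I/O support takes the form of PerlIO layers. The sketch below is not valid on 5.6.1; it assumes a 5.8 or newer Perl and uses an in-memory filehandle (the $buffer variable is just a stand-in for a real file) to show encoding on output and decoding on input.

```perl
use strict;
use warnings;

my $smiley = chr(0x263A);
my $buffer = '';

# Output: characters are converted to UTF-8 bytes by the layer.
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write failed: $!";
print {$out} $smiley;
close($out) or die "close failed: $!";

# Input: UTF-8 bytes are converted back to characters by the layer.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read failed: $!";
my $decoded = <$in>;
close($in);

printf "bytes written: %d\n", length($buffer);
printf "characters read: %d\n", length($decoded);
```

With a real file, the same layer names are passed in the mode argument of open(), or applied afterwards with binmode().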

    References

    1. perldoc perlunicode (also available online).
    2. Unicode FAQ: What Unicode conformance requires.
    3. You could also try Opera.
    4. perldoc perlsolaris (also available online).
    5. The file is not necessarily called UnicodeData.txt. On Solaris™ 9 FCS with Perl 5.6.1 the file is Unicode.301 (3.0.1 being the version of Unicode supported by 5.6.1).