Introduction
Perl's support for Unicode continues to grow. The current stable version
of Perl is 5.6.1. In this release Unicode support is "incomplete, and continues
to be highly experimental" [1]. Improved Unicode support [2] is
expected with the next stable release, version 5.8 (version 5.7 is a development
cycle). However, if you're really keen you could dip into one of the development
branches, where more advanced but more experimental Unicode support can be found.
This article details what support is available for Unicode in version 5.6.1.
The sample Perl scripts provided all output UTF-8 encoded HTML. Unicode support
in the most recent browsers is good (and considerably better than in previous
generations), so upgrade to Netscape 6.2.3 [3] or MSIE 6. Viewing UTF-8
encoded web pages may require some configuration of the browser. If you see
question marks or squares in the examples below then your browser should be
upgraded or reconfigured. See our web FAQ for details if you have any issues, or view the screenshots
provided.
Perl & Solaris
With the introduction of Solaris™ 8, Perl 5.00503 (or higher) is supplied with
the operating system. It is installed at /usr/perl5/5.00503. Should you wish
to install 5.6.1, then it is advisable not to replace the default version. It
might upset existing programs expecting the older version [4].
Solaris™ 9 comes with 5.6.1 as default. 5.6.1's Unicode support is contained
in /usr/perl5/5.6.1/lib/unicode/.
The UTF-8 Pragma
The utf8 pragma, first introduced in version 5.005, tells the Perl parser to allow UTF-8
in the current lexical scope. no utf8 tells Perl to switch back
to treating the source text as literal bytes.
The utf8 pragma is only required in certain
circumstances; data is stored as UTF-8 internally by default. In the future this
pragma will be a no-op, and will only exist to facilitate compatibility with older
versions. If you wish, byte semantics can always be forced with the bytes
pragma. When it is in effect, encoding is ignored and each string is treated as a
series of bytes.
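As a minimal sketch of what the bytes pragma changes, the script below measures the same one-character string under character semantics and then under byte semantics. The variable names are illustrative and are not taken from the sample scripts; the script prints only numbers, so no output encoding needs to be configured.

```perl
use strict;
use warnings;

# chr() with a value above 255 yields a single Unicode character.
my $character = chr(9786);          # U+263A WHITE SMILING FACE

# Character semantics: length() counts characters.
my $char_length = length($character);

# Byte semantics: inside this block length() counts the bytes of the
# internal UTF-8 representation instead.
my $byte_length;
{
    use bytes;
    $byte_length = length($character);
}

print "characters: $char_length, bytes: $byte_length\n";
```

The bytes pragma is lexically scoped, so wrapping it in a bare block keeps the rest of the script on character semantics.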
For example, by default Perl will encode $character = chr(9786);
in UTF-8. Passing a number above 255 to chr() results in the character
being stored as a sequence of two or more bytes. The length of the
single character held in $character is 1 in a character context.
If, however, the length is calculated within the scope of the bytes
pragma, the result is 3, the number of bytes in the character's UTF-8 encoding. source | output.
UTF-8 Data in Source Code
There are a number of ways to insert Unicode characters into your source code,
none of which require the utf8 pragma.
By using the character's full name: source|output.
use charnames ':full';
print "\N{WHITE SMILING FACE}";
By using specific hexadecimal codepoints: source|output.
By using the chr() function to generate a single character: source|output.
my $ctr = chr(0x263A); #chr(9786) is the same.
To generate a string you can use the pack function: source|output.
my $str1 = pack("U*", 0x004F, 0x00E0, 0x264F, 0x0407);
Or you could read in from a file containing UTF-8 text.
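The in-source methods listed above can be compared side by side. The following is only a sketch: it builds WHITE SMILING FACE four ways, including the \x{...} hexadecimal escape, and checks that the results are the same character. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use charnames ':full';

my $by_name = "\N{WHITE SMILING FACE}";   # full character name
my $by_hex  = "\x{263A}";                 # hexadecimal code point escape
my $by_chr  = chr(0x263A);                # chr() with a numeric code point
my $by_pack = pack("U*", 0x263A);         # pack() with the U template

# All four strings should hold the same single character.
my $all_same = ($by_name eq $by_hex
             && $by_hex  eq $by_chr
             && $by_chr  eq $by_pack) ? 1 : 0;

printf "same character: %s, code point: U+%04X\n",
    ($all_same ? "yes" : "no"), ord($by_hex);
```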
Identifiers
Variable names can use any Unicode character, as long as its "General Category" (from the Unicode Character Database) is of type letter or number.
However, this is one of the situations where the utf8 pragma must be
explicitly used. It may be inconvenient for others reading or editing
your source if you decide to use such characters. source|output. [When viewing the source set the encoding to "Unicode"
in the browser.]
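A small sketch of a non-ASCII identifier follows. It assumes the script file itself is saved as UTF-8, and the variable names are hypothetical; the point is only that the utf8 pragma has to be in effect before the parser will accept the name.

```perl
use strict;
use warnings;
use utf8;   # required: the identifier below contains a non-ASCII letter

# "é" (U+00E9) is a letter, so it is allowed in a variable name.
my $café_count = 3;
$café_count++;

# Copy to an ASCII-named variable for code that cannot type the name.
my $result = $café_count;

print "count: $result\n";
```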
String Operations
Increasingly, Perl works with characters rather than bytes. In future versions
of Perl, character semantics will be the default, and the utf8 pragma will be
redundant. Most operators that deal with string positions or lengths default
to character semantics, including chop(), rindex(),
index(), write(), pos(), substr(),
sprintf(), length(), ord() and scalar
reverse(). The following examples demonstrate the difference when some
of these functions are forced into a byte context. source|output.
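As a rough illustration of character semantics in these functions, the sketch below reuses the four-character string from the pack example and only switches to byte semantics for one length() call. The variable names are illustrative and not from the sample scripts.

```perl
use strict;
use warnings;

# Four characters: U+004F, U+00E0, U+264F and U+0407.
my $str = pack("U*", 0x004F, 0x00E0, 0x264F, 0x0407);

my $chars    = length($str);                 # counts characters
my $position = index($str, chr(0x264F));     # character offset, counted from 0
my $second   = ord(substr($str, 1, 1));      # code point of the 2nd character
my $reversed = reverse $str;                 # scalar context reverses characters
my $first_of_reversed = ord($reversed);      # ord() looks at the first character

# The same string measured in bytes of its internal UTF-8 form.
my $bytes;
{
    use bytes;
    $bytes = length($str);
}

print "characters: $chars, bytes: $bytes, index: $position\n";
```

The byte count is larger because the last three characters need two, three and two bytes respectively in UTF-8.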
Regular Expressions
With character semantics enforced via the utf8 pragma, regular expressions
match on characters instead of bytes.
To match a single specific Unicode character use \x{ABCD}, where ABCD is a
hexadecimal number. Don't forget the curly braces. m/\x{263A}/ matches the Unicode
WHITE SMILING FACE character. The value in the braces is always read as hexadecimal, so
the decimal code point 9786 must be written as 263A; m/\x{9786}/ would match U+9786, a different character.
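The value inside \x{...} is read as hexadecimal, which the following sketch checks directly. It also shows one way to match a code point that is only known in decimal, by building the character with chr() and interpolating it. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $smiley = chr(0x263A);      # the same character as chr(9786)

# \x{263A} names the character by its hexadecimal code point.
my $hex_matches = ($smiley =~ m/\x{263A}/) ? 1 : 0;

# \x{9786} is U+9786, a different character, so this does not match.
my $decimal_digits_match = ($smiley =~ m/\x{9786}/) ? 1 : 0;

# A decimal code point can be turned into a character and interpolated.
my $from_decimal = chr(9786);
my $interpolated_matches = ($smiley =~ m/\Q$from_decimal\E/) ? 1 : 0;

print "hex: $hex_matches, decimal digits: $decimal_digits_match, ",
      "interpolated: $interpolated_matches\n";
```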
Under utf8 control, character classes are interpreted differently.
\w can be used to match an ideograph (when utf8 is
in effect), but will obviously match a lot more characters besides. \C
matches a single C char (octet) even under utf8 control.
If you only want to match, say, ideographs, then Perl provides the new \p{}
(matches property) and \P{} (doesn't match property) constructs.
In this instance we would use InCJKUnifiedIdeographs in the curly
braces (i.e. is the character in the CJK Unified Ideographs block of the Unicode Character
Database?). Similarly there are also InBasicLatin, InHebrew,
InThai, InKhmer etc. Look in /usr/perl5/5.6.1/lib/unicode/In/
for a full listing of the predefined "In" character classes. Single-letter
properties work without braces, so to match Mark characters
you can use \p{M} or \pM.
#to match an ideograph
m/\p{InCJKUnifiedIdeographs}/
#to NOT match an ideograph
m/\P{InCJKUnifiedIdeographs}/
#To NOT match a Cyrillic character
m/\P{InCyrillic}/
#etc.....
You can also ask questions like "Does the character have the General Category
value: Lu?". Similar to above you would use the match m/\p{IsLu}/.
Look in /usr/perl5/5.6.1/lib/unicode/Is/ for a full listing of
the predefined "Is" character classes.
#to match a character with a General Category of "Lu"
m/\p{IsLu}/
#to match a character without a General Category of "Lu"
m/\P{IsLu}/
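The property constructs can be tried on individual characters. The sketch below uses property names mentioned in this article (IsLu, InHebrew and the single-letter M); how a given name is spelled and resolved may differ between Perl releases, so treat the names as assumptions to check against your own installation. The variable names are illustrative.

```perl
use strict;
use warnings;

my $alef = chr(0x05D0);   # HEBREW LETTER ALEF
my $ring = chr(0x030A);   # COMBINING RING ABOVE

# General Category tests with the "Is" prefix.
my $upper_is_lu  = ("A" =~ m/^\p{IsLu}$/) ? 1 : 0;   # "A" is an uppercase letter
my $lower_not_lu = ("a" =~ m/^\P{IsLu}$/) ? 1 : 0;   # "a" is not

# Block tests with the "In" prefix.
my $alef_in_hebrew   = ($alef =~ m/^\p{InHebrew}$/) ? 1 : 0;
my $latin_not_hebrew = ("A"   =~ m/^\P{InHebrew}$/) ? 1 : 0;

# Single-letter property written without braces.
my $ring_is_mark = ($ring =~ m/^\pM$/) ? 1 : 0;

print "Lu: $upper_is_lu $lower_not_lu, ",
      "Hebrew: $alef_in_hebrew $latin_not_hebrew, Mark: $ring_is_mark\n";
```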
Perl uses the tables in /usr/perl5/5.6.1/lib/unicode/Is/ and /usr/perl5/5.6.1/lib/unicode/In/
for the property matching. These tables are generated by /usr/perl5/5.6.1/lib/unicode/mktables.PL
which itself uses key Unicode files like UnicodeData.txt [5] and Blocks.txt.
\X matches any base character followed by zero or more combining characters.
That is, it matches any combining character sequence. For example, the Angstrom
character can be composed of LATIN CAPITAL LETTER A (U+0041), followed by the
combining character COMBINING RING ABOVE (U+030A).
#To match a combining character sequence
m/\X/
#Note that the above is the same as
(?:\PM\pM*)
#which matches a sequence composed of a non-mark character followed by
#zero or more mark characters.
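To see \X at work, the sketch below splits a short string into combining character sequences, once with \X and once with the spelled-out pattern from the comment above. The string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# "A" followed by COMBINING RING ABOVE, then a plain "B".
my $text = "A" . chr(0x030A) . "B";

my @sequences = $text =~ m/(\X)/g;        # one element per combining sequence
my @manual    = $text =~ m/(\PM\pM*)/g;   # the equivalent written out by hand

printf "characters: %d, sequences: %d, manual: %d\n",
    length($text), scalar(@sequences), scalar(@manual);
```

The string holds three characters but only two sequences, because the ring attaches to the preceding "A".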
Various examples of regular expressions are provided: source|output.
Note: Regular expressions are far from perfect in 5.6.1. More complex regular expressions simply don't work. Release 5.8.0 is
expected to address this.
The Future
5.6.1 supports Unicode 3.0.1. The next stable release, 5.8.0, due very soon,
will support Unicode 3.2.0. There will also be I/O support for converting to
Perl's encoding on input or from Perl's encoding on output. You will also be
able to specify the encoding of your Perl source by using the encoding pragma.
The utf8 pragma will only be required when you want UTF-8 in your source,
as you might for identifiers or some regular expressions.
References
[1] perldoc perlunicode
[2] Unicode FAQ: What Unicode conformance requires.
[3] You could also try Opera.
[4] perldoc perlsolaris
[5] The file is not necessarily called UnicodeData.txt. On Solaris™ 9 FCS with
Perl 5.6.1 the file is Unicode.301 (3.0.1 being the version of Unicode supported
by 5.6.1).