Introduction
Perl's support for Unicode continues to grow. The current stable version of Perl is 5.6.1. In this release Unicode support is "incomplete, and continues to be highly experimental" [1]. Improved Unicode support [2] is expected with the next stable release, version 5.8 (version 5.7 is a development cycle). However, if you're really keen, you could dip into one of the development branches, where more advanced but more experimental Unicode support can be found.
This article details what support is available for Unicode in version 5.6.1. The sample Perl scripts provided all output UTF-8 encoded HTML. Unicode support in the most recent browsers is good (and considerably better than in previous generations), so upgrade to Netscape 6.2.3 [3] or MSIE 6. Viewing UTF-8 encoded web pages may require some configuration of the browser. If you see question marks or squares in the examples below, then your browser should be upgraded or reconfigured. See our web FAQ for details if you have any issues, or view the screenshots provided.
Perl & Solaris
With the introduction of Solaris™ 8, Perl 5.00503 (or higher) is supplied with the operating system. It is installed at /usr/perl5/5.00503. Should you wish to install 5.6.1, it is advisable not to replace the default version, as doing so might upset existing programs expecting the older version [4]. Solaris™ 9 comes with 5.6.1 as default. 5.6.1's Unicode support is contained in /usr/perl5/5.6.1/lib/unicode/.
The UTF-8 Pragma
The utf8 pragma is only required in certain circumstances; data is stored as UTF-8 internally by default. In the future this pragma will be a no-op, and will only exist to facilitate compatibility with older versions. If you wish, byte semantics can always be forced.
For example, by default Perl will encode
UTF-8 Data in Source Code
There are a number of ways to insert Unicode characters into your source code, none of which require the
chr() function to generate a single character: source|output.
pack function: source|output.
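As a minimal sketch of the two functions listed above, the following builds the same character with chr() and with pack(), then compares its length in characters and in bytes. The code point U+263A and the variable names are illustrative choices, not taken from the original sample scripts. The bytes pragma used for the byte count is standard Perl, but whether the original text named it is an assumption. The sketch was written against a current Perl, so behaviour under 5.6.1 may differ in detail.

```perl
use strict;
use warnings;

# Two ways to build U+263A (WHITE SMILING FACE) without typing it literally.
my $from_chr  = chr(0x263A);          # a single character from its code point
my $from_pack = pack("U", 0x263A);    # the "U" template packs a Unicode code point

my $same       = ($from_chr eq $from_pack) ? "yes" : "no";
my $char_count = length($from_chr);   # character semantics

my $byte_count;
{
    use bytes;                        # lexically scoped: byte semantics in this block only
    $byte_count = length($from_chr);  # counts the bytes of the UTF-8 encoding
}

# Only ASCII is printed, so no output encoding needs to be configured.
printf "code point: U+%04X\n", ord($from_chr);
print  "chr and pack agree: $same\n";
print  "length in characters: $char_count\n";
print  "length in bytes: $byte_count\n";
```

The character count is 1 and the byte count is 3, because U+263A takes three bytes in UTF-8.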
Identifiers
Variable names can use any Unicode character, as long as the General Category (from the Unicode Character Database) is of type letter or number. However, this is one of the situations where the
String Operations
Increasingly, Perl works with characters rather than bytes. In future versions of Perl, character semantics will be the default, and the utf8 pragma will be redundant. Most operators that deal with string positions, lengths and so on default to character semantics, such as:
Regular Expressions
With character semantics enforced via the utf8 pragma, regular expressions match on characters instead of bytes. To match a single specific Unicode character use
Under
If you only want to match, say, ideographs, then Perl provides the new
You can also ask questions like "Does the character have the General Category value Lu?". Similar to above, you would use the match
Perl uses the tables in
Various examples of regular expressions are provided: source|output. Note: Regular expressions are far from perfect in 5.6.1. More complex regular expressions simply don't work. Release 5.8.0 is expected to address this.
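The kinds of match described in this section can be sketched as follows: one specific character by code point, characters with General Category Lu, and ideographs. The property names \p{Lu} and \p{Han} are an assumption here. They are the spellings accepted by later Perl releases, and 5.6.1 may spell them differently and may need the utf8 pragma in scope. The test string and variable names are illustrative only.

```perl
use strict;
use warnings;

# Test string: "A", WHITE SMILING FACE, "b", and one CJK ideograph (U+4E2D).
my $text = "A" . chr(0x263A) . "b" . chr(0x4E2D);

# Match one specific character by its code point.
my $has_smiley = ($text =~ /\x{263A}/) ? 1 : 0;

# Count characters whose General Category is Lu (uppercase letter).
my $upper = () = $text =~ /\p{Lu}/g;

# Count ideographs, here taken to mean characters in the Han script.
my $han = () = $text =~ /\p{Han}/g;

print "smiley found: $has_smiley, uppercase letters: $upper, ideographs: $han\n";
```

Each count is 1: only "A" is an uppercase letter, and only U+4E2D is a Han character.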
The Future
5.6.1 supports Unicode 3.0.1. The next stable release, 5.8.0, due very soon, will support Unicode 3.2.0. There will also be I/O support for converting to Perl's encoding on input or from Perl's encoding on output. You will also be able to specify the encoding of your Perl source.
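As a forward-looking sketch, in Perl 5.8 and later the I/O conversion described above is available through PerlIO layers such as :encoding(UTF-8), given as part of the open mode. This does not run on 5.6.1. The temporary file from File::Temp and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file, removed automatically when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $smiley = chr(0x263A);    # one character, U+263A

# On output, the layer converts Perl's internal characters to UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $smiley;
close $out or die "Cannot close $path: $!";

my $size_on_disk = -s $path;    # file size in bytes

# On input, the layer decodes the UTF-8 bytes back into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $read_back = <$in>;
close $in;

my $chars = length $read_back;
print "bytes on disk: $size_on_disk, characters read back: $chars\n";
```

The file holds three bytes, which read back as a single character.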