FAQs

Encodings

NOTE: This material does not necessarily refer to the most recent version of Java. For the most recent FAQs, see here.
1. General Character Encoding
2. Latin Language Charset
  • 2.1 I'd like to know if there's a way to make Java understand Latin language characters. I'm having problems reading the word "água" (water, in Portuguese) from a text file. Can you help me with this?
3. Code Pages (CP)

General Character Encoding
Q:1.1 Is the list of character encodings on the web site http://java.sun.com/products/jdk/1.1/intl/html/intlspec.doc7.html up to date? If not, where can I get the latest list of character encodings?

See http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html.


Q:1.2 Is it possible to programmatically return the complete list of available code pages?

This is not possible in any current release, but the capability has been added to the release now in development. Look for it in the Java 1.4 beta, which is expected early in the new year.
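For readers already on Java 1.4 or later, a listing along these lines might look like the sketch below. It assumes the java.nio.charset.Charset class from the New I/O APIs, which the answer above does not name, and it uses generics, which need Java 5 or later. The class name ListCharsets and the method availableNames are purely illustrative.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Illustrative class name; not part of any Sun API.
public class ListCharsets {

    // Returns the canonical names of every charset this JVM can use, in sorted order.
    static List<String> availableNames() {
        // availableCharsets() returns a SortedMap keyed by canonical charset name.
        return new ArrayList<String>(Charset.availableCharsets().keySet());
    }

    public static void main(String[] args) {
        for (String name : availableNames()) {
            System.out.println(name);
        }
    }
}
```

The names printed depend on the JVM and on which optional charsets are installed, so the list should be treated as a runtime fact rather than something to hard-code.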


Q:1.3 How do I compile a Java file in the utf-8 encoding?

With Sun's Java compiler, you need to specify the encoding of the source file. Try using:

javac -encoding UTF-8 MyFile.java

I'm not sure whether other compilers support the -encoding flag. If they don't, you can always run the native2ascii tool as a preprocessor on your source files.


Q:1.4 Are there any plans to create some kind of class that encapsulates character encodings?

Work is currently underway to define public character converter APIs as part of JSR 51, New I/O APIs. See "New I/O APIs for the Java(TM) Platform" for more information.


Q:1.5 Can you tell me where the system property called "file.encoding" gets set? It is being set to "646" in JVM 1.2 on Solaris 7, and a third-party server reports that this is an invalid charset. Can you help me with this?

My guess is that you are running your application in the C/POSIX locale. On Solaris 7, the system call nl_langinfo(CODESET) returns "646" when the user locale is set to C/POSIX. We do have an alias mapping table that maps "646" to "ASCII", which is a valid charset name, but that table is in the sun.io package, which I don't recommend using directly. I think setting the locale to en_US should solve the problem.
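To see which value a given JVM actually picked up, something like the following sketch may help. It assumes Java 1.4 or later for java.nio.charset.Charset, which postdates the JVM 1.2 mentioned in the question. The class name DefaultEncodingCheck and the method isKnownCharset are illustrative only.

```java
import java.nio.charset.Charset;

// Illustrative class name; not part of any Sun API.
public class DefaultEncodingCheck {

    // Returns true if this JVM recognizes the name as a charset or charset alias.
    static boolean isKnownCharset(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException (a subclass) signals a syntactically illegal name.
            return false;
        }
    }

    public static void main(String[] args) {
        String encoding = System.getProperty("file.encoding");
        System.out.println("file.encoding = " + encoding);
        System.out.println("recognized by this JVM: " + isKnownCharset(encoding));
    }
}
```

Running it once in the C/POSIX locale and once with the locale set to en_US should show whether the locale is what drives the reported value.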


Latin Language Charset
Q:2.1 I'd like to know if there's a way to make Java understand Latin language characters. I'm having problems reading the word "água" (water, in Portuguese) from a text file. Can you help me with this?

Current versions of Java should have no problem with these characters. Check that you are using Java version 1.1 or newer; Java 1.2.2 or 1.3 would be best. You must also make sure that the font you use to display these characters contains the glyphs you need. The Lucida Sans font, which comes with the J2SDK version 1.2 or later, contains these glyphs. You can use this font by creating the following font object:

import java.awt.Font;
Font f = new Font("Lucida Sans", Font.PLAIN, 18);

If you are reading the data from a text file, you must also take care that the text is decoded correctly. Java represents all text as Unicode, so either your text file must be encoded in Unicode or your program must convert it to Unicode before treating the data as text. Look at the InputStreamReader class to see how to convert text files into Unicode as they are read.
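A minimal sketch of that conversion follows. It assumes the file is encoded in ISO-8859-1 (Latin-1), which the question does not state, and it uses an in-memory byte array in place of a real file so that the example is self-contained. The class name ReadLatin1 and the method readFirstLine are illustrative.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Illustrative class name; not part of any Sun API.
public class ReadLatin1 {

    // Decodes the first line of the stream using the named encoding.
    static String readFirstLine(InputStream in, String encoding) throws IOException {
        // InputStreamReader converts bytes in the given encoding to Unicode characters.
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, encoding));
        try {
            return reader.readLine();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Byte 0xE1 is U+00E1 (a with acute accent) in ISO-8859-1.
        // These bytes stand in for the contents of a text file.
        byte[] data = { (byte) 0xE1, 'g', 'u', 'a', '\n' };
        String word = readFirstLine(new ByteArrayInputStream(data), "ISO-8859-1");
        // Compare against the expected word written with a Unicode escape.
        System.out.println(word.equals("\u00e1gua"));
    }
}
```

For a real file, a java.io.FileInputStream would take the place of the ByteArrayInputStream. The encoding name passed in must match how the file was actually saved.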


Code Pages (CP)
Q:3.1 Can someone tell me if the Java character set encoding "cp285" is the one to use to support EBCDIC UK 00285?

Our documentation at http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html says it's "IBM United Kingdom, Ireland", and looking at the mapping table I'd say it's EBCDIC (it's certainly not ISO 646 / ASCII based).


Q:3.2 Why do the debug classes of JDK 1.2.2 not include "Cp850"? Whenever I run my application in debug mode, it crashes because of this, but it runs perfectly when not in debug mode. Is there a way to add "Cp850" text encoding and decoding to the debug classes?

Are you using Sun's Java 2 SDK itself to debug your application, or are you using a third-party IDE? IDEs often support a different set of encodings than the Java 2 SDK. As far as I know, the Java 2 SDK uses the same class files whether you run in debug or non-debug mode.


Q:3.3 I am working on a Java application which requires translation from byte[] to Unicode characters. My application uses the number required by a Windows environment (e.g. '932') to specify the code page. I am only given the number, say '932'. Can you tell me how to cover all of the different code pages and obtain a corresponding mapping from the number to the code page name used by Java, e.g. 'ms932'?

The reason that '932' isn't a good enough name is that both IBM and Microsoft have code pages called '932' -- and they are not the same. In Java, the convention is that Microsoft versions are called "MS932" and IBM versions are called "CP932". Your application needs to tell us which one to use so that we can get the conversion right.

Your program is going to have to decide how to do the mapping based on what you know about the user. Maybe you know that the user is using Microsoft Windows instead of IBM OS/2 for example. Maybe you can guess which is more likely.
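One way such a decision could be coded is sketched below, for Java 5 or later. The class CodePageNames and its methods are hypothetical, and the fallback from an "MS" name to a "Cp" name for Microsoft code pages is an assumption made here because some Windows code pages (Cp1252, for example) have no MS-prefixed Java name. Charset.isSupported is used so that the sketch only returns names the running JVM actually knows.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of any Sun API.
public class CodePageNames {

    // Builds candidate Java encoding names for a numeric code page, most preferred first.
    static List<String> candidates(int codePage, boolean microsoft) {
        List<String> names = new ArrayList<String>();
        if (microsoft) {
            names.add("MS" + codePage); // Microsoft convention, e.g. MS932
        }
        names.add("Cp" + codePage);     // IBM convention, also used as a fallback
        return names;
    }

    // Returns the first candidate this JVM supports, or null if none is supported.
    static String resolve(int codePage, boolean microsoft) {
        for (String name : candidates(codePage, microsoft)) {
            if (Charset.isSupported(name)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Guess the vendor from the operating system name; this is only a heuristic.
        String os = System.getProperty("os.name").toLowerCase();
        boolean microsoft = os.startsWith("windows");
        System.out.println(resolve(932, microsoft));
    }
}
```

An application that knows where its data came from should pass that knowledge in directly rather than rely on the os.name guess, since the data may have been produced on a different platform than the one running the program.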


Q:3.4 Your documentation on JDK 1.1.7 says that Windows CP1252 is the default code page for the JDK 1.1.7 Java compiler. Could you kindly tell me how to set Windows CP1251 as the default code page? I use Windows NT 4.0 on my machine.

The documentation is actually not quite correct on that point. The default code page is the one that Windows uses for the default region. So, if you run on any localized Windows version whose default region uses CP1251 (for example, Russian), the JRE and all the tools use CP1251 by default.

For some tools, you can specify a different encoding on the command line. For example, if you run on an English version of NT, you can still compile source code written in CP1251 by specifying "-encoding CP1251" on the command line.

Another important point is that you generally do not have to rely on the default encoding. Our API lets you specify explicitly whatever encoding you wish, so you can still read files in CP1251 even when the default encoding is CP1252 or something else.
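As a small illustration of naming the encoding explicitly, the sketch below decodes two CP1251 bytes into a String and encodes them back, regardless of the platform default. It assumes the running JVM supports the "Cp1251" encoding name; the class name ExplicitEncoding and the method decode are illustrative.

```java
import java.io.UnsupportedEncodingException;

// Illustrative class name; not part of any Sun API.
public class ExplicitEncoding {

    // Decodes bytes with the named encoding rather than the platform default.
    static String decode(byte[] bytes, String encoding) throws UnsupportedEncodingException {
        return new String(bytes, encoding);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // In CP1251, 0xC4 0xE0 are the Cyrillic letters U+0414 U+0430.
        byte[] cp1251 = { (byte) 0xC4, (byte) 0xE0 };
        String text = decode(cp1251, "Cp1251");
        System.out.println(text.equals("\u0414\u0430"));

        // Encoding works the other way with String.getBytes(encodingName).
        byte[] back = text.getBytes("Cp1251");
        System.out.println(back.length);
    }
}
```

The same encoding name can be passed to InputStreamReader and OutputStreamWriter when working with files instead of byte arrays.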
