FAQ

How Can I Determine the Encoding of a File?

 
 
The Problem
Those of us working in the field of multilingual computing regularly have to figure out the character encoding of a file.
This is often the case when handling files (software message files, HTML files, HTML fragment files, etc.) returned from translation vendors. No offence, vendors, but sometimes files are returned in an encoding different from the one requested! So while you may know the language of the page, you may not know its character encoding.

In today's sophisticated web content authoring systems, fragments of HTML files are often sent to vendors for translation. One fragment might contain the masthead, another the left navigation, another the footer, and yet another the actual main content.
When the localized files are eventually compiled into a single HTML file, any fragment that is not in the same character encoding as the rest will come out garbled. So it's important to know the character encoding of your files.
The Solution
There are a few tools & tricks that can be used:
  • Solaris's auto_ef utility
    • Available in Solaris 10, the Auto Encoding Finder is a very useful utility for determining a file's encoding. It judges the character encoding in two ways: by using the iconv character encoding conversion utility to check whether a given code conversion succeeds on the file, and by performing frequency analyses on the character sequences that appear in the file.
      Though not fool-proof, the utility is the best available. Here are some examples of its use on a Korean web page. The first one uses the -a flag to list all the possible character encodings, each with its degree of probability.
      bash-2.05b$ auto_ef -a kr.html
      ko_KR.euc  0.95
      zh_TW-euc  0.03
      zh_CN.euc  0.01
      
      The -l 3 option asks the utility to be more thorough, so it runs a little slower but gives the most accurate result.
      bash-2.05b$ auto_ef -l 3 kr.html
      ko_KR.euc
      bash-2.05b$
      
      For the full details on this utility, read the man page.
  • Use Mozilla*
    Again, this is not a fool-proof method, and it is not as elegant as auto_ef, but Mozilla is available for most platforms, for free.
    For this technique to work, you need to know the language of the page, so that you can tell when the text is displaying correctly.
    • Open your file in Mozilla. If the text displays properly, simply choose View / Character Encoding; the marked character encoding is the character encoding of your page.
    • If the text does not display correctly, then do the following:
      • First, make sure that you have the correct font support. If you believe that your page contains, say, Japanese, then go to a Japanese web site and ensure that the text displays correctly.
      • Once you've established that you have the correct font support, go to View / Character Encoding and change the character encoding to what you think it could be. Keep trying candidates until the page displays correctly; the encoding selected at that point is the character encoding of your page.
    *Most other browsers have similar functionality.
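
If auto_ef is not available, the first of its two checks (does a given code conversion succeed on the file?) can be roughly approximated with plain iconv. The sketch below is only an illustration of that idea, not how auto_ef itself is implemented: the function name try_encodings is made up for this example, and the encoding names you pass in are assumptions that vary by platform (check `iconv -l` on your system).

```shell
#!/bin/sh
# Print every candidate encoding from which FILE converts to UTF-8
# without error.
# Usage: try_encodings FILE ENCODING...
# (try_encodings is a hypothetical helper, not a standard utility.)
try_encodings() {
    file=$1
    shift
    for enc in "$@"; do
        # iconv exits non-zero when it meets a byte sequence that is
        # invalid in the source encoding; discard the converted text
        # and keep only the success/failure result.
        if iconv -f "$enc" -t UTF-8 "$file" >/dev/null 2>&1; then
            printf '%s\n' "$enc"
        fi
    done
}
```

For example, `try_encodings kr.html UTF-8 EUC-KR EUC-JP` lists whichever of those three names convert cleanly. A successful conversion only shows that the bytes are valid in that encoding, not that the result is readable: several encodings may pass for the same file, and single-byte encodings such as ISO-8859-1 accept almost any input. That is why auto_ef also performs frequency analysis, and why it still helps to look at the converted text yourself.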