
Living with XML

 
By Mo Budlong, August 2002  

THIS INFORMATION AND EXAMPLES OF CODE ARE BEING PROVIDED BY SUN AS A COURTESY, "AS IS," AND SUN DISCLAIMS ANY AND ALL WARRANTIES PERTAINING THERETO, INCLUDING ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT.

XML is a powerful new tool for representing data and their relationships. XML syntax is straightforward, and XML parsers are freely available. Understanding XML will help you put XML to use in your own projects.

XML has pretty much revolutionized the transmission of data between applications. It is now possible to transmit the data, the field name, the type of data and much, much more. Consider, for example, the following listing containing information being sent from company A to company B.

VRThe Victoria CARAVEL 48SAIL 24000
Code Example 1 Sample flat data

If you can fathom what these data are all about by looking at this record, then you should apply for your spot in the data psychic's hall of fame, immediately.

This type of information is usually called flat data, and a file containing lots of this stuff would be called a flat file. It is flat because it is just bytes with no perspective on what the data are, or what is important.

Conversely, in Code Example 2 below, an XML version of the same data provides a hierarchical description of the information.

<vessel>
      <record_type>VR</record_type>
      <registered_name>The 
            Victoria</registered_name>
      <type>CARAVEL</type>
      <compliment>48</compliment>
      <capacity>
            <power>SAIL</power>
            <range units="Nautical 
                  Miles">24000</range>
      </capacity>
</vessel>
Code Example 2 Sample XML

In Code Example 2 you are looking at exactly the same information, but suddenly it makes sense. This is information about a sailing vessel, its crew complement, and its range in nautical miles. A pair of tags together with the text between them is called an element. The text between the tags is called the text or the value of the element. Additional data inside a tag, such as units, are called attributes.

The XML carries both the data, and data about the data (frequently called metadata).

There have been earlier efforts to produce a format that would carry data and data about data, the most famous of which is probably Comma Separated Values or CSV format. A CSV format file provided the same information in comma separated (and quoted for strings) fields as in Code Example 3.

"VR","The Victoria","CARAVEL",48,"SAIL",24000

Code Example 3 Sample CSV format

CSV added a "First row contains column names" option so that data looked like Code Example 4.

"rec_type", "name", "type", "compliment", "power", "range"
"VR", "The Victoria", "CARAVEL", 48, "SAIL", 24000

Code Example 4 Sample CSV format with column names in row one

But the program and the programmer had to know in advance that the first row contained the column names, or the program had to do many strange-looking tests to decide if row one contained a list of column names.
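The header-row problem can be sketched in a few lines of C++. This is a minimal illustration, not a complete CSV reader: the names split_csv_line, CsvTable, and parse_csv are invented for this sketch, and it assumes simple quoting with no escaped quotes or embedded line breaks. The point is the first_row_is_header flag, which the caller has to supply because nothing in the data says whether row one holds names or values.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line into fields. Simplified: double quotes group a field,
// but escaped quotes ("") and embedded line breaks are not handled.
std::vector<std::string> split_csv_line(const std::string& line)
{
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    for (char ch : line) {
        if (ch == '"') {
            in_quotes = !in_quotes;          // quotes delimit, they are not data
        } else if (ch == ',' && !in_quotes) {
            fields.push_back(current);       // a comma outside quotes ends a field
            current.clear();
        } else if (ch == ' ' && !in_quotes && current.empty()) {
            // skip blanks that precede a field
        } else {
            current += ch;
        }
    }
    fields.push_back(current);
    return fields;
}

// Hypothetical container for the parsed file.
struct CsvTable {
    std::vector<std::string> column_names;           // empty when there is no header
    std::vector<std::vector<std::string>> rows;
};

// The caller must say whether row one holds column names;
// the file itself carries no such indication.
CsvTable parse_csv(const std::string& text, bool first_row_is_header)
{
    CsvTable table;
    std::istringstream in(text);
    std::string line;
    bool first = true;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::vector<std::string> fields = split_csv_line(line);
        if (first && first_row_is_header) {
            table.column_names = fields;
        } else {
            table.rows.push_back(fields);
        }
        first = false;
    }
    return table;
}
```

Called with first_row_is_header set to false on the data in Code Example 4, the same function treats the column names as a data record, which is exactly the ambiguity described above.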

Once you get down to the range field, you have a lost dimension that is immediately visible in the XML. A non-nautical buyer of "The Victoria" might pass it over thinking that the range was expressed in miles and consequently not sufficient to circumnavigate the earth. If that buyer had been an agent working for Ferdinand Magellan, "The Victoria" might never have been purchased at all. Instead, the first ship to sail around the world could have ended up on milk runs in the Mediterranean.

XML offers advantages to the human who must read that data, in that it usually communicates a great deal of additional information about the data.

For the programmer, the addition of all the data about data makes it possible to programmatically come up with reasonable methods of extracting the desired information.

XML is just the text as you see it in Code Example 2. In fact XML is probably more likely to look like the sample in Code Example 5. The tags are strung together end to end without the tidy structure shown in Code Example 2. This is because of the way that an XML parser handles white space, which is covered later.

<vessel><record_type>VR</record_type><registered_name>The Victoria</registered_name><type>CARAVEL</type><compliment>48</compliment><capacity><power>SAIL</power><range units="Nautical Miles">24000</range></capacity></vessel>

Code Example 5 Sample XML text

In order to use XML effectively, you must load the XML text into a data structure that allows you to take advantage of the tree-like structure of the data. The data structure is called a document object. Loading the data consists of reading the XML text into a parser that checks for well-formed XML. Some of the simpler rules of well-formed XML are:

  1. All tags start and end.
  2. An inner tag starts and ends before the outer tag ends.
  3. The attributes of an element each have a unique name.
  4. There can be only one top level element.
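Rules 1, 2, and 4 can be illustrated with a small stack-based check. The sketch below is only an illustration of the idea, not how any real parser is written: the function name tags_are_balanced is invented here, rule 3 (unique attribute names) is not checked, and entities and DTD content are ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true when every tag is closed, tags nest properly, and there is
// exactly one top level element. Comments, CDATA sections, <? ... ?>
// instructions, and <! ... > declarations are skipped.
bool tags_are_balanced(const std::string& xml)
{
    std::vector<std::string> open_tags;   // stack of elements still open
    int root_count = 0;
    std::size_t pos = 0;

    while ((pos = xml.find('<', pos)) != std::string::npos) {
        if (xml.compare(pos, 4, "<!--") == 0) {
            std::size_t end = xml.find("-->", pos + 4);
            if (end == std::string::npos) return false;
            pos = end + 3;
            continue;
        }
        if (xml.compare(pos, 9, "<![CDATA[") == 0) {
            std::size_t end = xml.find("]]>", pos + 9);
            if (end == std::string::npos) return false;
            pos = end + 3;
            continue;
        }
        if (xml.compare(pos, 2, "<?") == 0) {
            std::size_t end = xml.find("?>", pos + 2);
            if (end == std::string::npos) return false;
            pos = end + 2;
            continue;
        }

        // Find the '>' that ends this tag, ignoring any '>' inside quotes.
        std::size_t end = pos + 1;
        char quote = 0;
        while (end < xml.size() && (quote != 0 || xml[end] != '>')) {
            char ch = xml[end];
            if (quote != 0) {
                if (ch == quote) quote = 0;
            } else if (ch == '"' || ch == '\'') {
                quote = ch;
            }
            ++end;
        }
        if (end >= xml.size()) return false;          // rule 1: tag never ends

        std::string body = xml.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (body.empty()) return false;
        if (body[0] == '!') continue;                 // e.g. <!DOCTYPE ...>

        if (body[0] == '/') {
            // Closing tag: it must match the most recently opened element.
            std::string name = body.substr(1);
            name.erase(name.find_last_not_of(" \t\r\n") + 1);
            if (open_tags.empty() || open_tags.back() != name) return false;  // rule 2
            open_tags.pop_back();
        } else {
            bool self_closing = body.back() == '/';
            std::string name = body.substr(0, body.find_first_of(" \t\r\n/"));
            if (name.empty()) return false;
            if (open_tags.empty()) ++root_count;      // a new top level element
            if (!self_closing) open_tags.push_back(name);
        }
    }
    return open_tags.empty() && root_count == 1;      // rules 1 and 4
}
```

A document that ends with an element still open, closes tags in the wrong order, or has two top level elements makes the function return false.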

When an XML document is loaded into a document object, the entire structure is expressed as one or more nodes, having parent nodes, children nodes, and sibling nodes. Elements, attributes, text values, and other pieces of the document each become nodes. There are three main characteristics of a node.

  1. Node type
  2. Node name
  3. Node value

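There are 13 node types. The first 12 are defined by the W3C DOM specification. The thirteenth, XML Declaration, is an extension offered by some parsers, including the one used later in this article: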

  1. Element
  2. Attribute
  3. Text
  4. CDATA Section
  5. Entity Reference
  6. Entity
  7. Processing Instruction
  8. Comment
  9. Document
  10. Document Type
  11. Document Fragment
  12. Notation
  13. XML Declaration

You have already seen 1, 2, 3, and 9. Some of the other common nodes are described below.

Comment Node


A comment node is any text between <!-- and -->. A comment is shown in Code Example 6.

<vessel>
<!-- 
I hear Magellan is in the market 
for five vessels for an around the 
world cruise. See if he likes the 
look of The Victoria. Don't talk to 
his buyer. He doesn't know a Nautical 
Mile from a potato.
-->
      <record_type>VR</record_type>
      <registered_name>The 
           Victoria</registered_name>
      <type>CARAVEL</type>
      <compliment>48</compliment>
      <capacity>
           <power>SAIL</power>
           <range units="Nautical 
                 Miles">24000</range>
      </capacity>
</vessel>
Code Example 6 An XML comment

An XML Declaration

An XML declaration node can be added as the first line of the document as in Code Example 7.

The XML declaration can contain three pieces of information used to process the XML. The version attribute indicates which version of XML is used; 1.0 is the current version. The standalone attribute can be set to standalone='yes' or standalone='no' and indicates whether the document is self-contained or depends on markup declarations held outside the document. Parsing may be quicker when standalone='yes'.

The encoding attribute allows for character sets other than the default. Some sample values for encoding might be:

encoding='UTF-8'      (also the default)
encoding='US-ASCII'
encoding='ISO-8859-1'

<?xml version='1.0'?>
<vessel>
<!-- 
I hear Magellan is in the market 
for five vessels for an around the 
world cruise. See if he likes the 
look of The Victoria. Don't talk to 
his buyer. He doesn't know a Nautical 
Mile from a potato.
-->
<record_type>VR</record_type>
      <registered_name>The 
            Victoria</registered_name>
      <type>CARAVEL</type>
      <compliment>48</compliment>
      <capacity>
            <power>SAIL</power>
            <range units="Nautical 
                  Miles">24000</range>
      </capacity>
</vessel>
Code Example 7 An XML declaration

Entity References
Entity references are used to handle special characters. In XML, several characters are subject to special interpretation by the parser: ampersand (&), less than (<), greater than (>), single quote ('), and double quote ("). To represent these characters within the text of a node, you must use the special encodings shown in Table 1.

Entity    Entity Reference    Represents
lt        &lt;                <
gt        &gt;                >
amp       &amp;               &
apos      &apos;              '
quot      &quot;              "

The &apos; entity reference is optional and does not work with all browsers. An alternative method is &#39;, which uses the numeric value of an apostrophe.

Table 1 XML entities
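As a rough sketch of how a program might apply Table 1 before writing text into an XML document, the following function replaces each special character with its entity reference. The name escape_xml is made up for this example, and the apostrophe is written in its numeric form for the compatibility reason noted above.

```cpp
#include <string>

// Replace the five special characters from Table 1 with entity references.
std::string escape_xml(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (char ch : text) {
        switch (ch) {
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '&':  out += "&amp;";  break;
        case '"':  out += "&quot;"; break;
        case '\'': out += "&#39;";  break;   // numeric form of the apostrophe
        default:   out += ch;       break;
        }
    }
    return out;
}
```

The ampersand needs no special ordering here because the input is scanned one character at a time and the output is never rescanned.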

CDATA Section
A CDATA section is a method of marking out a section of text that is not subject to interpretation by the parser. This is useful if you have text that contains many symbols that would otherwise have to be represented by entities. Suppose you were trying to transmit a C++ code fragment. Code Example 8 shows the code and two examples of how to send the text. The first example uses entity references, the second uses a CDATA section. A CDATA section begins with <![CDATA[ and ends with ]]>.

if (ch <= 'a') ch &= 'a';

<example_01>
if (ch &lt;= 'a') ch &amp;= 'a';
</example_01>

<example_02>
<![CDATA[if (ch <= 'a') ch &= 'a';]]>
</example_02>
Code Example 8 C code in XML using entity references and CDATA section
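Producing a CDATA section from a program is mostly a matter of adding the opening and closing markers. The one catch, which comes from the XML specification rather than from this article, is that the text itself must not contain the sequence ]]> because that would end the section early. The sketch below, using an invented function name wrap_in_cdata, handles that case by closing the section in the middle of the sequence and opening a new one.

```cpp
#include <cstddef>
#include <string>

// Wrap text in a CDATA section. Any "]]>" inside the text is split across
// two adjacent sections so that it cannot terminate the first one early.
std::string wrap_in_cdata(const std::string& text)
{
    const std::string terminator = "]]>";
    std::string out = "<![CDATA[";
    std::size_t start = 0;
    std::size_t hit;
    while ((hit = text.find(terminator, start)) != std::string::npos) {
        // Copy up to and including "]]", close the section, then reopen it
        // so that the ">" becomes the first character of the next section.
        out.append(text, start, hit - start + 2);
        out += "]]><![CDATA[";
        start = hit + 2;
    }
    out.append(text, start, std::string::npos);
    out += "]]>";
    return out;
}
```

For the code fragment in Code Example 8 the function simply returns the text surrounded by the two markers, matching the second example.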

The Trouble with Spaces
Text can appear in a parent element. Code Example 9 is valid XML. The parent tag <vessel> contains text as well as children. This would seem to be a bonus until you realize the effect that this has on spaces.

<vessel>This ship is for sale
      <record_type>VR</record_type>
      <registered_name>The 
           Victoria</registered_name>
      <type>CARAVEL</type>
      <compliment>48</compliment>
      <capacity>
            <power>SAIL</power>
            <range units="Nautical 
                 Miles">24000</range>
      </capacity>
</vessel>
Code Example 9 Sample XML with text in a parent element

Text appearing between tags is considered to be text, and this includes white spaces. Therefore the text between <vessel> and <record_type> is all text including the line terminator and spaces leading up to <record_type>.

The line terminator and spaces between </record_type> and <registered_name> are also considered text. When a document is parsed, it is useful to avoid all this white space clutter, so it is common to see a file of XML text that contains all text and data concatenated end to end, without the pretty formatting. A sample of this style was shown in Code Example 5. There are ways to eliminate extraneous white space in most XML parsers, but they are beyond the scope of this article.
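To see how much white space clutter the pretty formatting adds, the sketch below collects every run of text that sits between one tag and the next. It is deliberately naive: the function names text_runs and is_whitespace_only are invented for this illustration, and the scan ignores comments, CDATA sections, and '>' characters inside attribute values, all of which a real parser handles.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect each run of characters found between the end of one tag
// and the start of the next.
std::vector<std::string> text_runs(const std::string& xml)
{
    std::vector<std::string> runs;
    std::size_t pos = 0;
    while (true) {
        std::size_t close = xml.find('>', pos);
        if (close == std::string::npos) break;
        std::size_t open = xml.find('<', close + 1);
        if (open == std::string::npos) break;
        if (open > close + 1) {
            runs.push_back(xml.substr(close + 1, open - close - 1));
        }
        pos = open;
    }
    return runs;
}

// True when a run holds nothing but spaces, tabs, and line terminators.
bool is_whitespace_only(const std::string& text)
{
    return text.find_first_not_of(" \t\r\n") == std::string::npos;
}
```

Run against indented XML, text_runs reports a white-space-only run after almost every tag. Run against the concatenated style, it reports only the real values.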

Figure 1 is a representation of the nodes that are created by parsing Code Example 7 if Code Example 7 were jammed together without intervening spaces. Each rectangle represents a node. Each node, except for the XML declaration node, has three characteristics: a node type, a name, and a value.

The top level element <vessel> is not the top level of the document object. The document object actually creates a pseudo-node with a type of document and a name of #document. <vessel> appears under that as the one and only element.

The XML declaration node also appears directly under the document node, but it is not an element.

When an element such as <registered_name> or <type> contains text, the text is NOT the value of the element. The parser loads the text, creates a text node under the element, and stores the text as the value of that text node. This is important to remember when processing a document object, because you cannot locate an element and directly extract the text as its value. Instead, you must descend one level to a child named #text, and the value of that node contains the text, such as "The Victoria" or "CARAVEL".
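The extra level can be pictured with a toy node structure. The Node struct, the NodeType enum, and the element_text function below are hypothetical stand-ins written for this sketch, not part of any parser's API. They only model the three characteristics (type, name, value) plus a list of children. The sketch assumes a compiler that accepts a vector of the enclosing struct as a member, which is standard from C++17.

```cpp
#include <string>
#include <vector>

// A simplified stand-in for a document object node.
enum class NodeType { Element, Attribute, Text, Document };

struct Node {
    NodeType type;
    std::string name;
    std::string value;
    std::vector<Node> children;
};

// The text of an element lives in its #text children, not in the
// element's own value, so gather it from one level down.
std::string element_text(const Node& element)
{
    std::string text;
    for (const Node& child : element.children) {
        if (child.type == NodeType::Text) {
            text += child.value;
        }
    }
    return text;
}
```

For a registered_name element built with one #text child, the element's own value is empty while element_text returns "The Victoria".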

The attribute node units behaves like an element node except that its value is directly available within the node. The value also appears as a text node child under the attribute node.

Figure 1 A document object after loading XML

Once you have decided to use XML, you must choose an XML parsing package. There are several, but the obvious choice at the moment is Xerces from the Apache project.

The package with complete source code is available from the Apache XML Project and downloads, installs, and builds easily.

The sample code provided with Xerces is almost too thorough. Instead of a series of gentle steps, even the simplest samples use features of Xerces that will leave you scratching your head.

In Code Example 10, the xercecizer is a very simple exerciser for Xerces. It loads an XML file and outputs the resulting tree. It does not handle or demonstrate all node types, but it does show you how to use several of the common nodes. Error handling and other good coding practices have been left out to make the code clearer.

This code recursively walks the document object tree by using DOM_Node::getFirstChild() and DOM_Node::getNextSibling() logic. This is the normal process for walking down and across the tree.


The code is also upside-down, with main as the last function, in order to avoid extra function declarations.

// xercecizer.cpp : Some simple Xerces Exercises

#include <cstdlib>
using std::exit;

#include <iostream>
using std::cout;
using std::endl;
using std::cerr;
using std::ostream;

#include <sstream>
using std::ostringstream;

#include <string>
using std::string;


#include <util/PlatformUtils.hpp>
#include <util/XMLString.hpp>
#include <parsers/DOMParser.hpp>
#include <dom/DOM_NamedNodeMap.hpp>

// forward declaration of some functions
void do_node(int level, DOM_Node& node);
const char* type_name(const DOM_Node& node);

// This is a simple class to do easy though
// not terribly efficient transcoding of XMLCh
// and DOMString data to the local code page for display,
// using the Xerces transcode() functions.
// Parts derived from the Xerces sample code.
class StrX
{
public :
    // Constructor for an XMLCh*
    // Translates the argument and saves it
    StrX(const XMLCh* const toTranscode)
    {
        // XMLString::transcode() allocates the local form
        fLocalForm = XMLString::transcode(toTranscode);
    }

    // or a DOMString
    // Translates the argument and saves it
    StrX(const DOMString& toTranscode)
    {
        // DOMString::transcode() allocates the local form
        fLocalForm = toTranscode.transcode();
    }

    // delete the translated string
    ~StrX()
    {
        delete [] fLocalForm;
    }


    //  Getter method
    const char* localForm() const
    {
        return fLocalForm;
    }

private :
    // This is the local code page form of the string.
    char*   fLocalForm;
};


// How to output a StrX object
inline ostream& operator<<(ostream& target, 
  const StrX& toDump)
{
      // output the saved value in local code page
      // format
    target << toDump.localForm();
    return target;
}

// How to output a DOM_Node
inline ostream& operator<<(ostream& target, 
  const DOM_Node& node)
{
      // output the node type name, the node name and 
      // the node value
      target << type_name(node) 
        << " name=" << StrX(node.getNodeName()) 
        << " value=" << StrX(node.getNodeValue());
      return target;
}


// A short help message
void usage()
{
    cout << "\nUsage:\n"
            "    ximple XML_file\n\n"
            "This program exercises some features of Xerces\n"
            "XML_file must be the name of a file containing "
            "properly formed XML\n"
         << endl;
}


// Returns a  type name char string for each node type
const char* type_name(const DOM_Node& node)
{
      switch(node.getNodeType())
      {
      case DOM_Node::ELEMENT_NODE:
            return "element     ";
      case DOM_Node::ATTRIBUTE_NODE:
            return "attribute   ";
      case DOM_Node::TEXT_NODE:
            return "text        ";
      case DOM_Node::CDATA_SECTION_NODE:
            return "cdata       ";
      case DOM_Node::ENTITY_REFERENCE_NODE:
            return "entity ref  ";
      case DOM_Node::ENTITY_NODE:
            return "entity      ";
      case DOM_Node::PROCESSING_INSTRUCTION_NODE:
            return "instruction ";
      case DOM_Node::COMMENT_NODE:
            return "comment     ";
      case DOM_Node::DOCUMENT_NODE:
            return "document    ";
      case DOM_Node::DOCUMENT_TYPE_NODE:
            return "doc type    ";
      case DOM_Node::DOCUMENT_FRAGMENT_NODE:
            return "doc fragment";
      case DOM_Node::NOTATION_NODE:
            return "notation    ";
      case DOM_Node::XML_DECL_NODE:
            return "xml decl    ";
      default:
            return "unknown     ";
      }
}

void lead_in(int level)
{
      for(int ix = 0; ix < level; ++ix)
      {
            cout << " ";
      }
}

// Output the node using the inline ostream&  
// operator<<() function defined above
void output_values(int level, DOM_Node& node)
{
      lead_in(level);
      cout << node << endl;
}

// An xml declaration node has its own special attributes
// that must be retrieved using getVersion(), getEncoding()
// and getStandalone(). These values cannot be retrieved 
// using a getAttributes() call.
void do_decl_node(int level,DOM_XMLDecl& decl)
{
      lead_in(level);
      cout << type_name(decl) 
        << " version=" << StrX(decl.getVersion()) 
        << " encoding=" << StrX(decl.getEncoding()) 
        << " standalone=" << StrX(decl.getStandalone()) 
            <<  endl;
}

// Attributes of an element (or node) are retrieved as
// a NamedNodeMap through getAttributes(). The
// resulting list can be processed as if it were an
// indexed list as shown below, or by asking for
// attributes by name.
void do_attributes(int level,DOM_Node& node)
{
      DOM_NamedNodeMap nnm;
      DOM_Node attr;
      long ix;

      nnm = node.getAttributes();
      if(nnm == NULL)
      {
            return;
      }
      for(ix = 0; ix < nnm.getLength(); ++ix)
      {
            attr = nnm.item(ix);
            output_values(level+1,attr);
      }


}

// This function processes the first child and then each
// next sibling of the passed-in node
void do_node_children(int level,DOM_Node& node)
{
      DOM_Node child;

      for(child = node.getFirstChild();
          child != NULL;
          child = child.getNextSibling())
      {
            do_node(level+1,child);
      }
}

// Process the node based on the node type
void do_node(int level,DOM_Node& node)
{
      switch(node.getNodeType())
      {
      // output an element, then its attributes, 
      // and then its children
      case DOM_Node::ELEMENT_NODE:
            output_values(level,node);
            do_attributes(level,node);
            do_node_children(level,node);
            break;
      // output the document node and then
      // its children
      case DOM_Node::DOCUMENT_NODE:
            output_values(level,node);
            do_node_children(level,node);
            break;
      // output special xml decl values
      case DOM_Node::XML_DECL_NODE:
            do_decl_node(level,(DOM_XMLDecl&)node);
            break;
      // whatever else output as a node
      default:
            output_values(level,node);
            break;
      }
}

// Set up a parser and then parse the
// XML in the passed in file name.
// Once it is parsed successfully,
// tree walk the nodes outputting values
void process(const char* xmlFile)
{
      // Instantiate the DOM parser.
      DOMParser parser;
      bool errorsOccured = false;


      // set the parser to create an XML declaration node
      // default is not to do this
      parser.setToCreateXMLDeclTypeNode(true);

      // Note: That a full blown parsing effort would
      // install a custom error handler that is derived
      // from the Xerces ErrorHandler class. For 
      // Example: ErrorHandler *errHandler = new 
      // MyErrorHandler; parser.setErrorHandler(errHandler);
      // The DOMPrint sample project provides an example
      // of this.


      // try a parse with a broad catch on all exceptions
      try
      {
          parser.parse(xmlFile);
      }
      catch (...)
      {
          cerr << "An error occurred during parsing\n" << endl;
          errorsOccured = true;
      }

      // bail out with a failure status if there were errors
      if(errorsOccured == true)
          exit (1);

      // get the top level document node from the parser
      DOM_Document doc = parser.getDocument();

      // and tree walk the nodes reporting as you go
      do_node(0,doc);
}


// Set up, look for a file name, and process it
// if found. If there is no name on the command line,
// display usage and exit
int main(int argc, char* argv[])
{
    // Initialize the XML4C system
    try
    {
        XMLPlatformUtils::Initialize();
    }

    catch (const XMLException& toCatch)
    {
         cerr << "Error during initialization! :\n"
              << StrX(toCatch.getMessage()) << endl;
         return 1;
    }

      // need a file name on the command line
      if(argc < 2)
      {
            usage();
            exit(0);
      }

      // the only argument should be the file name
      process(argv[1]);

    // And call the termination method
    XMLPlatformUtils::Terminate();
      return 0;
}

Code Example 10 A Xerces exerciser program

Assuming a file containing the following information:

<?xml version='1.0'?>
<vessel>
<!-- 
I hear Magellan is in the market 
for five vessels for an around the 
world cruise. See if he likes the 
look of The Victoria. Don't talk to 
his buyer. He doesn't know a Nautical 
Mile from a potato.
-->
<record_type>VR</record_type>
      <registered_name>The Victoria</registered_name>
      <type>CARAVEL</type>
      <seller><![CDATA[Gilligan & Sons]]></seller>
      <compliment>48</compliment>
      <capacity>
           <power>SAIL</power>
           <range units="Nautical Miles">24000</range>
      </capacity>
</vessel>
Code Example 11 Sample XML input file

The file would look like Code Example 12 once all the extraneous spaces are removed. In Code Example 12, the wrapping is caused by the screen width. The actual file would contain no line ends.

<?xml version='1.0'?>
<vessel><!-- I hear Magellan is in the market for five vessels for an around the world cruise. See if he likes the look of The Victoria. Don't talk to his buyer. He doesn't know a Nautical Mile from a potato.--><record_type>VR</record_type><registered_name>The Victoria</registered_name><type>CARAVEL</type><seller><![CDATA[Gilligan & Sons]]></seller><compliment>48</compliment><capacity><power>SAIL</power><range units="Nautical Miles">24000</range></capacity></vessel>

Code Example 12 The same XML with extraneous white space removed

The output of processing this file with the exerciser is shown in Code Example 13.

document     name=#document value=
 xml decl     version=1.0 encoding= standalone=
 element      name=vessel value=
  comment      name=#comment value= I hear Magellan is
 in the market for five vessels for an around the world
 cruise. See if he likes the look of The Victoria. Don't
 talk to his buyer. He doesn't know a Nautical Mile
 from a potato.
  element      name=record_type value=
   text         name=#text value=VR
  element      name=registered_name value=
   text         name=#text value=The Victoria
  element      name=type value=
   text         name=#text value=CARAVEL
  element      name=seller value=
   cdata        name=#cdata-section value=Gilligan & 
                  Sons
  element      name=compliment value=
   text         name=#text value=48
  element      name=capacity value=
   element      name=power value=
    text         name=#text value=SAIL
   element      name=range value=
    attribute    name=units value=Nautical Miles
    text         name=#text value=24000
Code Example 13 The output of Code Example 10 running against Code Example 12

Once you have mastered this example, try some of the simpler Xerces samples such as DOMPrint, MemParse, and StdInParse. Then take a look at CreateDOMDocument to learn how to create a document object by inserting values directly into the tree instead of creating them by loading an XML file.

Resources


The Apache Project for the complete Xerces package.

There are almost too many good XML sites, so it is difficult to zero in on a particular one. Two that are very useful are the O'Reilly site, and the XML 1.0 specification for the official scoop on the standard.

About the Author


Mo Budlong is president of King Computer Services and has been creating utilities, applications, and client server solutions on UNIX® boxes for over 15 years. He has published numerous books and articles on subjects ranging from Assembly language to XML. He is the author of the web based UNIX 101 column, and is soon to publish a book on UNIX basics.

Mo is also a musician who plays guitar, bass and keyboards, and messes around on several other instruments. He lives in Southern California with his wife and daughter and The Cat.
