THIS INFORMATION AND EXAMPLES OF CODE ARE BEING PROVIDED BY SUN AS A COURTESY, "AS IS," AND SUN DISCLAIMS ANY AND ALL WARRANTIES PERTAINING THERETO, INCLUDING ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT.
XML is a powerful new tool for representing data and their relationships. XML syntax is straightforward, and XML parsers are freely available. Understanding XML will help you put XML to use in your own projects.
XML has pretty much revolutionized the transmission of data between applications. It is now possible to transmit not only the data, but also the field name, the type of data, and much more. Consider, for example, the following listing containing information being sent from company A to company B.
Code Example 1 Sample flat data
If you can fathom what these data are all about by looking at this record, then you should apply for your spot in the data psychic's hall of fame immediately. This type of information is usually called flat data, and a file containing lots of this stuff would be called a flat file. It is flat because it is just bytes with no perspective on what the data are, or what is important. Conversely, in Code Example 2 below, an XML version of the same data provides a hierarchical description of the information.
<vessel>
  <record_type>VR</record_type>
  <registered_name>The Victoria</registered_name>
  <type>CARAVEL</type>
  <compliment>48</compliment>
  <capacity>
    <power>SAIL</power>
    <range units="Nautical Miles">24000</range>
  </capacity>
</vessel>
Code Example 2 Sample XML
In Code Example 2 you are looking at exactly the same information, but suddenly it makes sense. This is information about a sailing vessel, its crew complement, and its range in nautical miles. A pair of tags together with the text between them is called an element. The text between the tags is called the text, or the value, of the element. Additional data inside a tag, such as units, are called attributes. The XML carries both the data and data about the data (frequently called metadata).
There have been earlier efforts to produce a format that would carry data and data about data, the most famous of which is probably Comma Separated Values, or CSV, format. A CSV format file provided the same information in comma separated (and quoted for strings) fields as in Code Example 3.
Code Example 3 Sample CSV format
CSV added a "First row contains column names" option so that data looked like Code Example 4.
Code Example 4 Sample CSV format with column names in row one
But the program and the programmer had to know in advance that the first row contained the column names, or the program had to do many strange-looking tests to decide if row one contained a list of column names. Once you get down to the range field, the CSV has lost a dimension that is immediately visible in the XML: the units. A non-nautical buyer of "The Victoria" might pass it over thinking that the range was expressed in miles and consequently not sufficient to circumnavigate the earth. If that buyer had been an agent working for Ferdinand Magellan, "The Victoria" might never have been purchased at all. Instead, the first ship to sail around the world could have ended up on milk runs in the Mediterranean.
XML offers advantages to the human who must read that data, in that it usually communicates a great deal of additional information about the data. For the programmer, the addition of all the data about data makes it possible to programmatically come up with reasonable methods of extracting the desired information.
XML is just text, as you see in Code Example 2. In practice, XML is more likely to look like the sample in Code Example 5. The tags are strung together end to end without the tidy structure shown in Code Example 2. This is because of the way that an XML parser handles white space, which is covered later.
Code Example 5 Sample XML text
In order to use XML effectively, you must load the XML text into a data structure that allows you to take advantage of the tree-like structure of the data. The data structure is called a document object. Loading the data consists of reading the XML text into a parser that checks for well-formed XML. Some of the simpler rules of well-formed XML are:
When an XML document is loaded into a document object, the entire structure is expressed as one or more nodes, which have parent nodes, child nodes, and sibling nodes. Elements, attributes, text values, and other pieces of the document each become nodes. There are three main characteristics of a node.
There are 13 node types
You have already seen 1, 2, 3, and 9. Some of the other common nodes are described below.
Comment Node
A comment node is any text between <!-- and -->. A comment is shown in Code Example 6.
<vessel>
  <!--
    I hear Magellan is in the market
    for five vessels for an around the
    world cruise. See if he likes the
    look of The Victoria. Don't talk to
    his buyer. He doesn't know a Nautical
    Mile from a potato.
  -->
  <record_type>VR</record_type>
  <registered_name>The Victoria</registered_name>
  <type>CARAVEL</type>
  <compliment>48</compliment>
  <capacity>
    <power>SAIL</power>
    <range units="Nautical Miles">24000</range>
  </capacity>
</vessel>
Code Example 6 An XML comment
An XML Declaration
An XML declaration node can be added as the first line of the document as in Code Example 7.
The XML declaration can contain three pieces of information used to process the XML. The encoding attribute allows for character sets other than your locale default. Some sample values for encoding might be:
encoding='UTF-8' (also the default)
encoding='USASCII'
encoding='ISO8859-1'
<?xml version='1.0'?>
<vessel>
  <!--
    I hear Magellan is in the market
    for five vessels for an around the
    world cruise. See if he likes the
    look of The Victoria. Don't talk to
    his buyer. He doesn't know a Nautical
    Mile from a potato.
  -->
  <record_type>VR</record_type>
  <registered_name>The Victoria</registered_name>
  <type>CARAVEL</type>
  <compliment>48</compliment>
  <capacity>
    <power>SAIL</power>
    <range units="Nautical Miles">24000</range>
  </capacity>
</vessel>
Code Example 7 An XML declaration
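To make the contents of the declaration concrete, here is a minimal sketch in standard C++ that pulls the version, encoding, and standalone values out of a declaration string. It is an illustration only, not how Xerces does it: XmlDecl and parse_xml_decl are hypothetical names invented for this sketch, and the regular expression assumes a simple declaration written on one line.

```cpp
#include <regex>
#include <string>

// Holds the three values an XML declaration may carry.
// XmlDecl and parse_xml_decl are hypothetical names for this sketch.
struct XmlDecl
{
    std::string version;
    std::string encoding;
    std::string standalone;
};

// Scan a declaration such as <?xml version='1.0'?> and collect
// whichever of the three values are present. Missing values are
// left as empty strings.
XmlDecl parse_xml_decl(const std::string& decl)
{
    XmlDecl result;
    // name = 'value' or name = "value"; \2 matches the opening quote.
    static const std::regex field(
        R"re((version|encoding|standalone)\s*=\s*(['"])(.*?)\2)re");
    for (std::sregex_iterator it(decl.begin(), decl.end(), field), end;
         it != end; ++it)
    {
        const std::string name = (*it)[1].str();
        const std::string value = (*it)[3].str();
        if (name == "version")
            result.version = value;
        else if (name == "encoding")
            result.encoding = value;
        else
            result.standalone = value;
    }
    return result;
}
```

Calling parse_xml_decl on the first line of Code Example 7 would fill in only the version and leave the encoding and standalone values empty.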
Entity References
The Table 1 XML entities
CDATA Section
if (ch <= 'a') ch &= 'a';
<example_01>
  if (ch &lt;= 'a') ch &amp;= 'a';
</example_01>
<example_02>
  <![CDATA[if (ch <= 'a') ch &= 'a';]]>
</example_02>
Code Example 8 C code in XML using entity references and CDATA section
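The entity reference approach can be sketched as a small helper that rewrites the reserved characters before text is placed inside an element. This is a minimal sketch under stated assumptions: escape_text is a hypothetical name, not a Xerces function, and it handles element text only. Quote characters inside attribute values would need &apos; or &quot; as well, which this sketch does not attempt.

```cpp
#include <string>

// Replace the characters that cannot appear literally in element
// text with their predefined entity references.
// escape_text is a hypothetical helper, not part of Xerces.
std::string escape_text(const std::string& raw)
{
    std::string out;
    out.reserve(raw.size());
    for (char ch : raw)
    {
        switch (ch)
        {
        case '<': out += "&lt;";  break;
        case '>': out += "&gt;";  break;
        case '&': out += "&amp;"; break;
        default:  out += ch;      break;
        }
    }
    return out;
}
```

Passing the line of C code from Code Example 8 through escape_text yields text that can sit between element tags without a CDATA section.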
The Trouble with Spaces
<vessel>This ship is for sale
  <record_type>VR</record_type>
  <registered_name>The Victoria</registered_name>
  <type>CARAVEL</type>
  <compliment>48</compliment>
  <capacity>
    <power>SAIL</power>
    <range units="Nautical Miles">24000</range>
  </capacity>
</vessel>
Code Example 9 Sample XML with text in a parent element
Text appearing between tags is considered to be text, and this includes white spaces. Therefore the text between
The line terminator and spaces between
Figure 1 is a representation of the nodes that are created by parsing Code Example 7 if Code Example 7 were jammed together without intervening spaces. Each rectangle represents a node. Each node, except for the XML declaration node, has three characteristics: a node type, a name, and a value.
The top level element
The XML declaration node also appears directly under the document node, but it is not an element.
When an element such as
The attribute node
Once you have decided to use XML, you must choose an XML parsing package. There are several, but the obvious choice at the moment is Xerces from the Apache project. The package with complete source code is available from The Jakarta Project and downloads, installs, and builds easily. The sample code provided with Xerces is almost too thorough. Instead of a series of gentle steps, even the simplest samples use features of Xerces that will leave you scratching your head.
In Code Example 10, the
This code recursively walks the document object tree by using
The code is also upside-down, with
// xercecizer.cpp : Some simple Xerces Exercises
#include <cstdlib>   // exit()
#include <iostream>
using std::cout;
using std::endl;
using std::cerr;
using std::ostream;
#include <sstream>
using std::ostringstream;
#include <string>
using std::string;
#include <util/PlatformUtils.hpp>
#include <util/XMLString.hpp>
#include <parsers/DOMParser.hpp>
#include <dom/DOM_NamedNodeMap.hpp>

// forward declaration of some functions
void do_node(int level, DOM_Node& node);
const char* type_name(const DOM_Node& node);
// This is a simple class to do easy, though
// not terribly efficient, transcoding of XMLCh
// and DOMString data to the local code page for display,
// using the Xerces transcode() functions.
// Parts derived from the Xerces sample code.
class StrX
{
public :
    // Constructor for an XMLCh*
    // Translates the argument and saves it
    StrX(const XMLCh* const toTranscode)
    {
        // Transcode to the local code page
        fLocalForm = XMLString::transcode(toTranscode);
    }

    // or a DOMString
    // Translates the argument and saves it
    StrX(const DOMString& toTranscode)
    {
        // Transcode to the local code page
        fLocalForm = toTranscode.transcode();
    }

    // delete the translated string
    ~StrX()
    {
        delete [] fLocalForm;
    }

    // Getter method
    const char* localForm() const
    {
        return fLocalForm;
    }

private :
    // This is the local code page form of the string.
    char* fLocalForm;
};
// How to output a StrX object
inline ostream& operator<<(ostream& target,
                           const StrX& toDump)
{
    // output the saved value in local code page
    // format
    target << toDump.localForm();
    return target;
}

// How to output a DOM_Node
inline ostream& operator<<(ostream& target,
                           const DOM_Node& node)
{
    // output the node type name, the node name and
    // the node value
    target << type_name(node)
           << " name=" << StrX(node.getNodeName())
           << " value=" << StrX(node.getNodeValue());
    return target;
}
// A short help message
void usage()
{
    cout << "\nUsage:\n"
            "    ximple XML_file\n\n"
            "This program exercises some features of Xerces\n"
            "XML_file must be the name of a file containing "
            "properly formed XML\n"
         << endl;
}
// Returns a type name char string for each node type
const char* type_name(const DOM_Node& node)
{
    switch(node.getNodeType())
    {
    case DOM_Node::ELEMENT_NODE:
        return "element ";
    case DOM_Node::ATTRIBUTE_NODE:
        return "attribute ";
    case DOM_Node::TEXT_NODE:
        return "text ";
    case DOM_Node::CDATA_SECTION_NODE:
        return "cdata ";
    case DOM_Node::ENTITY_REFERENCE_NODE:
        return "entity ref ";
    case DOM_Node::ENTITY_NODE:
        return "entity ";
    case DOM_Node::PROCESSING_INSTRUCTION_NODE:
        return "instruction ";
    case DOM_Node::COMMENT_NODE:
        return "comment ";
    case DOM_Node::DOCUMENT_NODE:
        return "document ";
    case DOM_Node::DOCUMENT_TYPE_NODE:
        return "doc type ";
    case DOM_Node::DOCUMENT_FRAGMENT_NODE:
        return "doc fragment";
    case DOM_Node::NOTATION_NODE:
        return "notation ";
    case DOM_Node::XML_DECL_NODE:
        return "xml decl ";
    default:
        return "unknown ";
    }
}
// Indent the output by one step per tree level
void lead_in(int level)
{
    for(int ix = 0; ix < level; ++ix)
    {
        cout << " ";
    }
}

// Output the node using the inline ostream&
// operator<<() function defined above
void output_values(int level, DOM_Node& node)
{
    lead_in(level);
    cout << node << endl;
}
// An xml declaration node has its own special attributes
// that must be retrieved using getVersion(), getEncoding()
// and getStandalone(). These values cannot be retrieved
// using a getAttributes() call.
void do_decl_node(int level,DOM_XMLDecl& decl)
{
    lead_in(level);
    cout << type_name(decl)
         << " version=" << StrX(decl.getVersion())
         << " encoding=" << StrX(decl.getEncoding())
         << " standalone=" << StrX(decl.getStandalone())
         << endl;
}
// Attributes of an element (or node) are retrieved as
// a NamedNodeMap through getAttributes(). The
// resulting list can be processed as if it were an
// indexed list as shown below, or by asking for
// attributes by name.
void do_attributes(int level,DOM_Node& node)
{
    DOM_NamedNodeMap nnm;
    DOM_Node attr;
    long ix;

    nnm = node.getAttributes();
    if(nnm == NULL)
    {
        return;
    }
    for(ix = 0; ix < nnm.getLength(); ++ix)
    {
        attr = nnm.item(ix);
        output_values(level+1,attr);
    }
}
// This function processes the first child, and then
// each next sibling, of the node passed in
void do_node_children(int level,DOM_Node& node)
{
    DOM_Node child;
    for(child = node.getFirstChild();
        child != NULL;
        child = child.getNextSibling())
    {
        do_node(level+1,child);
    }
}
// Process the node passed in, based on the node type
void do_node(int level,DOM_Node& node)
{
    switch(node.getNodeType())
    {
    // output an element, then its attributes,
    // and then its children
    case DOM_Node::ELEMENT_NODE:
        output_values(level,node);
        do_attributes(level,node);
        do_node_children(level,node);
        break;
    // output the document node and then
    // its children
    case DOM_Node::DOCUMENT_NODE:
        output_values(level,node);
        do_node_children(level,node);
        break;
    // output special xml decl values
    case DOM_Node::XML_DECL_NODE:
        do_decl_node(level,(DOM_XMLDecl&)node);
        break;
    // whatever else, output as a node
    default:
        output_values(level,node);
        break;
    }
}
// Set up a parser and then parse the
// XML in the passed in file name.
// Once it is parsed successfully,
// tree walk the nodes outputting values
void process(const char* xmlFile)
{
    // Instantiate the DOM parser.
    DOMParser parser;
    bool errorsOccurred = false;

    // set the parser to create an XML declaration node
    // default is not to do this
    parser.setToCreateXMLDeclTypeNode(true);

    // Note: a full blown parsing effort would
    // install a custom error handler that is derived
    // from the Xerces ErrorHandler class. For example:
    //   ErrorHandler *errHandler = new MyErrorHandler;
    //   parser.setErrorHandler(errHandler);
    // The DOMPrint sample project provides an example
    // of this.

    // try a parse with a broad catch on all exceptions
    try
    {
        parser.parse(xmlFile);
    }
    catch (...)
    {
        cerr << "An error occurred during parsing\n" << endl;
        errorsOccurred = true;
    }

    // bail out if there were errors, with a non-zero
    // exit status so the caller can detect the failure
    if(errorsOccurred)
        exit(1);

    // get the top level document node from the parser
    DOM_Document doc = parser.getDocument();

    // and tree walk the nodes reporting as you go
    do_node(0,doc);
}
// Set up, look for a file name, and process it
// if found. If there is no name on the command line,
// display usage and exit
int main(int argc, char* argv[])
{
    // Initialize the XML4C system
    try
    {
        XMLPlatformUtils::Initialize();
    }
    catch (const XMLException& toCatch)
    {
        cerr << "Error during initialization! :\n"
             << StrX(toCatch.getMessage()) << endl;
        return 1;
    }

    // need a file name on the command line
    if(argc < 2)
    {
        usage();
        exit(0);
    }

    // the only argument should be the file name
    process(argv[1]);

    // And call the termination method
    XMLPlatformUtils::Terminate();
    return 0;
}
Code Example 10 A Xerces exercizer program
Assuming a file containing the following information:
<?xml version='1.0'?>
<vessel>
  <!--
    I hear Magellan is in the market
    for five vessels for an around the
    world cruise. See if he likes the
    look of The Victoria. Don't talk to
    his buyer. He doesn't know a Nautical
    Mile from a potato.
  -->
  <record_type>VR</record_type>
  <registered_name>The Victoria</registered_name>
  <type>CARAVEL</type>
  <seller><![CDATA[Gilligan & Sons]]></seller>
  <compliment>48</compliment>
  <capacity>
    <power>SAIL</power>
    <range units="Nautical Miles">24000</range>
  </capacity>
</vessel>
Code Example 11 An XML declaration
The file would look like Code Example 12 once all the extraneous spaces are removed. In Code Example 12, the wrapping is caused by the screen width. The actual file would contain no line ends.
Code Example 12 An XML declaration
The output of processing this file with the exercizer is shown in Code Example 13.
document name=#document value=
 xml decl version=1.0 encoding= standalone=
 element name=vessel value=
  comment name=#comment value= I hear Magellan is in the market for five vessels for an around the world cruise. See if he likes the look of The Victoria. Don't talk to his buyer. He doesn't know a Nautical Mile from a potato.
  element name=record_type value=
   text name=#text value=VR
  element name=registered_name value=
   text name=#text value=The Victoria
  element name=type value=
   text name=#text value=CARAVEL
  element name=seller value=
   cdata name=#cdata-section value=Gilligan & Sons
  element name=compliment value=
   text name=#text value=48
  element name=capacity value=
   element name=power value=
    text name=#text value=SAIL
   element name=range value=
    attribute name=units value=Nautical Miles
    text name=#text value=24000
Code Example 13 The output of Code Example 10 running against Code Example 12
Once you have mastered this example, try some of the simpler Xerces samples such as
Resource
The Apache Project for the complete Xerces package. There are almost too many good XML sites, so it is difficult to zero in on a particular one. Two that are very useful are the O'Reilly site, and the XML 1.0 specification for the official scoop on the standard.
About the Author
Mo Budlong is president of King Computer Services and has been creating utilities, applications, and client server solutions on UNIX® boxes for over 15 years. He has published numerous books and articles on subjects ranging from Assembly language to XML. He is the author of the web based UNIX 101 column, and is soon to publish a book on UNIX basics. Mo is also a musician who plays guitar, bass and keyboards, and messes around on several other instruments. He lives in Southern California with his wife and daughter and The Cat.