Java Enterprise in a Nutshell (In a Nutshell (O'Reilly))
7.3. SAX
The SAX API provides a procedural approach to parsing an XML file. As a SAX parser iterates through an XML file, it performs callbacks to a user-specified object. These calls indicate the start or end of an element, the presence of character data, and other significant events during the life of the parser. SAX doesn't provide random access to the structure of the XML file; each tag must be handled as it is encountered by the parser. This means that SAX provides a relatively fast and efficient method of parsing. Because the SAX parser deals with only one element at a time, implementations can be extremely memory-efficient, making it often the only reasonable choice for dealing with particularly large files.

7.3.1. SAX Handlers
The SAX API allows you to create objects that handle XML parsing events by implementing the org.xml.sax.ContentHandler, org.xml.sax.ErrorHandler, and org.xml.sax.DTDHandler interfaces.[*] Processing a document with SAX involves passing a handler implementation to the parser and calling the parse( ) method of SAXParser. The parser reads the contents of the XML file, calling the appropriate method on the handler when significant events (such as the start of a tag) occur. All handler methods may throw a SAXException in the event of an error.

[*] We're not covering DTDHandler here, because it's rarely used. It is useful only if you need to know about unparsed entities and notations in the DTD associated with an XML document.

We'll take a look at the ContentHandler and ErrorHandler interfaces, and at the generic but useful DefaultHandler class, next.

7.3.1.1. ContentHandler
Most, if not all, SAX applications implement the ContentHandler interface. The SAX parser calls methods on a ContentHandler when it encounters basic XML constructs: chiefly, the start or end of a document, the start or end of an element, and character data within an element.

The startDocument( ) and endDocument( ) methods are called at the beginning[*] and end of the parsing process and take no parameters. Most applications use startDocument( ) to create any necessary internal data stores and use endDocument( ) to dispose of them (for example, by writing to the database).

[*] The first method called by the parser is actually setDocumentLocator( ), which provides the handler with an implementation of org.xml.sax.Locator. This object can report the current position of the parser within the XML file via its getColumnNumber( ), getLineNumber( ), getPublicId( ), and getSystemId( ) methods. However, while parser implementations are strongly encouraged to supply a locator, they aren't required to.

When the parser encounters a new element, it calls the startElement( ) method of the ContentHandler, passing a namespace URI, the local name of the element, the qualified name of the element (the namespace prefix and the local name), and an org.xml.sax.Attributes object containing the element attributes.

The Attributes interface allows the parser to inform the ContentHandler of attributes attached to an XML tag. For instance, the <order> tag in our earlier example contained two attributes, idnumber and custno, specified like this:

    <order idnumber="321" custno="98173">
To retrieve attributes when processing an element, call the getValue( ) method of the Attributes object:

    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts)
            throws SAXException {
        if (localName.equals("order"))
            System.out.println("New Order Number " + atts.getValue("idnumber") +
                               " for Customer Number " + atts.getValue("custno"));
    }
Note that before we can safely run this line, we need to make sure that we are processing an <order> tag; otherwise, there is no guarantee that the particular attributes we are querying will be available.[*]

[*] The parser returns all the attributes specified in the XML document, either explicitly or through a default value specified in the DTD. Attributes without defaults that aren't explicitly specified in the XML document itself aren't included.

When the parser encounters the closing tag of an element (</order>, in this case), the parser calls the endElement( ) method, passing the same namespace URI, local name, and qualified name that were passed to startElement( ). Every startElement( ) call will have a corresponding endElement( ) call, even when the element is empty.

These four methods all deal with handling information about XML tags but not with the data within a tag (unless that data is another tag). Much XML content consists of textual data outside the confines of tags and attributes. For example, here's the handling instruction from orders.xml:

    <handling>Please embroider in a tasteful manner!</handling>
When the SAX parser encounters the text between the tags, it calls the characters( ) method, passing a character array, a starting index within that array, and the length of the relevant character sequence within the array. This simple implementation of characters( ) prints the output to the screen:

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        System.out.print(new String(ch, start, length));
    }
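A handler that needs the text as a value, rather than as screen output, can buffer it. The following is a minimal sketch, not part of the original example: the class name HandlingCollector and its collect( ) helper are hypothetical, and the sample document is an assumption modeled on the <handling> tag above. It appends every chunk to a StringBuilder, because a parser may deliver the text of one element in several characters( ) calls, and it uses startElement( ) and endElement( ) to remember whether the parser is currently inside <handling>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text of every <handling> element.
class HandlingCollector extends DefaultHandler {
    private final List<String> instructions = new ArrayList<>();
    private StringBuilder text; // non-null only while inside <handling>

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // The factory below is not namespace-aware, so match on qName.
        if (qName.equals("handling")) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element, so append each chunk.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("handling")) {
            instructions.add(text.toString());
            text = null;
        }
    }

    // Parses the given XML string and returns the collected instructions.
    static List<String> collect(String xml) throws Exception {
        HandlingCollector handler = new HandlingCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.instructions;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><handling>Please embroider in a tasteful "
                   + "manner!</handling></order>";
        System.out.println(collect(xml));
    }
}
```

Emitting the buffered text from endElement( ) rather than from characters( ) is the design choice that makes the handler independent of how the parser happens to split the text.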
Note that there is no guarantee that all of the characters you want will be delivered in the same call. Also, since the characters( ) method doesn't include any references to the parent element, to perform more complicated tasks (such as treating the characters differently depending on the element that contains them), you will need to store the name of the current element within the handler class itself. Example 7-1, later in this chapter, shows how to do this via the startElement( ) method.

The characters( ) method might also be called when the parser encounters ignorable whitespace, such as a carriage return separating nested elements that don't otherwise have nested character data. If the parser is validating the document against a DTD, it must instead call the ignorableWhitespace( ) method to report these characters.

7.3.1.2. ErrorHandler
Since SAX is a language-independent specification, it doesn't handle parsing errors by throwing exceptions. Instead, a SAX parser reports errors by calling methods on a user-supplied object that implements the ErrorHandler interface. The ErrorHandler interface includes three methods: error( ), fatalError( ), and warning( ). Each method takes an org.xml.sax.SAXParseException parameter. The programmer is free to handle the errors in whatever manner she deems appropriate; however, the specification doesn't require parsing to continue after a call to fatalError( ).

7.3.1.3. DefaultHandler
The API also provides the org.xml.sax.helpers.DefaultHandler class, which implements all three handler interfaces. Since most handlers don't need to override every handler method, or even most, the easiest way to write a custom handler is to extend this class and override methods as necessary.

7.3.2. Using a SAX Parser
Once you have a handler or set of handlers, you need a parser. JAXP creates SAX parsers via a SAXParserFactory, as we saw earlier. The SAXParserFactory has three methods for further specifying parser behavior: setValidating( ) (which instructs the parser to validate the incoming XML file against its DTD or schema), setNamespaceAware( ) (which requests support for XML namespaces), and setFeature( ) (which allows configuration of implementation-specific attributes for parsers from particular vendors).

It is possible to parse a document directly from a SAXParser object by passing a DefaultHandler subclass to the parse( ) method, along with a File, URI, InputStream, or InputSource containing the XML to be parsed. For more control, call the getXMLReader( ) method of SAXParser, which returns an org.xml.sax.XMLReader object. This is the underlying parser that actually processes the input XML and calls the three handler objects. Accessing the XMLReader directly allows programs to set separate ContentHandler, ErrorHandler, and DTDHandler objects, rather than routing every callback through a single DefaultHandler.

All events in the SAX parsing cycle are synchronous. The parse( ) method will not return until the entire document has been parsed, and the parser will wait for each handler method to return before calling the next one.

7.3.2.1. A SAX example: Processing orders
Example 7-1 uses a SAX DefaultHandler to process an XML document containing a set of incoming orders for a small business. It uses the startElement( ) method of ContentHandler to process each element, displaying relevant information. Element attributes are processed via the Attributes object passed to the startElement( ) method. When the parser encounters text within a tag, it calls the characters( ) method of ContentHandler.

You can also call the setProperty( ) method on the SAXParser to control its behavior. The standard JAXP property we saw in the previous section can be set using this method, for example.

Example 7-1. Parsing XML with SAX
    import javax.xml.parsers.*;
    import org.xml.sax.*;

    public class OrderHandler extends org.xml.sax.helpers.DefaultHandler {

        public static void main(String[] args) {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(true);     // request a validating parser
            spf.setNamespaceAware(true); // needed for localName to be populated

            try {
                SAXParser saxParser = spf.newSAXParser();
                /* We need an XMLReader to use an ErrorHandler.
                   We could just pass the OrderHandler to the parser if we
                   wanted to use the default error handler. */
                XMLReader xmlReader = saxParser.getXMLReader();
                xmlReader.setContentHandler(new OrderHandler());
                xmlReader.setErrorHandler(new OrderErrorHandler());
                xmlReader.parse("orders.xml");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // The startDocument() method is called at the beginning of parsing
        public void startDocument() throws SAXException {
            System.out.println("Incoming Orders:");
        }

        // The startElement() method is called at the start of each element
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts)
                throws SAXException {
            if (localName.equals("order")) {
                System.out.print("\nNew Order Number " + atts.getValue("idnumber") +
                                 " for Customer Number " + atts.getValue("custno"));
            } else if (localName.equals("item")) {
                System.out.print("\nLine Item: " + atts.getValue("idnumber") +
                                 " (Qty " + atts.getValue("quantity") + ")");
            } else if (localName.equals("shippingaddr")) {
                System.out.println("\nShip by " + atts.getValue("method") + " to:");
            } else if (localName.equals("handling")) {
                System.out.print("\n\tHandling Instructions: ");
            }
        }

        // Print characters within a tag.
        // This will print the contents of the <shippingaddr> and <handling> tags.
        // There is no guarantee that all characters will be delivered in a
        // single call.
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            System.out.print(new String(ch, start, length));
        }

        /* A custom error-handling class. DefaultHandler also implements
           ErrorHandler, so OrderHandler could override these methods itself.
           Here we just throw errors back to the caller and print warnings. */
        private static class OrderErrorHandler implements ErrorHandler {
            public void error(SAXParseException spe) throws SAXException {
                throw new SAXException(spe);
            }

            public void warning(SAXParseException spe) throws SAXException {
                System.out.println("\nParse Warning: " + spe.getMessage());
            }

            public void fatalError(SAXParseException spe) throws SAXException {
                throw new SAXException(spe);
            }
        }
    }
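The comment in Example 7-1 mentions the simpler route of handing the handler straight to the parser. A minimal sketch of that route follows, assuming a small inline document in place of orders.xml; the class name DirectParseSketch, the summarize( ) helper, and the sample XML are hypothetical. It calls the parse(InputStream, DefaultHandler) overload of SAXParser, so no XMLReader is needed, and it overrides only startElement( ) on an anonymous DefaultHandler subclass.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: parse directly through SAXParser.parse().
class DirectParseSketch {

    // Returns one line of text per <order> element found in the XML.
    static String summarize(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // so that localName is populated
        SAXParser parser = spf.newSAXParser();

        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (localName.equals("order")) {
                    out.append("order ").append(atts.getValue("idnumber"))
                       .append(" for customer ").append(atts.getValue("custno"))
                       .append('\n');
                }
            }
        };

        // The handler receives all callbacks; parse() returns when done.
        parser.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orders><order idnumber=\"321\" custno=\"98173\"/></orders>";
        System.out.print(summarize(xml));
    }
}
```

Because DefaultHandler's own error( ) and warning( ) methods do nothing, this shortcut silently ignores recoverable errors unless the subclass overrides them as well.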
In a real application, we would want to treat error handling in a more robust fashion, probably by reporting parse errors to a logging utility or EJB. An actual order management utility would populate a database table or an Enterprise JavaBean.
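One possible shape for such error reporting is sketched below, under stated assumptions: the class name CollectingErrorHandler and its check( ) helper are hypothetical, and a plain list stands in for whatever logging utility an application would really use. Instead of rethrowing, each ErrorHandler method records the severity, the position taken from the SAXParseException, and the message.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: record parse problems instead of rethrowing them.
class CollectingErrorHandler implements ErrorHandler {
    private final List<String> problems = new ArrayList<>();

    private void record(String level, SAXParseException e) {
        problems.add(level + " at line " + e.getLineNumber()
                + ", column " + e.getColumnNumber() + ": " + e.getMessage());
    }

    @Override
    public void warning(SAXParseException e) { record("WARNING", e); }

    @Override
    public void error(SAXParseException e) { record("ERROR", e); }

    @Override
    public void fatalError(SAXParseException e) { record("FATAL", e); }

    // Parses the XML string and returns every problem that was reported.
    static List<String> check(String xml) throws Exception {
        CollectingErrorHandler errors = new CollectingErrorHandler();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler());
        reader.setErrorHandler(errors);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // The parser may stop after a fatal error; the handler has
            // normally recorded that error before the exception arrives.
        }
        return errors.problems;
    }

    public static void main(String[] args) throws Exception {
        // Malformed on purpose: <order> is never closed.
        for (String problem : check("<orders><order></orders>")) {
            System.out.println(problem);
        }
    }
}
```

A production version would forward each record to the application's logger and decide, per severity, whether to abort the import or carry on with the next document.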