Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX
The XML specification defines three classes of problems that can occur in an XML document. In order of decreasing severity, these are as follows:

Fatal Error
A well-formedness error. As soon as the parser detects it, it must throw in the towel and stop parsing. The parse() method throws a SAXParseException when a fatal error is detected. Parsers have a little leeway in whether they detect fatal errors. In particular, nonvalidating parsers may not catch certain fatal errors that occur in the external DTD subset, and many parsers don't actually check everything they're supposed to check. However, if a parser does detect a fatal error, then it must give up and stop parsing.

Error
An error but not a well-formedness error. The most common is a validity error, although there are a few other kinds as well. Some parsers classify violations of namespace well-formedness as errors. Parsers may or may not detect these errors. If a parser does detect one of these errors, it may or may not throw a SAXParseException and it may or may not continue parsing. (Validity errors generally do not cause SAXParseExceptions. Other kinds of errors may, depending on the parser.) These sorts of errors are a source of some interoperability problems in XML, because two parsers may behave differently given the same document.

Warning
Not itself an error. Nonetheless, it may indicate a mistake of some kind in the document. For example, a parser might issue a warning if it encountered an element named XMLDocument. That's because all names beginning with "XML" (in any arrangement of case) are reserved by the W3C for future standards. Parsers may or may not detect these types of problems. If a parser does detect one, it will not throw an exception but will continue parsing.

In addition, a parser may encounter an I/O problem that has nothing to do with XML.
For example, your cat might knock the Ethernet cable out of the back of your PC while you're downloading a large XML document from a remote web server.

If the parser detects a well-formedness error in the document it's parsing, then parse() throws a SAXException. In the event of an I/O error, it throws an IOException. The parser may or may not throw a SAXException in the event of a nonfatal error, and it will not throw an exception for a warning. As you can see, the only kind of XML problem the parser is guaranteed to tell you about through an exception is the well-formedness error. If you want to be informed of the other kinds of errors and possible problems, you need to implement the ErrorHandler interface and register your ErrorHandler implementation with the XMLReader.

SAXExceptions
The SAXException class, summarized in Example 7.4, is the generic exception class for almost anything (other than an I/O problem) that can go wrong while processing an XML document with SAX. Not only the parse() method but also most of the callback methods in the various SAX interfaces are declared to throw this exception. If you detect a problem while processing an XML document, your code can throw its own SAXException.

Example 7.4 The SAXException Class
    package org.xml.sax;

    public class SAXException extends Exception {

      public SAXException()
      public SAXException(String message)
      public SAXException(Exception rootCause)
      public SAXException(String message, Exception e)

      public String    getMessage()
      public Exception getException()
      public String    toString()

    }

Nested Exceptions
SAXException may not always be the exception you want to throw, however. For example, suppose you're parsing a document containing an XML digital signature, and the endElement() method notices that the base64-encoded text provided in the P element (which represents the prime modulus of a DSA key) does not decode to a prime number the way it's supposed to. You naturally want to throw a java.security.InvalidKeyException to warn the client application of this. But endElement() cannot throw a java.security.InvalidKeyException, only a SAXException. In this case, you wrap the exception you really want to throw inside a SAXException and throw the SAXException instead. For example,

    Exception nestedException
     = new InvalidKeyException("Modulus is not prime!");
    SAXException e = new SAXException(nestedException);
    throw e;

The code that catches the SAXException can retrieve the original exception using the getException() method. For example, the client application method might indeed be declared to throw an InvalidKeyException, so you could cast the nested exception to its real type and throw it into the appropriate catch block elsewhere in the call chain:

    catch (SAXException e) {
      Exception rootCause = e.getException();
      if (rootCause == null) {
        // handle it as an XML problem...
      }
      else {
        if (rootCause instanceof InvalidKeyException) {
          InvalidKeyException ike = (InvalidKeyException) rootCause;
          throw ike;
        }
        else if (rootCause instanceof SomeOtherException) {
          SomeOtherException soe = (SomeOtherException) rootCause;
          throw soe;
        }
        ...
      }
    }

SAXException Subclasses
SAX defines several more specific subclasses of SAXException for specific problems, even though most methods are only declared to throw a generic SAXException. These subclasses include SAXParseException, SAXNotRecognizedException, and SAXNotSupportedException. In addition, parsers can extend SAXException with their own custom subclasses, but few do this.

A SAXParseException indicates a fatal error, error, or warning in an XML document. The parse() method of the XMLReader interface throws this when it encounters a well-formedness error. SAXParseException is also passed as an argument to the methods of the ErrorHandler interface to signal any of the three kinds of problems an XML document may contain. In addition to the usual exception methods like getMessage() and printStackTrace() that SAXParseException inherits from its superclasses, it provides methods to get the public ID and system ID of the file where the well-formedness error occurs (remember, XML documents that use external parsed entities can be divided among multiple separate files) and the line number and column number within that file where the well-formedness error occurs.

Example 7.5 The SAXParseException Class
    package org.xml.sax;

    public class SAXParseException extends SAXException {

      public SAXParseException(String message, Locator locator)
      public SAXParseException(String message, Locator locator,
       Exception e)
      public SAXParseException(String message, String publicID,
       String systemID, int lineNumber, int columnNumber)
      public SAXParseException(String message, String publicID,
       String systemID, int lineNumber, int columnNumber, Exception e)

      public String getPublicId()
      public String getSystemId()
      public int    getLineNumber()
      public int    getColumnNumber()

    }

The line and column numbers that the parser reports for the problem may not always be perfectly accurate. Nonetheless, they should be close to where the problem begins or ends. (Some parsers give the line and column numbers for the start-tag of a problem element. Others give the line and column numbers for the end-tag.) If the document is so malformed that the parser can't even begin working with it, particularly if it isn't an XML document at all, then the parser will probably indicate that the error occurred at line -1, column -1.

Example 7.6 enhances last chapter's SAXChecker program so that it reports the line numbers of any well-formedness errors. There are two catch blocks, one for SAXParseException and another for the more generic SAXException, so it's possible to distinguish between well-formedness errors and other problems, such as not being able to find the right XMLReader implementation class.

Example 7.6 A SAX Program That Parses a Document and Identifies the Line Numbers of Any Well-Formedness Errors
    import org.xml.sax.*;
    import org.xml.sax.helpers.XMLReaderFactory;
    import java.io.IOException;

    public class BetterSAXChecker {

      public static void main(String[] args) {

        if (args.length <= 0) {
          System.out.println("Usage: java BetterSAXChecker URL");
          return;
        }
        String document = args[0];

        try {
          XMLReader parser = XMLReaderFactory.createXMLReader();
          parser.parse(document);
          System.out.println(document + " is well-formed.");
        }
        catch (SAXParseException e) {
          // A well-formedness error: report where it was detected
          System.out.print(document + " is not well-formed at ");
          System.out.print("line " + e.getLineNumber()
           + ", column " + e.getColumnNumber() );
          System.out.println(" in the entity " + e.getSystemId());
        }
        catch (SAXException e) {
          // Some other SAX problem, such as no parser class being found
          System.out.println("Could not check document because "
           + e.getMessage());
        }
        catch (IOException e) {
          System.out.println(
           "Due to an IOException, the parser could not check "
           + document
          );
        }

      }

    }

Following is the output I got when I first ran this program across my Cafe con Leche home page. The first time I neglected to specify a parser, which produced a generic SAXException. The second time I corrected that mistake, and a SAXParseException signaled a well-formedness error.

    % java BetterSAXChecker http://www.cafeconleche.org
    Could not check document because System property org.xml.sax.driver not specified
    % java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser BetterSAXChecker http://www.cafeconleche.org
    http://www.cafeconleche.org is not well-formed at line 64, column 64 in the entity http://www.cafeconleche.org/

Not-Necessarily-Fatal Errors
XML includes a few errors that fall into a gray area. These are errors but neither fatal well-formedness errors nor nonfatal validity errors. The most common such error is an ambiguous content model in an element declaration. For example, consider the following declaration, which states that an Actor can have between zero and two Parts:

    <!ELEMENT Actor (Part?, Part?)>

The problem occurs with an Actor element that has one Part, like this:

    <Actor>
      <Part>Cyrano</Part>
    </Actor>

Does this one Part match the first Part in the content model or the second one? There's no way to tell. Some parsers have trouble with this construct, and other parsers don't notice any problem at all. The XML specification calls this an error but does not classify it as a fatal error.

Different parsers treat these not-necessarily-fatal errors differently. Some parsers throw a SAXParseException when one is encountered. Other parsers let them pass without comment. And still others report them in a different way but do not throw an exception. For maximum compatibility, try to design your DTDs and instance documents to avoid this problem.

The ErrorHandler Interface
Throwing an exception aborts the parsing process, but not all problems encountered in an XML document necessarily require such a radical step. In particular, validity errors are not signaled by an exception because that would stop parsing. If you want your program to be informed of nonfatal errors, then you must register an ErrorHandler object with the XMLReader. Then the parser will tell you about problems in the document by passing (not throwing!) a SAXParseException to one of the methods in this object.

Example 7.7 summarizes the ErrorHandler interface. As you can see, it has three callback methods corresponding to the three different kinds of problems a parser may detect. When the parser detects one of these problems, it passes a SAXParseException to the appropriate method. If you want to treat errors or warnings as fatal, then you can throw the exception you were passed. (The parse() method will always throw an exception for a fatal error, even if you don't.) If you don't want to treat them as fatal (and most often you don't), then you can do something else with the information wrapped in the exception.

Example 7.7 The ErrorHandler Interface
    package org.xml.sax;

    public interface ErrorHandler {

      public void warning(SAXParseException exception)
       throws SAXException;
      public void error(SAXParseException exception)
       throws SAXException;
      public void fatalError(SAXParseException exception)
       throws SAXException;

    }

The XMLReader interface has the following two methods, the first to install an ErrorHandler and the second to retrieve the one currently installed:

    public void         setErrorHandler(ErrorHandler handler)
    public ErrorHandler getErrorHandler()

You can uninstall an ErrorHandler by passing null to setErrorHandler().

Example 7.8 is a program that checks documents for well-formedness errors and other problems. It reports all errors detected, no matter how small, through the ErrorHandler interface.

Example 7.8 A SAX Program That Reports All Problems Found in an XML Document
    import org.xml.sax.*;
    import org.xml.sax.helpers.XMLReaderFactory;
    import java.io.IOException;

    public class BestSAXChecker implements ErrorHandler {

      public void warning(SAXParseException exception) {
        System.out.println("Warning: " + exception.getMessage());
        System.out.println(" at line " + exception.getLineNumber()
         + ", column " + exception.getColumnNumber());
        System.out.println(" in entity " + exception.getSystemId());
      }

      public void error(SAXParseException exception) {
        System.out.println("Error: " + exception.getMessage());
        System.out.println(" at line " + exception.getLineNumber()
         + ", column " + exception.getColumnNumber());
        System.out.println(" in entity " + exception.getSystemId());
      }

      public void fatalError(SAXParseException exception) {
        System.out.println("Fatal Error: " + exception.getMessage());
        System.out.println(" at line " + exception.getLineNumber()
         + ", column " + exception.getColumnNumber());
        System.out.println(" in entity " + exception.getSystemId());
      }

      public static void main(String[] args) {

        if (args.length <= 0) {
          System.out.println("Usage: java BestSAXChecker URL");
          return;
        }
        String document = args[0];

        try {
          XMLReader parser = XMLReaderFactory.createXMLReader();
          ErrorHandler handler = new BestSAXChecker();
          parser.setErrorHandler(handler);
          parser.parse(document);
          // If the document isn't well-formed, an exception has
          // already been thrown and this has been skipped.
          System.out.println(document + " is well-formed.");
        }
        catch (SAXParseException e) {
          System.out.print(document + " is not well-formed at ");
          System.out.println("Line " + e.getLineNumber()
           + ", column " + e.getColumnNumber() );
          System.out.println(" in entity " + e.getSystemId());
        }
        catch (SAXException e) {
          System.out.println("Could not check document because "
           + e.getMessage());
        }
        catch (IOException e) {
          System.out.println(
           "Due to an IOException, the parser could not check "
           + document
          );
        }

      }

    }

Following is the output from running BestSAXChecker across the DocBook XML source code for an early version of this chapter:

    % java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser BestSAXChecker xmlreader.xml
    Error: The namespace prefix "xinclude" was not declared.
     at line 349, column 92
     in entity file:///D:/books/XMLJAVA/xmlreader.xml
    Error: The namespace prefix "xinclude" was not declared.
     at line 530, column 95
     in entity file:///D:/books/XMLJAVA/xmlreader.xml
    Error: The namespace prefix "xinclude" was not declared.
     at line 545, column 84
     in entity file:///D:/books/XMLJAVA/xmlreader.xml
    Error: The namespace prefix "xinclude" was not declared.
     at line 688, column 93
     in entity file:///D:/books/XMLJAVA/xmlreader.xml
    Fatal Error: The element type "para" must be terminated by the matching end-tag "</para>".
     at line 706, column 42
     in entity file:///D:/books/XMLJAVA/xmlreader.xml
    Could not check document because Stopping after fatal error: The element type "para" must be terminated by the matching end-tag "</para>".

BestSAXChecker complains several times about an undeclared namespace prefix for the XInclude elements I use to merge in source code examples like Example 7.8. Then, about three-quarters of the way through the document, it encounters a well-formedness error where I neglected to put an end-tag in the right place. At this point parsing stops. If there are any errors after that point, they aren't reported.
Once I fixed those problems, the file became well-formed and valid:

    % java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser BestSAXChecker xmlreader.xml
    xmlreader.xml is well-formed.

Beyond simple well-formedness, the errors that this program catches depend on the underlying parser. All conformant parsers detect all well-formedness errors. Most modern parsers should also catch any violations of namespace well-formedness. Whether this program catches validity errors depends on the parser. Most parsers do not validate by default. Instead they require the client application to explicitly request validation by setting the http://xml.org/sax/features/validation feature to true. I take this subject up next.
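As a rough illustration of the validation feature just mentioned, the following sketch parses the same small document twice, once with http://xml.org/sax/features/validation set to false and once with it set to true, and collects whatever the parser passes to error(). It assumes the parser bundled with the JDK, obtained through JAXP's SAXParserFactory rather than the XMLReaderFactory used in this chapter's examples. The class name ValidationSketch, the CollectingHandler helper, and the sample document are hypothetical, made up for this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class ValidationSketch {

    // Hypothetical sample: the DTD allows only a Part child, so the
    // Role element leaves the document well-formed but invalid.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE Actor [\n"
      + "  <!ELEMENT Actor (Part)>\n"
      + "  <!ELEMENT Part (#PCDATA)>\n"
      + "]>\n"
      + "<Actor><Role>Cyrano</Role></Actor>\n";

    // Records nonfatal errors instead of throwing them, so parsing continues.
    static class CollectingHandler implements ErrorHandler {
        final List<String> errors = new ArrayList<>();

        public void warning(SAXParseException e) {
            // Warnings are ignored in this sketch.
        }

        public void error(SAXParseException e) {
            errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
        }

        public void fatalError(SAXParseException e) throws SAXException {
            throw e; // a well-formedness error still aborts the parse
        }
    }

    // Parses xml with validation switched on or off and returns the
    // nonfatal errors the parser reported through the ErrorHandler.
    static List<String> check(String xml, boolean validate)
            throws SAXException, IOException, ParserConfigurationException {
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setFeature("http://xml.org/sax/features/validation", validate);
        CollectingHandler handler = new CollectingHandler();
        parser.setErrorHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Without validation: "
            + check(SAMPLE, false).size() + " error(s)");
        for (String error : check(SAMPLE, true)) {
            System.out.println("With validation, " + error);
        }
    }
}
```

Collecting the errors rather than rethrowing them is only one possible design. It lets the parse run to the end so that every validity problem the parser notices can be reported in a single pass, while fatal errors still stop parsing as the specification requires.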