Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX
Attributes are not reported through separate callbacks. Instead, an Attributes object containing all the attributes of an element is passed to the startElement() method for the start-tag or empty-element tag of the element that possesses the attributes. Example 6.8 summarizes the Attributes interface.

Example 6.8 The SAX Attributes Interface
package org.xml.sax;

public interface Attributes {

  public int    getLength();
  public String getQName(int index);
  public String getURI(int index);
  public String getLocalName(int index);

  public int    getIndex(String uri, String localPart);
  public int    getIndex(String qualifiedName);

  public String getType(String uri, String localName);
  public String getType(String qualifiedName);
  public String getType(int index);

  public String getValue(String uri, String localName);
  public String getValue(String qualifiedName);
  public String getValue(int index);

}

If you know the qualified name, or the namespace URI and local name, of the attribute you want, Attributes can look up its value and type. If you don't know the names of the attributes at compile time, you can iterate through all of the attributes of an element instead.

Attributes are unordered. However, for programmer convenience the Attributes interface is designed as a list. You can ask for the value, local name, qualified name, type, and namespace URI of an attribute by giving its index into the list. Just don't assume that the order of the attributes in this list is the same as in the original document. More often than not, it isn't.

The type of the attribute is reported as one of these nine constant strings, exactly as types would be indicated in an ATTLIST declaration in a DTD:

CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, ENTITY, ENTITIES, NOTATION
Enumerated types are reported as having type NMTOKEN. Undeclared attributes are reported as having type CDATA. SAX does not yet support schema types such as int or gYear. Maybe in SAX 3.0.
Caution: A few parsers are not 100 percent compliant with the SAX specification here. In particular, Crimson and Xerces 2.0.x use the string ENUMERATION for enumerated types instead of NMTOKEN. Xerces 1.4. reports an enumerated type as a string containing the actual enumeration, for example, (yes no maybe).
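The index-based access described above can be tried out with a small handler. The following is only a sketch under stated assumptions: the class name AttributeLister, its list() helper, and the sample poll document are hypothetical names invented for illustration, and the parser is whatever implementation JAXP's SAXParserFactory happens to supply rather than any particular product.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of SAX or of the book's examples.
public class AttributeLister extends DefaultHandler {

    // One line of text per attribute, in whatever order the parser reports them
    private final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        // Walk the list by index; the names need not be known in advance
        for (int i = 0; i < atts.getLength(); i++) {
            lines.add(qualifiedName + "/@" + atts.getQName(i)
                + " [" + atts.getType(i) + "] = " + atts.getValue(i));
        }
    }

    // Parses the document held in a string and returns one line per attribute.
    public static List<String> list(String xml) {
        try {
            // Default factory settings: not namespace aware, so qualified
            // names are the reliable way to identify elements and attributes
            SAXParserFactory factory = SAXParserFactory.newInstance();
            XMLReader parser = factory.newSAXParser().getXMLReader();
            AttributeLister lister = new AttributeLister();
            parser.setContentHandler(lister);
            parser.parse(new InputSource(new StringReader(xml)));
            return lister.lines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Sample document (an assumption): id is declared in the internal
        // DTD subset, answer is not declared at all
        String xml = "<!DOCTYPE poll [\n"
            + "  <!ATTLIST poll id ID #REQUIRED>\n"
            + "]>\n"
            + "<poll id='p1' answer='maybe'/>";
        for (String line : list(xml)) {
            System.out.println(line);
        }
    }
}
```

Because the id attribute is declared in the internal DTD subset, a conforming parser should report its type as ID, while the undeclared answer attribute should come back as CDATA. The order of the two lines is up to the parser.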
If a declared attribute has any type other than CDATA, then the parser normalizes its value. This means that each tab, carriage return, and linefeed is converted to a single space; runs of spaces are converted to a single space; and leading and trailing white space is stripped. Only normalized values are reported by the getValue() methods. However, in order to determine an attribute's type, the parser must read the DTD. If an attribute is declared in the external DTD subset, then nonvalidating parsers that do not read the external subset will assume the attribute has type CDATA and fail to normalize it.

If you ask an Attributes object for information about an attribute (for example, its type, name, or value) that is not in that particular list, then all of the methods that normally return a String return null instead. The getIndex() methods return -1. None of these methods throws any exceptions. However, if you try to use the return values without checking for null or -1 first, then you're asking for a NullPointerException or an ArrayIndexOutOfBoundsException.

SAX 2.0 does not distinguish between attributes that were present in the instance document and attributes that were defaulted in from the DTD or schema. This may be added in SAX 2.1.

For an example, I'm going to develop a web spider that follows simple XLinks. XLink is an attribute-based syntax for embedding hypertext in arbitrary XML documents. Elements are identified as XLinks by an xlink:type attribute with the value simple. (There's also a more powerful and more complex extended XLink, which I'm going to ignore for the purposes of this example.) The URL the link points to is contained in an xlink:href attribute. The xlink prefix is mapped to the namespace URI http://www.w3.org/1999/xlink. As always, the prefix can change as long as the URI stays the same.
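The two points just made, that missing attributes come back as null or -1 and that only the namespace URI matters, not the prefix, can be checked with a short handler. This is a minimal sketch, not a definitive implementation: the class XLinkLookup, its parse() helper, its getters, and the attribute name nosuch are all hypothetical, and the namespace URI is the one that Example 6.9 looks up, http://www.w3.org/1999/xlink. The parser is again assumed to be whatever JAXP provides.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of SAX or of the book's examples.
public class XLinkLookup extends DefaultHandler {

    public static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    private String href;      // stays null unless a simple XLink is seen
    private int absentIndex;  // index reported for an attribute that is absent

    @Override
    public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        // Look up by namespace URI and local name, never by prefix
        String type = atts.getValue(XLINK_NS, "type");
        // Putting the literal first makes the comparison safe when type is null
        if ("simple".equals(type)) {
            String target = atts.getValue(XLINK_NS, "href");
            if (target != null) {
                href = target;
            }
        }
        // "nosuch" is deliberately absent, so this records -1
        absentIndex = atts.getIndex(XLINK_NS, "nosuch");
    }

    public String getHref() { return href; }

    public int getAbsentIndex() { return absentIndex; }

    // Parses the document held in a string and returns the populated handler.
    public static XLinkLookup parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Needed so that URI plus local name lookups work
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            XLinkLookup handler = new XLinkLookup();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Same link twice, with two different prefixes for the same URI
        String a = "<magazine xmlns:xlink='" + XLINK_NS + "' xlink:type='simple'"
            + " xlink:href='http://www.thenation.com/'>The Nation</magazine>";
        String b = "<foo xmlns:xl='" + XLINK_NS + "' xl:type='simple'"
            + " xl:href='http://www.thenation.com/'>Foo</foo>";
        System.out.println(parse(a).getHref());
        System.out.println(parse(b).getHref());
        System.out.println(parse(b).getAbsentIndex());
    }
}
```

An href attribute that is not in the XLink namespace, such as a plain href='...' with no prefix, is not found by these lookups, so getHref() stays null for such a document.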
For example, this is an XLink that points to The Nation's home page:

<magazine xmlns:xlink="http://www.w3.org/1999/xlink"
          xlink:type="simple"
          xlink:href="http://www.thenation.com/">
  The Nation
</magazine>

Note especially that the element name and content are irrelevant to the link, which is encoded purely in attributes. The same link could be written as follows:

<foo xmlns:xlink="http://www.w3.org/1999/xlink"
     xlink:type="simple"
     xlink:href="http://www.thenation.com/">
  Foo
</foo>

All of the information required to process the link is included in the attributes. Consequently, we can use the Attributes interface and the startElement() method to design a spider that follows XLinks. Example 6.9 is such a program. Currently this spider does nothing more than follow the links and print their URLs, but it would not be hard to add code to load the discovered documents into a database or perform some other useful operation.

Example 6.9 A ContentHandler Class That Spiders XLinks
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;
import java.util.*;

public class SAXSpider extends DefaultHandler {

  // Need to keep track of where we've been
  // so we don't get stuck in an infinite loop
  private List spideredURIs = new Vector();

  // This linked list keeps track of where we're going.
  // Although the LinkedList class does not guarantee queue-like
  // access, I always access it in a first-in/first-out fashion.
  private LinkedList queue = new LinkedList();

  private String currentURI;
  private XMLReader parser;

  public SAXSpider(XMLReader parser, String uri) {
    this.parser = parser;
    this.currentURI = uri;
  }

  public void endDocument() {

    spideredURIs.add(currentURI);
    System.out.println("Visited " + currentURI);
    String uri;
    try {
      // Links are added at the head, so the oldest one is at the tail
      uri = (String) queue.removeLast();
    }
    catch (NoSuchElementException e) {
      // The queue is empty; we're finished.
      return;
    }
    this.currentURI = uri;
    try {
      parser.parse(uri);
    }
    catch (Exception e) {
      // just skip this one and move on to the next
      this.endDocument();
    }

  }

  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {

    String type = atts.getValue("http://www.w3.org/1999/xlink", "type");
    if (type != null) {
      String href = atts.getValue("http://www.w3.org/1999/xlink", "href");
      if (href != null) {
        if (!spideredURIs.contains(href)) {
          queue.addFirst(href);
        }
      }
    }

  }

  public static void main(String[] args) {

    if (args.length == 0) {
      System.out.println("Usage: java SAXSpider URL1");
      // Without this return, args[0] below would throw an
      // ArrayIndexOutOfBoundsException
      return;
    }
    String uri = args[0];

    try {
      XMLReader parser = XMLReaderFactory.createXMLReader(
        "org.apache.xerces.parsers.SAXParser"
      );

      // Install the ContentHandler
      ContentHandler spider = new SAXSpider(parser, uri);
      parser.setContentHandler(spider);
      parser.parse(uri);
    }
    catch (Exception e) {
      System.err.println(e);
    }

  } // end main

} // end SAXSpider

The startElement() method simply inspects the tag for the two relevant XLink attributes. It looks for them by namespace URI and local name.
If startElement() finds a link whose URL has not yet been visited, it adds that URL to the end of the queue of URLs that need to be visited. The endDocument() method prints out the URL of the document it has just finished parsing. Then it retrieves the next URL from the front of the queue and parses it. This program is a little unusual in that not only does the XMLReader call back to the ContentHandler, but the ContentHandler also calls back to its XMLReader.

The main() method reads the starting URL from the command line, constructs an XMLReader and a SAXSpider, and parses the initial URL. The program runs automatically from there. There's no limit to the depth or number of documents this spider will search, although currently the paucity of XLinked documents on the Web makes it unlikely that this program will run forever. Furthermore, because it isn't designed to run in parallel, there's little chance of it overwhelming anybody's server. Nonetheless, limiting its search depth would be a good feature to add.
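One way the depth limit suggested above might be added is to store a depth alongside each queued URL. The following sketch shows only that bookkeeping and is not the author's design: the class DepthLimitedFrontier and its offer() and next() methods are hypothetical names, and the class does no parsing or networking of its own.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical helper: a first-in/first-out queue of URIs that remembers
// how many links away from the starting document each URI was found.
public class DepthLimitedFrontier {

    private static final class Entry {
        final String uri;
        final int depth;
        Entry(String uri, int depth) {
            this.uri = uri;
            this.depth = depth;
        }
    }

    private final int maxDepth;
    private final Queue<Entry> queue = new ArrayDeque<>();
    // Everything ever queued, so no URI is queued or visited twice
    private final Set<String> seen = new HashSet<>();
    // Depth of the document most recently handed out by next()
    private int currentDepth;

    public DepthLimitedFrontier(String startURI, int maxDepth) {
        this.maxDepth = maxDepth;
        seen.add(startURI);
        queue.add(new Entry(startURI, 0));
    }

    // Offers a link found in the document most recently returned by next().
    // Returns true if the link was queued, false if it was a duplicate
    // or lies beyond the depth limit.
    public boolean offer(String href) {
        if (currentDepth >= maxDepth) {
            return false; // following this link would exceed the limit
        }
        if (!seen.add(href)) {
            return false; // already queued or visited
        }
        queue.add(new Entry(href, currentDepth + 1));
        return true;
    }

    // Returns the next URI to parse, or null when nothing is left.
    public String next() {
        Entry entry = queue.poll();
        if (entry == null) {
            return null;
        }
        currentDepth = entry.depth;
        return entry.uri;
    }
}
```

A spider built on this helper could call offer(href) from startElement() and drive the parser from an ordinary loop that calls next() until it returns null, parsing each URI in turn. That arrangement would also avoid starting each new parse from inside endDocument().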