Working with XML

In this chapter, you find out how to work with XML, the best thing to happen to computing since the invention of the vacuum tube, at least according to some overenthusiastic prognosticators.

This chapter focuses on the basics of reading an XML document into memory and extracting data from it. With the background in this chapter, you shouldn't have much trouble studying the API documentation on your own to learn more about XML programming.

What Exactly Is XML, Anyway?

Most computer industry pundits agree that XML will completely change the way you work with computers. Here are just some of the ways XML will revolutionize the world of computers:

Yawn.

So what is XML, really? Simply put, XML is a way to store and exchange information in a standardized way that's easy to create, retrieve, and transfer between different types of computer systems or programs.

When XML is stored in a file, the file is usually given the extension .xml.

Tags

Like HTML, XML uses tags to mark the data. For example, here's a bit of XML that describes a book:

<Book>
    <Title>Java All-In-One Desk Reference For Dummies</Title>
    <Author>Lowe</Author>
</Book>

This chunk of XML defines an element called Book, which contains information for a single book. The Book element, in turn, contains two subordinate elements: Title and Author.

Notice how each element begins with a tag that lists the element's name. This tag is called the start tag. The element ends with a tag that repeats the element name, preceded by a slash. This tag is called the end tag.

Everything that appears between the start tag and the end tag is the element's content. An element's content can consist of text data, or it can consist of one or more additional elements. In the latter case, the additional elements nested within an element are called child elements, and the element that contains them is called the parent element.

The highest-level element in an XML document is called the root element. A properly formed XML document consists of a single root element, which can contain elements nested within it. For example, suppose you want to create an XML document with information about two movies. The XML document might look something like this:

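<Movies>
    <Movie>
        <Title>It's a Wonderful Life</Title>
        <Year>1946</Year>
        <Price>14.95</Price>
    </Movie>
    <Movie>
        <Title>The Great Race</Title>
        <Year>1965</Year>
        <Price>12.95</Price>
    </Movie>
</Movies>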

Here the root element named Movies contains two Movie elements, each of which contains Title, Year, and Price elements.
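One way to see the single-root rule in action is to hand a document to a parser and see whether the parser accepts it. The following sketch is only an illustration: the class and method names (WellFormedSketch, parse, isWellFormed, rootName) are made up for this example, and the XML is held in a string rather than read from a file. It relies on the standard parser classes that ship with Java, which are covered later in this chapter.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedSketch {
    // Sample data modeled on the two-movie document shown above.
    static final String MOVIES =
        "<Movies>"
        + "<Movie><Title>It's a Wonderful Life</Title>"
        + "<Year>1946</Year><Price>14.95</Price></Movie>"
        + "<Movie><Title>The Great Race</Title>"
        + "<Year>1965</Year><Price>12.95</Price></Movie>"
        + "</Movies>";

    // Parses the text; returns the document, or null if it isn't well formed.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return null;
        }
    }

    static boolean isWellFormed(String xml) {
        return parse(xml) != null;
    }

    // Assumes the text is well formed; call isWellFormed first if unsure.
    static String rootName(String xml) {
        return parse(xml).getDocumentElement().getNodeName();
    }

    public static void main(String[] args) {
        System.out.println(rootName(MOVIES));
        System.out.println(isWellFormed(MOVIES));
        // Two root elements: rejected by the parser.
        System.out.println(isWellFormed("<Movie/><Movie/>"));
        // Start tag with no matching end tag: also rejected.
        System.out.println(isWellFormed("<Movie><Title></Movie>"));
    }
}
```

The two rejected strings break the rules described in this section: the first has two root elements, and the second has a start tag with no matching end tag.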

  Tip 

Although XML superficially resembles HTML, there are two key differences between XML and HTML:

  • The tags used in HTML indicate the format of the data that's displayed. In contrast, tags in an XML document indicate the meaning of the data. For example, HTML has tags such as <B> and <I> that indicate data is bold or italic. In contrast, an XML document that holds information about books may have tags such as <Title> and <Author> that provide the title and author of the book.
  • The tags used in an HTML document are set in stone. In contrast, you can make up any tags you want to use in an XML document. If you're creating an XML document about cars, you can invent tags that describe cars. But if you're creating an XML document about classes taught at a university, you can invent tags that describe courses.

Attributes

Instead of using child elements, you can use attributes to provide data for an element. An attribute is a name and value pair that's written inside of the start tag for an element. For example, here's a Movie element that uses an attribute instead of a child element to record the year:

<Movie year="1946">
    <Title>It's a Wonderful Life</Title>
    <Price>14.95</Price>
</Movie>

Whether you use attributes or child elements is largely a matter of personal preference. Many XML purists say that you should avoid attributes, or use them only for identifying data such as identification numbers or codes. Others say to use attributes freely. In my experience, a few attributes here and there don't hurt, but I avoid them for the most part.

The movies.xml file

For your reference, Listing 5-1 shows the movies.xml file that the programs that appear later in this chapter use.

Listing 5-1: The movies.xml File

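<?xml version="1.0"?>
<Movies>
  <Movie year="1946">
    <Title>It's a Wonderful Life</Title>
    <Price>14.95</Price>
  </Movie>
  <Movie year="1965">
    <Title>The Great Race</Title>
    <Price>12.95</Price>
  </Movie>
  <Movie year="1974">
    <Title>Young Frankenstein</Title>
    <Price>16.95</Price>
  </Movie>
  <Movie year="1975">
    <Title>The Return of the Pink Panther</Title>
    <Price>11.95</Price>
  </Movie>
  <Movie year="1977">
    <Title>Star Wars</Title>
    <Price>17.95</Price>
  </Movie>
  <Movie year="1987">
    <Title>The Princess Bride</Title>
    <Price>16.95</Price>
  </Movie>
  <Movie year="1989">
    <Title>Glory</Title>
    <Price>14.95</Price>
  </Movie>
  <Movie year="1995">
    <Title>Apollo 13</Title>
    <Price>19.95</Price>
  </Movie>
  <Movie year="1997">
    <Title>The Game</Title>
    <Price>14.95</Price>
  </Movie>
  <Movie year="2001">
    <Title>The Fellowship of the Ring</Title>
    <Price>19.95</Price>
  </Movie>
</Movies>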

Using a DTD

An XML document can have a DTD, which spells out exactly what elements can appear in an XML document and in what order the elements can appear. DTD stands for Document Type Definition, but that won't be on the test.

For example, a DTD for an XML document about movies may specify that each Movie element must have Title and Price subelements and an attribute named year. It can also specify that the root element is named Movies and consists of any number of Movie elements.

  REMEMBER 

The main purpose of the DTD is to spell out the structure of an XML document so that users of the document know how to interpret it. But another equally important use of the DTD is to validate the document to make sure it doesn't have any structural errors. For example, if you create a Movies XML document that has two titles for a movie, you can use the DTD to detect the error.
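To make the two-titles case concrete, here's a rough sketch of how a Java program might collect validation errors. Several things here are assumptions made for illustration: the class name DtdValidationSketch, the validate method, and the DTD itself, which is written from the description above and embedded in the document so that the example needs no files. The parser reports validity problems to an ErrorHandler, so the sketch registers one that saves each message in a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // Assumed DTD, embedded in the document so no separate file is needed.
    static final String DOCTYPE =
        "<!DOCTYPE Movies ["
        + "<!ELEMENT Movies (Movie*)>"
        + "<!ELEMENT Movie (Title, Price)>"
        + "<!ATTLIST Movie year CDATA #REQUIRED>"
        + "<!ELEMENT Title (#PCDATA)>"
        + "<!ELEMENT Price (#PCDATA)>"
        + "]>";

    static final String GOOD =
        "<Movies><Movie year=\"1989\"><Title>Glory</Title>"
        + "<Price>14.95</Price></Movie></Movies>";

    static final String TWO_TITLES =
        "<Movies><Movie year=\"1989\"><Title>Glory</Title>"
        + "<Title>Glory Again</Title><Price>14.95</Price></Movie></Movies>";

    // Returns the validation messages for the document; empty means valid.
    static List<String> validate(String body) {
        final List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) {
                    problems.add(e.getMessage());   // validity errors land here
                }
                @Override public void fatalError(SAXParseException e)
                        throws SAXException {
                    throw e;                        // not well formed: give up
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (Exception e) {
            problems.add(e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println("Good document: " + validate(GOOD));
        System.out.println("Two titles:    " + validate(TWO_TITLES));
    }
}
```

The first call should report no problems. The second should report at least one, because the Movie element no longer matches the content that the DTD allows.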

You can store the DTD for an XML document in the same file as the XML data, but more often you store the DTD in a separate file. That way, you can use a DTD to govern the format of several XML documents of the same type. To indicate the name of the file that contains the DTD, you add a <!DOCTYPE> tag to the XML document. Here's an example:

<!DOCTYPE Movies SYSTEM "movies.dtd">

Here the XML file is identified as a Movies document, whose DTD you can find in the file movies.dtd. Add this tag near the beginning of the movies.xml file, right after the <?xml?> tag.

Listing 5-2 shows a DTD file for the movies.xml file that was shown in Listing 5-1.

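Listing 5-2: A DTD File for the movies.xml File

<!ELEMENT Movies (Movie*)>
<!ELEMENT Movie (Title, Price)>
<!ATTLIST Movie year CDATA #REQUIRED>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Price (#PCDATA)>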

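Each of the ELEMENT tags in a DTD defines a type of element that can appear in the document and indicates what can appear as the content for that element type. The general form of the ELEMENT tag is this:

<!ELEMENT element (content)>

Attributes are declared with ATTLIST tags. For example, here's a declaration for the year attribute of the Movie element:

<!ATTLIST Movie year CDATA #REQUIRED>

This declaration indicates that the attribute goes with the Movie element, is named year, can be any kind of data, and is required.

An ATTLIST tag can also specify a list of possible values along with a default. In that case, the list of values, separated by vertical bars and enclosed in parentheses, takes the place of CDATA, and the default value follows in quotation marks.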

This form of the ATTLIST tag lets you create an attribute that's similar to an enumeration, with a list of acceptable values.
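As a rough illustration of an enumerated attribute with a default, the following sketch parses a tiny document whose DTD declares one. The attribute name genre, its list of values, and the class and method names are all invented for this example and aren't part of the movies.xml file. The DTD is embedded in the document so that the example is self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EnumeratedAttributeSketch {
    // "genre" and its values are hypothetical, made up for this illustration.
    static final String DOCTYPE =
        "<!DOCTYPE Movie ["
        + "<!ELEMENT Movie (#PCDATA)>"
        + "<!ATTLIST Movie genre (SciFi|Comedy|Drama) \"Comedy\">"
        + "]>";

    // Parses a single Movie element and returns its genre attribute.
    static String genreOf(String movieElement) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                new InputSource(new StringReader(DOCTYPE + movieElement)));
            return doc.getDocumentElement().getAttribute("genre");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        // No genre given, so the parser supplies the default from the DTD.
        System.out.println(genreOf("<Movie>Young Frankenstein</Movie>"));
        // An explicit value from the list overrides the default.
        System.out.println(genreOf("<Movie genre=\"SciFi\">Star Wars</Movie>"));
    }
}
```

When the attribute is omitted, the parser fills in the default from the DTD, so the program sees a value either way.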

Processing XML in Two Ways

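In general, you can process XML documents in a Java program with two approaches. These two approaches are referred to as DOM and SAX:

  • DOM: Stands for Document Object Model. With DOM, the entire XML document is read into memory and stored as a tree of objects that your program can navigate.
  • SAX: Stands for Simple API for XML. With SAX, the parser reads the document sequentially and notifies your program as it encounters each part of the document, so the whole document never has to be held in memory at once.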

In this chapter, I cover the basics of using DOM to retrieve information from an XML document. DOM represents an XML document in memory as a tree of Node objects. For example, Figure 5-1 shows a simplified DOM tree for an XML document that has two Movie elements. Notice that the root element (Movies) is a node, each Movie element is a node, and each Title and Price element is a node. In addition, text values are stored as child nodes of the elements they belong to. Thus, the Title and Price elements each have a child node that contains the text for these elements.
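If you'd like to see this tree structure for yourself, a short recursive method can print it. The following is a minimal sketch, not code from this chapter's listings: the class name DomTreeSketch and the describe methods are made up, and the one-movie sample document is supplied as a string. The parsing calls it uses are explained in the next section.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomTreeSketch {
    // Sample data: one movie, with no white space between the elements.
    static final String SAMPLE =
        "<Movies><Movie><Title>Glory</Title>"
        + "<Price>14.95</Price></Movie></Movies>";

    // Appends one line for this node, then recurses into its children.
    static void describe(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append("Text: ").append(node.getNodeValue());
        } else {
            out.append("Element: ").append(node.getNodeName());
        }
        out.append('\n');
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            describe(child, depth + 1, out);
        }
    }

    // Parses the text and returns an indented outline of its DOM tree.
    static String describe(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            describe(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(describe(SAMPLE));
    }
}
```

In the outline this prints, each Text line is indented one level deeper than its Title or Price element, which shows that the text lives in a child node rather than in the element itself.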

Figure 5-1: A DOM document tree.

Reading a DOM Document

Before you can process a DOM document, you have to read the document into memory from an XML file. You'd think that would be a fairly straightforward proposition, but unfortunately it involves some pretty strange incantations. Rather than go through all the classes and methods you have to use, I just look at the finished code for a complete method that accepts a String that contains a filename as a parameter and returns a Document object as its return value. Along the way, you find out what each class and method does.

Here's the code:

private static Document getDocument(String name)
{
    try
    {
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
        factory.setIgnoringComments(true);
        factory.setIgnoringElementContentWhitespace(true);
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(name));
    }
    catch (Exception e)
    {
        System.out.println(e.getMessage());
    }
    return null;
}

Creating a document builder factory

The first statement of the preceding example calls the newInstance method of the DocumentBuilderFactory class to create a new DocumentBuilderFactory object. The job of the document builder factory is to create document builder objects that are able to read XML input and create DOM documents in memory.

Why not just call the DocumentBuilderFactory constructor? It turns out that DocumentBuilderFactory is an abstract class, so you can't create an instance of it directly. newInstance is a static method that determines which class to create an instance of based on the way your system is configured.

Configuring the document builder factory

After you get a document builder factory, you can configure it so it reads the document the way you want. The next three statements configure three options that are applied to document builders created by this factory object:

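factory.setIgnoringComments(true);
factory.setIgnoringElementContentWhitespace(true);
factory.setValidating(true);

Here's a closer look at these statements:

  • setIgnoringComments(true): Tells the document builder not to create nodes for comments that appear in the XML file.
  • setIgnoringElementContentWhitespace(true): Tells the document builder to discard white space that appears only between elements, such as the line breaks and indentation that make the file readable. The parser relies on the DTD to tell which white space can be ignored, so this option has an effect only when validation is turned on.
  • setValidating(true): Tells the document builder to validate the document against its DTD as the document is read.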

Creating a document builder and the document

After you set the options, you can call the newDocumentBuilder method to create a document builder:

DocumentBuilder builder = factory.newDocumentBuilder();

Here the document builder is referenced by a variable named builder.

Finally, you can create the DOM document by calling the parse method of the document builder. This method accepts an InputSource object as a parameter. Fortunately, the InputSource class has a constructor that takes a system identifier, such as a filename, and returns an input source linked to the file. So you can create the input source, parse the XML file, create a DOM document, and return the DOM document to the caller all in one statement:

return builder.parse(new InputSource(name));

Note that several of these methods throw exceptions. In particular, newDocumentBuilder throws ParserConfigurationException, and parse throws IOException and SAXException. To keep this example simple, I caught all exceptions in one catch clause and printed the exception's message to the console.
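If you'd rather not lump all three exceptions together, one possible variation is to catch each one separately. The following is only a sketch of that idea, not the version used in the rest of this chapter. The class name GetDocumentSketch and the choice of unchecked exceptions to rethrow are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class GetDocumentSketch {
    // Same steps as getDocument, but each failure gets its own message
    // and the caller never receives null.
    static Document getDocument(String name) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setIgnoringComments(true);
            factory.setIgnoringElementContentWhitespace(true);
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(name));
        } catch (ParserConfigurationException e) {
            // The factory couldn't create a builder with these options.
            throw new IllegalStateException("Parser setup failed", e);
        } catch (SAXException e) {
            // The file was read, but its contents couldn't be parsed.
            throw new IllegalArgumentException(
                name + " could not be parsed: " + e.getMessage(), e);
        } catch (IOException e) {
            // The file is missing or unreadable.
            throw new UncheckedIOException("Could not read " + name, e);
        }
    }

    public static void main(String[] args) {
        try {
            Document doc = getDocument("movies.xml");
            System.out.println(doc.getDocumentElement().getNodeName());
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The trade-off is a few more lines of code in exchange for a method that never returns null, so the caller can't stumble into a NullPointerException later.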

Using the getDocument method

By adding the getDocument method, you can create a DOM document from a file with a single statement, like this:

Document doc = getDocument("movies.xml");

Here the movies.xml file is read and a DOM document is created and assigned to the doc variable.

Also, note that you must provide three import statements to use the getDocument method:

import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;

DocumentBuilder and DocumentBuilderFactory are in the javax.xml.parsers package, Document is in org.w3c.dom, and InputSource is in org.xml.sax.

  Tip 

Remember how I said I wouldn't use SAX in this chapter? I lied. The parse method of the DocumentBuilder class uses SAX to read the XML file while it builds the DOM object.

Reading DOM Nodes

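After you have a DOM document in memory, you can easily retrieve data from the document's nodes. The DOM API is based on interfaces rather than classes, so each node of the DOM document is represented by an object that implements one or more DOM interfaces. Here's an overview of the interfaces you need to work with:

  • Document: Represents the entire document. Its getDocumentElement method returns the root element.
  • Node: The base interface for all nodes in the document. It provides methods for navigating the tree, such as getFirstChild, getNextSibling, and getChildNodes, as well as getNodeName and getNodeValue.
  • Element: Represents an element. It extends Node and adds methods for working with attributes, such as getAttribute.
  • Text: Represents the text content of an element.
  • NodeList: A collection of nodes. You use its getLength and item methods to access the nodes it contains.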

Processing elements

Assuming you use a DTD to validate the XML file when you build the document, you can usually navigate your way around the document to pick up information you need without resorting to NodeList objects. For example, here's a routine that simply counts all the Movie elements in the movies.xml file (shown earlier in Listing 5-1) after it's been parsed into a Document object named doc:

int count = 0;
Element root = doc.getDocumentElement();
Node movie = root.getFirstChild();
while (movie != null)
{
    count++;
    movie = movie.getNextSibling();
}
System.out.println("There are " + count + " movies.");

This code first calls the getDocumentElement method to get the root element and then calls the getFirstChild method to get the first child of the root element. Then it uses each child element's getNextSibling method to get the next element that's also a child of the root element.

If you run a program that contains these lines, the following line appears on the console:

There are 10 movies.

This program doesn't do anything with the Movie elements other than count them, but you soon see how to extract data from the Movie elements.

An alternative way to process all the elements in the movies.xml file is to use the getChildNodes method to return a NodeList object that contains all the elements. You can then use a for loop to access each element individually. For example, here's a snippet of code that lists the name of each element:

Element root = doc.getDocumentElement();
NodeList movies = root.getChildNodes();
for (int i = 0; i < movies.getLength(); i++)
{
    Node movie = movies.item(i);
    System.out.println(movie.getNodeName());
}

Here the item method is used in the for loop to retrieve each Movie element. If you run a program that contains these lines, ten lines with the word Movie are displayed in the console window.
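Both of these loops visit every child node of the root element. That matches the ten Movie elements only because validation lets the parser discard the white space between them. If you parse a document without a DTD, the line breaks and indentation show up as Text nodes. Here's a minimal sketch of a more defensive loop that checks each node's type. The class and method names are made up for this example, and the sample data is a string rather than the movies.xml file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ElementChildrenSketch {
    // Indented sample with no DTD, so the white space is kept as Text nodes.
    static final String SAMPLE =
        "<Movies>\n  <Movie/>\n  <Movie/>\n</Movies>";

    // Parses the text without validation and returns the root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample", e);
        }
    }

    // Counts every child node, including white-space Text nodes.
    static int countAllChildren(Element parent) {
        return parent.getChildNodes().getLength();
    }

    // Counts only the children that are elements.
    static int countElementChildren(Element parent) {
        int count = 0;
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Element root = parseRoot(SAMPLE);
        System.out.println("All child nodes: " + countAllChildren(root));
        System.out.println("Element children: " + countElementChildren(root));
    }
}
```

For this sample, the root has five child nodes (three runs of white space and two Movie elements) but only two element children.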

Getting attribute values

To get the value of an element's attribute, call the getAttribute method and pass the name of the attribute as the parameter. This returns the string value of the attribute. You can then convert this value to another type if necessary. Note that the value may include some white space, so you should run the value through the trim method to get rid of any superfluous white space.

Here's an example that gets the year attribute from each movie in the movies.xml file and determines the year of the oldest movie in the collection:

Element root = doc.getDocumentElement();
Element movie = (Element)root.getFirstChild();
int oldest = 9999;
while (movie != null)
{
    String s = movie.getAttribute("year");
    int year = Integer.parseInt(s);
    if (year < oldest)
        oldest = year;
    movie = (Element)movie.getNextSibling();
}
System.out.println("The oldest movie in the file "
    + "is from " + oldest + ".");

The year attribute is extracted with these two lines of code:

String s = movie.getAttribute("year");
int year = Integer.parseInt(s);

The first line gets the string value of the year attribute, and the second line converts it to an int.

Notice the extra casting that's done in this method. It's necessary because the movie variable has to be an Element type so you can call the getAttribute method. However, the getNextSibling method returns a Node, not an Element. As a result, the compiler doesn't let you assign the node to the movie variable unless you first cast it to an Element.
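Keep in mind that the cast fails with a ClassCastException if a sibling turns out not to be an element, which can happen if the document wasn't validated and contains white space between elements. One alternative, sketched here with made-up class and method names and inline sample data, is to ask the root element for its Movie elements by name with the getElementsByTagName method. Every node in the list it returns is an element, so the cast is safe.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OldestMovieSketch {
    // Indented sample with no DTD; the years come from the movies.xml data.
    static final String SAMPLE =
        "<Movies>\n"
        + "  <Movie year=\"1965\"><Title>The Great Race</Title></Movie>\n"
        + "  <Movie year=\"1946\"><Title>It's a Wonderful Life</Title></Movie>\n"
        + "  <Movie year=\"1977\"><Title>Star Wars</Title></Movie>\n"
        + "</Movies>";

    // Returns the smallest year attribute found on any Movie element.
    static int oldestYear(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            // Only Movie elements are returned, never white-space Text nodes.
            NodeList movies = root.getElementsByTagName("Movie");
            int oldest = Integer.MAX_VALUE;
            for (int i = 0; i < movies.getLength(); i++) {
                Element movie = (Element) movies.item(i);
                int year = Integer.parseInt(movie.getAttribute("year").trim());
                if (year < oldest) {
                    oldest = year;
                }
            }
            return oldest;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("The oldest movie in the sample is from "
            + oldestYear(SAMPLE) + ".");
    }
}
```

Note that getElementsByTagName searches all descendants, not just direct children. That's fine here because Movie elements appear only directly under the root.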

Getting child element values

You might be surprised to learn that the text content of an element is not stored with the element. Instead, it's stored in a child node of type Text. For example, consider the following XML:

<Title>The Princess Bride</Title>

This element results in two nodes in the DOM document: an Element node named Title, and a Text node that contains the text The Princess Bride.

Thus, if you have a Title element in hand, you must first get the Text node before you can get the text content. For example:

Node textElement = titleElement.getFirstChild();
String title = textElement.getNodeValue();

If you prefer to write your code a little more tersely, you can do it in a single statement like this:

String title = titleElement.getFirstChild().getNodeValue();

If you find this incantation a little tedious and you're doing a lot of it in your program, write yourself a little helper method. For example:

private static String getTextValue(Node n)
{
    return n.getFirstChild().getNodeValue();
}

Then you can get the text content for an element by calling the getTextValue method, like this:

String title = getTextValue(titleElement);

After you get the text content, you can parse it to a numeric type if you need to.
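The helper method shown above throws a NullPointerException if the element is empty, because an empty element has no first child. As an alternative, the Node interface also provides a getTextContent method that returns the text of a node's children in one call. The following sketch compares the two approaches. The class name TextContentSketch and the textOf method are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextContentSketch {
    // Parses the text and returns the root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample", e);
        }
    }

    // Returns the trimmed text of a node; an empty element yields "".
    static String textOf(Node n) {
        return n.getTextContent().trim();
    }

    public static void main(String[] args) {
        Element title = parseRoot("<Title> The Princess Bride </Title>");

        // The two-step approach from this section.
        System.out.println(title.getFirstChild().getNodeValue().trim());

        // The one-call alternative.
        System.out.println(textOf(title));

        // An empty element has no Text child, so getFirstChild returns null.
        Element empty = parseRoot("<Title/>");
        System.out.println(empty.getFirstChild() == null);
        System.out.println("[" + textOf(empty) + "]");
    }
}
```

Either approach works for the movies.xml file, because the DTD requires each Title and Price element to be present. The getTextContent version is simply more forgiving of elements that turn out to be empty.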

Putting It All Together: A Program That Lists Movies

Now that you've seen the various interfaces and classes you use to get data from an XML file, Listing 5-3 shows a complete program that reads the movies.xml file (shown earlier in Listing 5-1) and lists the title, year, and price of each movie on the console. When you run this program, the following appears on the console:

1946: It's a Wonderful Life ($14.95)
1965: The Great Race ($12.95)
1974: Young Frankenstein ($16.95)
1975: The Return of the Pink Panther ($11.95)
1977: Star Wars ($17.95)
1987: The Princess Bride ($16.95)
1989: Glory ($14.95)
1995: Apollo 13 ($19.95)
1997: The Game ($14.95)
2001: The Fellowship of the Ring ($19.95)

Listing 5-3: Reading an XML Document

import javax.xml.parsers.*;                                  // → 1
import org.xml.sax.*;
import org.w3c.dom.*;
import java.text.*;

public class ListMoviesXML
{
    private static NumberFormat cf =
        NumberFormat.getCurrencyInstance();

    public static void main(String[] args)                   // → 11
    {
        Document doc = getDocument("movies.xml");
        Element root = doc.getDocumentElement();
        Element movieElement = (Element)root.getFirstChild();
        Movie m;
        while (movieElement != null)
        {
            m = getMovie(movieElement);
            String msg = Integer.toString(m.year);
            msg += ": " + m.title;
            msg += " (" + cf.format(m.price) + ")";
            System.out.println(msg);
            movieElement =
                (Element)movieElement.getNextSibling();
        }
    }

    private static Document getDocument(String name)         // → 29
    {
        try
        {
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            factory.setIgnoringComments(true);
            factory.setIgnoringElementContentWhitespace(true);
            factory.setValidating(true);
            DocumentBuilder builder =
                factory.newDocumentBuilder();
            return builder.parse(new InputSource(name));
        }
        catch (Exception e)
        {
            System.out.println(e.getMessage());
        }
        return null;
    }

    private static Movie getMovie(Element e)                 // → 49
    {
        // get the year attribute
        String yearString = e.getAttribute("year");
        int year = Integer.parseInt(yearString);

        // get the Title element
        Element tElement = (Element)e.getFirstChild();
        String title = getTextValue(tElement).trim();

        // get the Price element
        Element pElement = (Element)tElement.getNextSibling();
        String pString = getTextValue(pElement).trim();
        double price = Double.parseDouble(pString);

        return new Movie(title, year, price);
    }

    private static String getTextValue(Node n)               // → 65
    {
        return n.getFirstChild().getNodeValue();
    }

    private static class Movie                               // → 70
    {
        public String title;
        public int year;
        public double price;

        public Movie(String title, int year, double price)
        {
            this.title = title;
            this.year = year;
            this.price = price;
        }
    }
}

Because all the code in this program is elsewhere in this chapter, the following paragraphs just provide a simple description of what each method in this program does:

1

Wow, that's a lot of packages to import. Too bad Java's designers couldn't have put all of these XML classes in one big package.

11

The main method starts by calling the getDocument method to get a Document object from the file movies.xml. Then it gets the root element and uses a while loop to spin through all the child elements, which you know to be Movie elements because the document was validated when it was parsed. As each Movie element is processed, it is passed to the getMovie method, which extracts the year attribute and the title and price elements and returns a Movie object. Then the movie is printed on the console.

29

The getDocument method accepts a filename as a parameter and returns a Document object. Before it creates the DocumentBuilder object, it sets the configuration options so that comments and white space are ignored and the XML file is validated. Because the XML file is validated, you must create a DTD file (like the file in Listing 5-2). You must also begin the XML file with a DOCTYPE declaration (such as <!DOCTYPE Movies SYSTEM "movies.dtd">).

49

The getMovie method is passed an Element object that represents a Movie element. It extracts the year attribute, gets the text value of the title element, and parses the text value of the price element to a double. It then uses these values to create a new Movie object, which is then returned to the caller.

65

The getTextValue method is simply a little helper method that gets the text content from a node. This method assumes that the node has a child node that contains the text value, so you shouldn't call this method unless you know that to be the case. (Because the XML document was validated, you do.)

70

The Movie class is a private static nested class that represents a single movie. It uses public fields to hold the title, year, and price, and provides a simple constructor that initializes these fields.
