Exchanging PostgreSQL Data with XML

XML is the wave of the future. Well, it's a wave in some future anyway. XML was designed to let you and I write applications that can exchange structured data. An XML document is a self-describing textual representation of data, often structured in a hierarchical form. In this section, I'll assume that you have some knowledge of XML, XPath queries, and XSLT stylesheets. If you aren't familiar with those technologies, read onI'll show you a few examples that should help you understand the basic concepts.

You can store XML data in a PostgreSQL database without any help from third-party software. For example, let's say that one of your distributors offers a new service to the video store that you're running. Every so often the distributor sends you an XML document that describes a number of films. A typical document is shown in Listing 25.1.

Listing 25.1. films.xml

Casablanca 1942 Murray Burnett Joan Alison Julius J. Epstein Philip G. Epstein Howard Koch Humphrey Bogart Ingrid Bergman Peter Lorre Michael Curtiz Rear Window 1954 Cornell Woolrich John Michael Hayes James Stewart Grace Kelly Raymond Burr Alfred Hitchcock The Godfather 1972 Mario Puzo Francis Ford Coppola Marlon Brando Al Pacino James Caan Robert Duvall Diane Keaton Talia Shire Francis Ford Coppola

This document (films.xml) describes three films: Casablanca, Rear Window, and The Godfather. Each description contains a name, a year (the year that the film was released), a collection of writers, a collection of leads (leading actors and actresses), and a collection of directors.

To store this document in a PostgreSQL database, you could simply INSERT the whole thing into a TEXT column. In practice, you'd probably want to split the document into separate records (one for each film) and store each description in a separate row. Let's do that. First, create a table (filminfo) that will hold the film descriptions like this:

$ psql movies Welcome to psql 8.0.0, the PostgreSQL interactive terminal. ... movies=# CREATE TABLE filminfo movies-# ( movies(# film_name VARCHAR PRIMARY KEY, movies(# description TEXT movies(# ); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "filminfo_pkey" for table "filminfo" CREATE TABLE

Now you have a container, but how do you get the XML objects into the filminfo table? The distributor has given you an XML document; you want to split that document into separate objects and then INSERT those objects into the filminfo table.

Sounds like a perfect job for XSLT (Extensible Stylesheet Language Transformations). Listing 25.2 shows an XSLT document (splitFilms.xsl) that will do the trick.

Listing 25.2. splitFilms.xsl

1 3 4 5 6 7 INSERT INTO filminfo VALUES(' 8 9 ',' 10 11 '); 12 13 14 15

splitFilms.xsl will parse through a list of films and create an INSERT command for each film that it finds. Given the first film described in films.xml (see Listing 25.1), splitFilms.xsl will produce an INSERT command that looks like this:

INSERT INTO filminfo VALUES('Casablanca',' Casablanca 1942 Murray Burnett Joan Alison Julius J. Epstein Philip G. Epstein Howard Koch Humphrey Bogart Ingrid Bergman Peter Lorre Michael Curtiz ');

You can save the INSERT commands to a text file or, better yet, just pipe the output produced by splitFilms.xsl directly into the psql command. If you're using the libxslt package, you can execute splitFilms.xsl like this:

$ xsltproc splitFilms.xsl films.xml | psql movies INSERT 846648 1 INSERT 846649 1 INSERT 846650 1

There are three films described in films.xml and psql reports three INSERT commandsthat's a good sign.

Now that you have XML documents in your database, you can use them just like any other TEXT value. You can search inside of an XML document using PostgreSQL's regular expression operators. For example, to find films starring Jimmy Stewart, you could execute the following query:

movies=# SELECT film_name FROM filminfo movies-# WHERE description ~* '.*James Stewart.*'; film_name ------------ Rear Window (1 row)

You can treat an XML document just like any other TEXT value, but you can add a number of XML-specific features by installing the xml2 contributed module.

XPath Queries

xml2 (also known as pgxml) is a small collection of functions that let you execute XPath queries against XML documents stored in a PostgreSQL database. xml2 does not turn your database into an XML databaseyou can't execute XPath queries in place of SELECT statements. Instead, xml2 lets you include XPath queries inside of SELECT (and other statements).

xml2 defines three functions that will return a single value from an XPath query:

xpath_string( document, query ) RETURNS TEXT xpath_number( document, query ) RETURNS FLOAT4 xpath_bool( document, query) RETURNS BOOL

When you call one of the xml2 query functions, you provide an XML document and an XPath query. The query function returns the object (a number, string, Boolean value, list of values, or a nodeset) identified by the XPath. If that sounds confusing, it may help to look at an example. Here's a query that extracts the year node stored within a film description:

movies=# SELECT film_name, xpath_string( description, 'year' ) FROM filminfo; film_name | xpath_string ---------------+------------- Casablanca | 1942 Rear Window | 1954 The Godfather | 1972 (3 rows)

In this case, the xpath_string() function is invoked three times (because there are three rows in the filminfo table). In each invocation, the description column acts as an XML document and 'year' is an XPath query that navigates through the document until it finds the year element.

You can also use the XPath query functions in other parts of the SELECT command, such as the WHERE clause. For example, to find all films released before 1960:

movies=# SELECT film_name FROM filminfo movies-# WHERE xpath_number( description, '/film/year' ) < 1960; film_name - Casablanca Rear Window (2 rows)

In fact, you can use the result of an XPath query as a table. (I'll show you how to do that in a moment.)

xml2 defines five functions that return multiple values from an XPath query. The first set of functions return a nodeset in the form of a TEXT string (a nodeset is a collection of XML nodes):

xpath_nodeset( document, query, topTag, itemTag ) RETURNS TEXT xpath_nodeset( document, query, itemTag ) RETURNS TEXT xpath_nodeset( document, query ) RETURNS TEXT

Use the nodeset functions when you want to extract a set of values from an XML document and you want the result to retain the XML tags present in the document. For example, to find all writers for a given film:

movies=# SELECT xpath_nodeset( description, 'writers' ) FROM filminfo movies-# WHERE film_name = 'Rear Window'; xpath_nodeset -------------- Cornell Woolrich John Michael Hayes (1 row)

If you include a topTag or itemTag, xpath_nodeset() will wrap the entire nodeset in the topTag and wrap each item inside of an <itemTag> itemTag> pair. The extra tags are useful if you are building a new XML document out of data already in the database.

To convert a nodeset into a more conventional (and often more useful form), use the xpath_list() function.

movies=# SELECT xpath_list( description, 'leads/lead') FROM filminfo movies-# WHERE film_name = 'The Godfather'; xpath_list -------------------------------------------------------------------------- Marlon Brando,Al Pacino,James Caan,Robert Duvall,Diane Keaton,Talia Shire (1 row)

xpath_list() expects two or three arguments. If you call xpath_list() with three arguments, the last argument determines the string that separates each element in the list (the default separator is ",").

The last XPath function provided by xml2 is xpath_table() and it can be somewhat confusing. xpath_table() creates a tabular result set by executing an XPath query (or a series of queries separated by "|") against a table (or view). Here's an example:

movies=# SELECT * FROM movies-# xpath_table('film_name','description','filminfo','year|leads/*','1=1') movies-# AS t(film_name text, year text, leads text); film_name | year | leads ---------------+------+--------------- Casablanca | 1942 | Humphrey Bogart Casablanca | | Ingrid Bergman Casablanca | | Peter Lorre Rear Window | 1954 | James Stewart Rear Window | | Grace Kelly Rear Window | | Raymond Burr The Godfather | 1972 | Marlon Brando The Godfather | | Al Pacino The Godfather | | James Caan The Godfather | | Robert Duvall The Godfather | | Diane Keaton The Godfather | | Talia Shire (12 rows)

The AS clause tells PostgreSQL the shape of the resulting table. (xpath_table() is defined to a return a SETOF RECORDSsince a RECORD has no predefined shape, you have to tell PostgreSQL what shape to expect.)

xpath_table() expects five arguments:

xpath_table( key, document, table, xpathQueries, condition )

xpath_table() creates a SELECT command (based on the arguments that you provide) and then executes that command. Next, xpath_table() reads through each row returned by the SELECT command and evaluates the XPath queries against the document column.

The SELECT command is constructed from the key, document, table, and condition arguments. After executing the SELECT command, xpath_table() verifies that the query returned exactly two columns (a key and a document) and then reads through each row in the result set, evaluating each XPath query that you provide. For each XPath query, xpath_table() evaluates that query and, if it returns a value, stores that value in an intermediate tuple. When the tuple is complete, xpath_table() adds its to the final result set. If you invoke xpath_table() with two or more XPath queries that return nodesets of differing sizes, the final result set will contain NULL values.

That's a rather complex description that might be better illustrated by walking through the process one step at a time.

Given the arguments

xpath_table('film_name','description','filminfo','year|leads/*','1=1')

xpath_table() starts by splitting the xpathQueries argument (year|leads/*) into individual queries. In this case, xpath_table() finds two queries: year and leads/*. Since you've supplied two queries, the result set produced by xpath_table() will contain three columns: the first column will contain the key field (film_name), the second column will contain the result of the first XPath query (year), and the third column will contain the result of the send XPath query (leads/*).

Next, xpath_table() pastes together a SELECT command that looks like this:

SELECT film_name, description FROM filminfo WHERE 1=1;

xpath_table() always selects two columns from the table: the key and the document. The condition argument is tacked on to the end of the command in the form of a WHERE clauseyou must provide a condition even if you want to process every row in the given table[1].

[1] The condition argument is just tacked onto the end of the SELECT commandyou can include any text that can legally follow the word WHERE in a SELECT command. In fact, you can force xpath_table() to join two tables by listing them both in the table argument and specifying a join in the condition argument.

Next, xpath_table() executes the SELECT command and loops through each row that makes it through the WHERE clause. xpath_table() constructs one or more tuples out of each row returned by the SELECT command. To fill in the first column in each new tuple, xpath_table() simply copies the key column (film_name) from the row returned by the SELECT command. To fill in the remaining columns, xpath_table() evaluates each XPath query against the document column. In this example, the first document (the description column) returned by the SELECT command looks like this:

Casablanca 1942 Murray Burnett Joan Alison Julius J. Epstein Philip G. Epstein Howard Koch Humphrey Bogart Ingrid Bergman Peter Lorre Michael Curtiz

The first XPath query (year) nabs a nodeset that contains a single value (1942). The second XPath query (leads/*) returns a nodeset containing three values (Humphrey Bogart, Ingrid Bergman, and Peter Lorre). For each row returned by the SELECT command, xpath_table() produces one or more tuples. The number of new tuples is determined by the largest nodeset returned by the XPath queries. In this case, the largest nodeset contains three nodes, so xpath_table() will add three tuples to the final result set. To form the new tuples, xpath_table() copies the key value ("Casablanca") into the first column of each new tuple, and then starts copying the nodes into the remaining columns. If xpath_table() runs out of nodes for a given column (and it will run out when the nodesets differ in length), it writes a NULL value into the tuple instead.

When it finishes with the first row, xpath_table() repeats the process for each of the remaining rows. When it hits the second row (Rear Window), xpath_table() again finds that the nodeset produced by the first XPath query contains a single value (1954) and the second nodeset contains three valuesthat means three more rows in the final result set. The last row returned by the SELECT command (The Godfather) produces one nodeset that contains a single value (1972) and second nodeset that contains six values, so xpath_table() adds six more rows to the final result set.

You can infer a few rules from this walk-through:

You can see that the XPath query functions are powerful but they can also be unwieldy. You can simplify the xml2 functions by wrapping them in custom-made views and functions.

For example, you can easily create a view that uses xpath_table() to extract leading actors and actresses like this:

movies=# CREATE VIEW film_leads AS movies-# SELECT * FROM movies-# xpath_table('film_name','description','filminfo','leads/*','1=1') movies-# AS t(name text, leads text); CREATE VIEW test=# SELECT * FROM film_leads WHERE name = 'Casablanca'; name | leads ------------+---------------- Casablanca | Humphrey Bogart Casablanca | Ingrid Bergman Casablanca | Peter Lorre (3 rows)

That's much better. Of course, you can treat a view built from xpath_table() just like any other table or view. For example, you could join the film_leads view and the tapes table to produce of a list of all leading actors and actresses starring in the films that you have in stock:

movies=# SELECT DISTINCT ON( title, leads ) tape_id, title, leads movies-# FROM tapes, film_leads WHERE name = title; tape_id | title | leads ----------+---------------+---------------- MC-68873 | Casablanca | Humphrey Bogart MC-68873 | Casablanca | Ingrid Bergman MC-68873 | Casablanca | Peter Lorre AH-54706 | Rear Window | Grace Kelly AH-54706 | Rear Window | James Stewart AH-54706 | Rear Window | Raymond Burr AB-67472 | The Godfather | Al Pacino AB-67472 | The Godfather | Diane Keaton AB-67472 | The Godfather | James Caan AB-67472 | The Godfather | Marlon Brando AB-67472 | The Godfather | Robert Duvall AB-67472 | The Godfather | Talia Shire (12 rows)

(the DISTINCT ON clause weeds out any duplicates in case you have multiple copies of the same video, each with a different tape_id.)

You can also simplify the XPath query functions (xpath_string(), xpath nodeset(), xpath_list(), and so on) by wrapping them in more convenient forms. In fact, you don't even have to resort to a procedural language (such as PL/pgSQL or Java)you can write the wrapper functions in SQL. For example, the script shown in Listing 25.3 creates a function that returns a comma-separated list of the leads actors and actresses starring in a given film.

Listing 25.3. starring.sql

Filename: starring.sql CREATE FUNCTION starring( title TEXT ) RETURNS TEXT AS $$ SELECT xpath_list( description, 'leads/*') FROM filminfo WHERE film_name = $1 $$ LANGUAGE 'SQL';

You can call this function in the select-list part of a SELECT command, in the WHERE clause, or in both parts:

movies=# SELECT tape_id, title, starring( title ) FROM tapes movies-# WHERE starring( title ) LIKE '%James Stewart%'; tape_id | title | starring ----------+-------------+-------------------------------------- AH-54706 | Rear Window | James Stewart,Grace Kelly,Raymond Burr (1 row)

That query returns the leading actors and actresses who star in any video that you stock that features James Stewart.

Converting XML Data with XSLT

At the beginning of the previous section, I showed you how to use XSLT to translate an XML document into a sequence of INSERT commands. XSLT can convert any XML document into (just about) any other form.

The xml2 contributed module includes an XSLT processor that you can invoke from within the PostgreSQL server. (Actually, xml2 includes an interface to the libxslt package.) That means that you can use XSLT to convert XML documents stored inside of your database, without ever leaving the comfort of your favorite PostgreSQL client.

XSLT is often used to produce HTML web pages from XML documents, and in this section, I'll show you how to turn the XML documents stored in the filminfo table into user-friendly web pages.

To convert an XML document using an XSLT stylesheet, call the xslt_process() function. xslt_process() expects two arguments:

xslt_process( document TEXT, stylesheet TEXT ) RETURNS TEXT

It may seem obvious, but it's worth pointing out that you can provide either argument as a TEXT literal, as an expression that evaluates to a TEXT value, or as a reference to a column in the database. The most convenient way to use xslt_process() is to store both the XML document and the XSLT stylesheet in a PostgreSQL table (probably in two separate tables).

To start this exercise, I'll create a table that will hold XSLT stylesheets:

movies=# CREATE TABLE transforms( name VARCHAR, stylesheet TEXT ); CREATE TABLE

Next, I'll create a function named stylesheet() that will retrieve a stylesheet from the transforms table, given the name of the desired styelsheet:

movies=# CREATE FUNCTION stylesheet( name VARCHAR ) RETURNS TEXT AS movies-# $$ movies$# SELECT stylesheet FROM transforms WHERE name = $1 movies$# $$ LANGUAGE 'SQL'; CREATE FUNCTION

At this point, the movies database stores XML documents inside of the filminfo table and stores XSLT stylesheets in the TRansforms table. Listing 25.4 shows a script that will add a new stylesheet to the TRansforms table.

Listing 25.4. movieOfTheWeek.sql

INSERT INTO transforms VALUES( 'movieOfTheWeek', $$ Movie of the week

This week's movie

 

Starring:

Directed by:

Released in:

$$ );

 

When you execute this script, you're storing an XSLT stylesheet in the transforms table. I won't walk you through the workings of the stylesheet; XSLT is a powerful but unfriendly language and I can't really do it justice in this book, so pick up a good XSLT book if you need more information. I will point out a very important rule, thoughthe XSLT processor included with xml2 examines the first character in the stylesheet to decide whether you've given it a real stylesheet or a reference to a remote stylesheet. If the first character is a "<", the XSLT processor assumes that the string is the stylesheet. If the first character is anything other than a "<", the XSLT processor assumes that you've given it a URI that identifies the real stylesheet (that is, the string is treated as if it were something like "http://example.com/the-real-stylesheet.xsl"). When you INSERT (or UPDATE) a stylesheet into a table, make sure that the opening "<" immediately follows the quote character(s)you don't want a newline between the "$$" and the "<" or you'll spend a great deal of time increasing your four-letter vocabulary.

To apply the stylesheet shown in Listing 25.4, choose a film that you want to feature (say, Casablanca) and execute the following commands:

movies=# Showing only tuples. movies=# o movieOfTheWeek.html movies=# SELECT xslt_process( description, stylesheet( 'movieOfTheWeek' )) movies-# FROM filminfo movies-# WHERE film_name = 'Casablanca'; movies=# o movies=# Tuples only is off

 

The first command ( ) turns on psql's "tuple-only mode." In this mode, psql omits the column headers (and command responses) that you normally see when you execute a queryyou only see the data values returned by each query. The second command (o movieOfTheWeek.html) tells psql to create a new file named movieOfTheWeek.html (in the current directory) and write command results to that file. Once you've executed those two commands, psql will write raw query results (tuples only) to movieOfTheWeek.html.

As I mentioned earlier, xslt_process() expects two arguments. The first argument tells xslt_process() to convert the XML document found in filminfo's description column. The second argument specifies the stylesheet that xslt_process() will use to control the conversion process. The SELECT command calls the stylesheet() function I defined earlier (at the beginning of this section) to retrieve the movieOfTheWeek stylesheet.

When the SELECT command completes, the o command closes movieOfTheWeek .html and redirects psql output to your screen. The command turns tuple-only mode off again so that you'll see column headers (and command tags).

The overall result here is that you've produced an HTML document (movieOfTheWeek.html) by converting the an XML document using an XSLT stylesheet. If you view movieOfTheWeek.html in a web browser, you'll see a page similar to that shown in Figure 25.1.

Figure 25.1. movieOfTheWeek.html.

Using Full text Search

Категории