Beginning XML Databases (Wrox Beginning Guides)
A Document Type Definition (DTD) defines the building blocks of an XML document. In other words, a DTD contains the structural definition for the data in an XML document. It can also serve as a mapping structure between the mix of data and metadata in an XML document and the table structure (the metadata) of a relational database model. A DTD can be used to validate the structure of an XML document, so errors in XML data, XML metadata, or XML document structure can be detected and repaired. For example, passing data between two relational databases can be managed nicely using DTDs. The two databases can have different table structures, and can even run on database engines from different vendors. A DTD can be used at both ends to validate XML document structure before newly transferred XML data is added to a relational database, which helps to avoid data errors in the relational databases.
In its simplest form, a DTD begins with a document type declaration, as shown here:
<!DOCTYPE root [ <all the elements in the document> ]>
Other parts of a DTD declaration are elements, attributes, entities, PCDATA, and CDATA:
- Elements: Define the elements of an XML document.
- Attributes: Attributes and attribute values, as part of elements.
- Entities: An entity is essentially a reference, such as an escape sequence like &amp; or an included DTD definition, that has been given a name and is referenced elsewhere by that name, as in &entity;.
- PCDATA: Parsed character data. This is the text value of an element, which the XML parser parses, so any markup and entity references in it are recognized.
- CDATA: Character data, which is not to be parsed by an XML parser.
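As a minimal sketch of how these building blocks can appear together, consider the small document below. The note and body elements, the lang attribute, and the author entity are hypothetical names chosen only for illustration; they are not part of the demographics example.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!-- Element: a note holds exactly one body -->
  <!ELEMENT note (body)>
  <!-- Attribute: an optional lang attribute on note (CDATA is its attribute type) -->
  <!ATTLIST note lang CDATA #IMPLIED>
  <!-- PCDATA: body holds parsed text, so entity references in it are expanded -->
  <!ELEMENT body (#PCDATA)>
  <!-- Entity: a named piece of replacement text -->
  <!ENTITY author "A. Writer">
]>
<note lang="en">
  <body>Written by &author;. <![CDATA[Text such as <b> & is not parsed here.]]></body>
</note>
```

The CDATA section in the last line is passed through as literal text, whereas &author; in the same element is expanded by the parser.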
For example, regions containing countries, which in turn contain cities, might look something like this:
<?xml version="1.0"?>
<!DOCTYPE demographics [
  <!ELEMENT demographics (region*)>
  <!ELEMENT region (name, area, country*)>
  <!ELEMENT country (name, population, area, city*)>
  <!ELEMENT city (name, population)>
  <!-- name, population, and area are shared by several parents but declared only once -->
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT population (#PCDATA)>
  <!ELEMENT area (#PCDATA)>
]>
<demographics>
  <region>
    <name>Europe</name>
    <area>4583335</area>
    ...
    <country>
      <name>Belgium</name>
      <population>10000000</population>
      <area>30230</area>
      <city>
        <name>Antwerp</name>
        <population>1125000</population>
      </city>
      <city>
        <name>Brussels</name>
        <population>1875000</population>
      </city>
    </country>
    ...
  <region> ... </region>
  ...
</demographics>
The preceding example is easy to understand. The DTD definition simply creates a formal structure for the XML demographics data. The only new factor in the preceding script is the use of the asterisk (*) character. The * character indicates that zero or more of the indicated elements are included within the parent element. For example, this indicates that the <demographics> element contains zero or more <region> elements:
<!ELEMENT demographics (region*)>
Do not omit the space character between element name and content, as in demographics (region...
The same applies to countries within regions, and to cities within countries:
<!ELEMENT region (name, area, country*)>
<!ELEMENT country (name, population, area, city*)>
The DTD definition can also be stored in a separate file and referenced from the DOCTYPE declaration. The externally stored DTD file, called something like demographics.dtd, would look like this:
<!ELEMENT demographics (region*)>
<!ELEMENT region (name, area, country*)>
<!ELEMENT country (name, population, area, city*)>
<!ELEMENT city (name, population)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT population (#PCDATA)>
<!ELEMENT area (#PCDATA)>
And the XML document, including the DTD definition from the file called demographics.dtd, would look like this:
<?xml version="1.0"?>
<!DOCTYPE demographics SYSTEM "demographics.dtd">
<demographics>
  <region>
    <name>Europe</name>
    <area>4583335</area>
    ...
    <country>
      <name>Belgium</name>
      <population>10000000</population>
      <area>30230</area>
      <city>
        <name>Antwerp</name>
        <population>1125000</population>
      </city>
      <city>
        <name>Brussels</name>
        <population>1875000</population>
      </city>
    </country>
    ...
  <region> ... </region>
  ...
</demographics>
Opening the XML document in a browser does not show you much other than what appears in Figure 13-1.
DTD Elements
The basic syntax for declaring a DTD element is as follows:
<!ELEMENT element { category | (content [, ...]) }>
DTD Element Categories
An element can contain a category of data, such as EMPTY, or a subset of other elements. A completely empty element can be defined as follows:
<!ELEMENT element EMPTY>
The EMPTY keyword applies to something such as an HTML tag that does not require separate opening and closing tags, like <HR/>. The declaration itself uses the bare element name:
<!ELEMENT HR EMPTY>
or to an XML tag containing no data content, such as <year/>:
<!ELEMENT year EMPTY>
Any combination of anything that is parseable by XML can be defined using the ANY keyword:
<!ELEMENT element ANY>
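The following sketch shows both categories in one small document. The page and title element names are assumptions made for illustration only.

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
  <!-- ANY: page may hold text and any declared elements, in any order -->
  <!ELEMENT page ANY>
  <!-- EMPTY: HR may hold nothing at all, not even whitespace -->
  <!ELEMENT HR EMPTY>
  <!ELEMENT title (#PCDATA)>
]>
<page>
  <title>Demographics</title>
  <HR/>
  Free text is also allowed inside an element declared as ANY.
  <HR/>
</page>
```

Note that ANY still requires every child element that appears to be declared somewhere in the DTD.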
DTD Element Content
So, that's the basic syntax for DTD element definitions. Elements can also contain what is called content, as opposed to just a category item. That content can be a single item, such as a single string value as defined by PCDATA:
<!ELEMENT element (#PCDATA)>
Content can also be a sequence of child elements, where an element contains a subtree of other elements. The syntax is as follows:
<!ELEMENT element (child [, ... ])>
For example, this <country> element contains a list of three subset elements (name, population, and area):
<!ELEMENT country (name, population, area)>
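Each child element must be declared as well. In the XML document, the children must appear in the same sequence as they are listed in the parent's content. The order of the declarations themselves does not matter, although it is conventional to place them immediately after the parent element definition: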
<!ELEMENT name (#PCDATA)> <!ELEMENT population (#PCDATA)> <!ELEMENT area (#PCDATA)>
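As a sketch of what a sequence means for the XML data, the document below follows the order given in the content of the country element. The values are taken from the Belgium example earlier in this section.

```xml
<?xml version="1.0"?>
<!DOCTYPE country [
  <!ELEMENT country (name, population, area)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT population (#PCDATA)>
  <!ELEMENT area (#PCDATA)>
]>
<!-- Valid: name, then population, then area, each exactly once -->
<country>
  <name>Belgium</name>
  <population>10000000</population>
  <area>30230</area>
</country>
<!-- Swapping population and area in the data, or leaving one out, would fail validation -->
```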
Content can also be a set of choices, which are separated by a | character (sometimes called a pipe or an OR operator):
<!ELEMENT element (child1 | child2)>
The preceding code means the parent element can contain either of the child elements. This even applies to sequences, as in the following syntax example:
<!ELEMENT element (childA, childB, (child1 | child2))>
In the following example, the country must contain either a collection of languages or a list of occupations, but not both:
<!ELEMENT country (name, population, area, (languages | occupations))>
<!ELEMENT population (#PCDATA)>
<!ELEMENT area (#PCDATA)>
<!ELEMENT languages (name, male, female)>
<!ELEMENT occupations (name, male, female)>
<!-- name, male, and female are shared by several parents but declared only once -->
<!ELEMENT name (#PCDATA)>
<!ELEMENT male (#PCDATA)>
<!ELEMENT female (#PCDATA)>
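To see how a choice plays out in the data, here is a simplified sketch in which each country carries one of the two alternatives. The countries wrapper element and the text placed inside languages and occupations are illustrative placeholders, not part of the demographics data.

```xml
<?xml version="1.0"?>
<!DOCTYPE countries [
  <!ELEMENT countries (country*)>
  <!-- Exactly one of languages or occupations must follow name -->
  <!ELEMENT country (name, (languages | occupations))>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT languages (#PCDATA)>
  <!ELEMENT occupations (#PCDATA)>
]>
<countries>
  <country>
    <name>Algeria</name>
    <languages>placeholder language list</languages>
  </country>
  <country>
    <name>Belgium</name>
    <occupations>placeholder occupation list</occupations>
  </country>
</countries>
```

A country holding both children, or neither, would be rejected by a validating parser. To make the choice optional, a ? could be added after the closing parenthesis of the group.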
Content can even be mixed content. However, mixed content in a DTD is severely restricted to a single form: one choice group, repeated zero or more times, in which each occurrence can be either a text item or one of the listed elements. #PCDATA cannot be placed in a comma-separated sequence, so the following example causes an error because it lists #PCDATA and two subelements as three separate items in a sequence:
<?xml version="1.0"?>
<!DOCTYPE demographics [
  <!ELEMENT demographics (region)>
  <!ELEMENT region (#PCDATA, population, area)>
]>
<demographics>
  <region>Africa
    <population>789548670</population>
    <area>26780325</area>
  </region>
</demographics>
The only way that mixed content can be defined with a DTD is by placing all of the items into the same repeatable choice group, as in the following example, which allows text and <B>, <I>, and <P> tags within the description string text value:
<?xml version="1.0"?>
<!DOCTYPE demographics [
  <!ELEMENT demographics (region)>
  <!ELEMENT region (name, population, area, description)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT population (#PCDATA)>
  <!ELEMENT area (#PCDATA)>
  <!ELEMENT description (#PCDATA | B | I | P)*>
  <!ELEMENT P (#PCDATA | B | I)*>
  <!ELEMENT B (#PCDATA | I)*>
  <!ELEMENT I (#PCDATA)>
]>
<demographics>
  <region>
    <name>Africa</name>
    <population>789548670</population>
    <area>26780325</area>
    <description>
      <P>This region has a dense population in <B>high</B> rainfall areas and is sparsely populated in very <I>low</I> rainfall areas.</P><P>Africa is one of the largest regions of the world, when measured in <B><I>square miles</I></B>.</P>
    </description>
  </region>
</demographics>
Note the use of the asterisk (*) character in the preceding script. This is required because it denotes that there can be zero or more occurrences of any of the listed options (that is, #PCDATA, B, I, and P). How often an item may occur is called cardinality, which I will deal with in the next section.
The result is shown in Figure 13-2.
DTD Element Cardinality
Cardinality determines how many times an item or element can occur within a specific content layer. There are four DTD cardinality syntax specifiers:
- * : Zero or more times.
- + : One or more times (not zero). Thus a minimum of once.
- ? : Zero or one time. That is, once or not at all.
- [none] : Must be included once and only once.
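As a quick sketch, all four specifiers can be combined in a single content definition. The capital element is a hypothetical addition used only to illustrate the ? specifier.

```xml
<?xml version="1.0"?>
<!DOCTYPE country [
  <!--
    name      : exactly once (no specifier)
    capital?  : at most once, may be omitted
    language+ : at least once
    city*     : any number of times, including none
  -->
  <!ELEMENT country (name, capital?, language+, city*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT capital (#PCDATA)>
  <!ELEMENT language (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
]>
<!-- Valid: capital is omitted, language occurs twice, city occurs twice -->
<country>
  <name>Belgium</name>
  <language>Dutch</language>
  <language>French</language>
  <city>Antwerp</city>
  <city>Brussels</city>
</country>
```

Removing both language elements would make this document invalid, whereas removing both city elements would not.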
In the following example, there can be zero or more <language> elements contained within each <languages> element:
<?xml version="1.0"?>
<!DOCTYPE demographics [
  <!ELEMENT demographics (region)>
  <!ELEMENT region (name, population, area, country)>
  <!ELEMENT country (name, languages)>
  <!ELEMENT languages (language*)>
  <!ELEMENT language (name, male, female)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT population (#PCDATA)>
  <!ELEMENT area (#PCDATA)>
  <!ELEMENT male (#PCDATA)>
  <!ELEMENT female (#PCDATA)>
]>
<demographics>
  <region>
    <name>Africa</name>
    <population>789548670</population>
    <area>26780325</area>
    <country>
      <name>Algeria</name>
      <languages>
        <language>
          <name>French</name>
          <male>37500</male>
          <female>40100</female>
        </language>
        <language>
          <name>Berbere</name>
          <male>1123200</male>
          <female>1144100</female>
        </language>
        <language>
          <name>Arabic</name>
          <male>4908100</male>
          <female>4826000</female>
        </language>
      </languages>
    </country>
  </region>
</demographics>
The result is shown in Figure 13-3.
Now you can change the preceding example, removing the <languages> elements from the XML document altogether:
<?xml version="1.0"?>
<!DOCTYPE demographics [
  <!ELEMENT demographics (region)>
  <!ELEMENT region (name, population, area, country)>
  <!ELEMENT country (name, language*)>
  <!ELEMENT language (name, male, female)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT population (#PCDATA)>
  <!ELEMENT area (#PCDATA)>
  <!ELEMENT male (#PCDATA)>
  <!ELEMENT female (#PCDATA)>
]>
<demographics>
  <region>
    <name>Africa</name>
    <population>789548670</population>
    <area>26780325</area>
    <country>
      <name>Algeria</name>
      <language>
        <name>French</name>
        <male>37500</male>
        <female>40100</female>
      </language>
      <language>
        <name>Berbere</name>
        <male>1123200</male>
        <female>1144100</female>
      </language>
      <language>
        <name>Arabic</name>
        <male>4908100</male>
        <female>4826000</female>
      </language>
    </country>
  </region>
</demographics>
See how the asterisk (*) character has been moved from the content of the collection parent <languages> (now no longer in existence) to the <language> item in the content of the <country> element, which now holds the collection of <language> elements directly. The result is shown in Figure 13-4.
DTD Attributes
Attributes are defined in a DTD, for the elements of an XML document, using what is called an ATTLIST declaration. The syntax is as follows:
<!ATTLIST element attribute type default>
For example, this definition:
<!ATTLIST region id CDATA "1">
represents this piece of XML:
<region id="1">
And similarly, this definition containing multiple attributes:
<!ATTLIST year year CDATA "1950"> <!ATTLIST year population_id CDATA "12009"> <!ATTLIST year population CDATA "8892718">
represents this piece of XML:
<year year="1950" population_id="12009" population="8892718">
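As a sketch of an equivalent form, a single ATTLIST declaration can list several attributes for the same element, which avoids repeating the element name. The year element is declared EMPTY here only to keep the example self-contained.

```xml
<?xml version="1.0"?>
<!DOCTYPE year [
  <!ELEMENT year EMPTY>
  <!-- One ATTLIST declaration listing three attributes, each with a type and a default -->
  <!ATTLIST year
    year          CDATA "1950"
    population_id CDATA "12009"
    population    CDATA "8892718">
]>
<!-- No attributes are written here, so a parser that reads the DTD supplies all three defaults -->
<year/>
```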
Attribute Types
Attributes can be defined as being of various different types, as shown in Figure 13-5.
The following example shows an enumerated list of permitted values, separated by | characters. The year attribute in the following example can be set to any one of the years 2003, 2004, 2005, or 2006, and defaults to 2006 when omitted:
<!ATTLIST year year (2003 | 2004 | 2005 | 2006) "2006">
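Beyond CDATA and enumerations, a DTD offers several other attribute types. The sketch below illustrates a few of the common ones (ID, IDREF, and NMTOKEN); the attribute names region and status, and the identifier r1, are hypothetical and are not taken from the demographics data.

```xml
<?xml version="1.0"?>
<!DOCTYPE demographics [
  <!ELEMENT demographics (region*, country*)>
  <!ELEMENT region EMPTY>
  <!ELEMENT country EMPTY>
  <!-- ID: the value must be unique within the document and must be a valid XML name -->
  <!ATTLIST region id ID #REQUIRED>
  <!-- IDREF: the value must match the ID of some element in the same document -->
  <!ATTLIST country region IDREF #REQUIRED>
  <!-- NMTOKEN: a single name token containing no whitespace -->
  <!ATTLIST country code NMTOKEN #IMPLIED>
  <!-- Enumeration: the value must be one of the listed tokens -->
  <!ATTLIST country status (current | historical) "current">
]>
<demographics>
  <region id="r1"/>
  <country region="r1" code="AG"/>
  <country region="r1" code="BE" status="historical"/>
</demographics>
```

Because an ID value must be a valid XML name, a purely numeric identifier such as 1 would not be accepted for the id attribute here; that is one reason to keep numeric keys as CDATA.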
Attribute Defaults
Attributes can be assigned default values in their DTD definition, as shown in Figure 13-6.
Let's use the previous example once again. In the following definition, all three attributes have default values: year defaults to 1950, population_id to 12009, and population to 8892718:
<!ATTLIST year year CDATA "1950"> <!ATTLIST year population_id CDATA "12009"> <!ATTLIST year population CDATA "8892718">
A setting of #IMPLIED makes an attribute optional: an element is not required to carry it, and no default value is supplied. Thus, the preceding example can be changed as shown below:
<!ATTLIST year year CDATA #IMPLIED> <!ATTLIST year population_id CDATA #IMPLIED> <!ATTLIST year population CDATA #IMPLIED>
The preceding DTD definition implies that attributes are not necessarily required for a specific element, and so this is valid XML:
<year year="1950" population_id="12009" population="8892718">
And this is valid:
<year year="1950" population="8892718">
And even this is valid, too:
<year>
Now if you change the default value to #REQUIRED, as shown here:
<!ATTLIST year year CDATA #REQUIRED> <!ATTLIST year population_id CDATA #REQUIRED> <!ATTLIST year population CDATA #REQUIRED>
Then this:
<year year="1950" population="8892718">
and the example that follows are now both invalid for this XML document because none of the attributes can be omitted:
<year>
Now if you change the default values again as follows, the year must always be 2006, the population_id value is required, and the population number is not absolutely required:
<!ATTLIST year year CDATA #FIXED "2006"> <!ATTLIST year population_id CDATA #REQUIRED> <!ATTLIST year population CDATA #IMPLIED>
So this is valid where the population_id is required but can be anything, and the year must be 2006:
<year year="2006" population_id="12009">
In the preceding example, the population figure is optional.
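The three default settings can be tried together in a small self-contained sketch. The population wrapper and the EMPTY declaration for year are simplifications made for this illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE population [
  <!ELEMENT population (year*)>
  <!ELEMENT year EMPTY>
  <!ATTLIST year year CDATA #FIXED "2006">
  <!ATTLIST year population_id CDATA #REQUIRED>
  <!ATTLIST year population CDATA #IMPLIED>
]>
<population>
  <!-- Valid: year is present and equals the fixed value -->
  <year year="2006" population_id="413"/>
  <!-- Valid: year is omitted, so the parser supplies the fixed value 2006 -->
  <year population_id="413"/>
  <!-- Invalid if uncommented, because year differs from the fixed value:
  <year year="2005" population_id="412"/>
  -->
</population>
```

A #FIXED attribute therefore behaves like a default that cannot be overridden, while omitting population_id would fail validation in every case.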
Now you can enhance the definition for a small section of the demographics.xml database, adding some attributes:
<?xml version="1.0"?>
<!DOCTYPE demographics [
  <!ELEMENT demographics (region*)>
  <!ELEMENT region (name, population, area, country*)>
  <!ATTLIST region id CDATA #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!-- population is declared once, as mixed content: it holds text inside a region and year elements inside a country -->
  <!ELEMENT population (#PCDATA | year)*>
  <!ELEMENT area (#PCDATA)>
  <!ELEMENT country (name, population)>
  <!ATTLIST country id CDATA #REQUIRED>
  <!ATTLIST country code CDATA #IMPLIED>
  <!ATTLIST year year (2003 | 2004 | 2005 | 2006 | 2010 | 2020) "2006">
  <!ATTLIST year population_id CDATA #REQUIRED>
  <!ATTLIST year population CDATA #IMPLIED>
  <!ELEMENT year (births_per_1000?, deaths_per_1000?, growth_rate?)>
  <!ELEMENT births_per_1000 (#PCDATA)>
  <!ELEMENT deaths_per_1000 (#PCDATA)>
  <!ELEMENT growth_rate (#PCDATA)>
]>
<demographics>
  <region id="1">
    <name>Africa</name>
    <population>789548670</population>
    <area>26780325</area>
    <country id="1" code="AG">
      <name>Algeria</name>
      <population>
        <year year="2003" population_id="410" population="31713719">
          <births_per_1000>18.34</births_per_1000>
          <deaths_per_1000>4.63</deaths_per_1000>
          <natural_increase_percent>1.371</natural_increase_percent>
          <growth_rate>1.329</growth_rate>
        </year>
        <year year="2004" population_id="411" population="32129324">
          <births_per_1000>17.76</births_per_1000>
          <deaths_per_1000>4.61</deaths_per_1000>
          <natural_increase_percent>1.315</natural_increase_percent>
          <growth_rate>1.275</growth_rate>
        </year>
        <year year="2005" population_id="412" population="32531853">
          <births_per_1000>17.13</births_per_1000>
          <deaths_per_1000>4.6</deaths_per_1000>
          <natural_increase_percent>1.253</natural_increase_percent>
          <growth_rate>1.216</growth_rate>
        </year>
        <year year="2006" population_id="413"/>
        <year year="2010" population_id="417"/>
        <year year="2020" population_id="427"/>
      </population>
    </country>
  </region>
</demographics>
The result of the preceding DTD definition and XML data is shown in Figure 13-7.
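The only oddity about Figure 13-7 is that I removed the <natural_increase_percent> element from the DTD definition, but not from the XML data. Internet Explorer 6.0 did not produce an error, most likely because the browser checks only that the document is well formed and does not validate it against the DTD by default. A validating parser would report the undeclared element.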
DTD Entities
An entity works like a named variable holding replacement text, and can be built-in or custom made. The syntax for declaring an entity is as follows:
<!ENTITY entity "value">
Built-In Entities and ASCII Code Character Entities
Built-in entities are the predefined entity references usable in any XML document. Each one is an escape sequence that outputs a single character which XML would otherwise interpret as markup. The built-in entities are as follows:
- &amp; : Displays an & character.
- &lt; : Displays a < character.
- &gt; : Displays a > character.
- &apos; : Displays a single quotation character.
- &quot; : Displays a double quotation character.
An ASCII code character entity, as with XML and HTML, allows you to use a numeric code to display a character. In XML the number is a Unicode code point, which matches the ASCII code for the values 0 through 127. For example, &#32; will display a space character, and &#163; will display the currency symbol for British Pounds (£).
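The sketch below shows the built-in entities and numeric character references in use. The notes and note elements and the label attribute are hypothetical names used only for this illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE notes [
  <!ELEMENT notes (note*)>
  <!ELEMENT note (#PCDATA)>
  <!ATTLIST note label CDATA #IMPLIED>
]>
<notes>
  <!-- &lt;, &amp;, and &gt; keep the parser from treating these characters as markup -->
  <note>births &lt; deaths &amp; growth &gt; 0</note>
  <!-- &quot; allows a double quotation character inside a double-quoted attribute value -->
  <note label="the &quot;Africa&quot; region">It&apos;s a region</note>
  <!-- Decimal and hexadecimal references to the same character, the pound sign -->
  <note>Price: &#163;10 or &#xA3;10</note>
</notes>
```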
Custom and Parameter Entities
A custom entity is one created to allow repeated inclusion of a piece of text. In the example that follows, the values of some elements (births_per_1000, growth_rate, and deaths_per_1000) are unknown in places. Each unknown value is replaced by a reference to the unknown entity, which expands to the entity's string value:
<?xml version="1.0"?>
<!DOCTYPE demographics [
  <!ELEMENT demographics (region*)>
  <!ELEMENT region (name, population, area, country*)>
  <!ATTLIST region id CDATA #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!-- population is declared once, as mixed content: it holds text inside a region and year elements inside a country -->
  <!ELEMENT population (#PCDATA | year)*>
  <!ELEMENT area (#PCDATA)>
  <!ELEMENT country (name, population)>
  <!ATTLIST country id CDATA #REQUIRED>
  <!ATTLIST country code CDATA #IMPLIED>
  <!ATTLIST year year (2003 | 2004 | 2005 | 2006 | 2010 | 2020) "2006">
  <!ATTLIST year population_id CDATA #REQUIRED>
  <!ATTLIST year population CDATA #IMPLIED>
  <!ELEMENT year (births_per_1000?, deaths_per_1000?, growth_rate?)>
  <!ELEMENT births_per_1000 (#PCDATA)>
  <!ELEMENT deaths_per_1000 (#PCDATA)>
  <!ELEMENT growth_rate (#PCDATA)>
  <!ENTITY unknown "No value available">
]>
<demographics>
  <region id="1">
    <name>Africa</name>
    <population>789548670</population>
    <area>26780325</area>
    <country id="1" code="AG">
      <name>Algeria</name>
      <population>
        <year year="2003" population_id="410" population="31713719">
          <births_per_1000>&unknown;</births_per_1000>
          <deaths_per_1000>4.63</deaths_per_1000>
          <natural_increase_percent>1.371</natural_increase_percent>
          <growth_rate>&unknown;</growth_rate>
        </year>
        <year year="2004" population_id="411" population="32129324">
          <births_per_1000>17.76</births_per_1000>
          <deaths_per_1000>&unknown;</deaths_per_1000>
          <natural_increase_percent>1.315</natural_increase_percent>
          <growth_rate>1.275</growth_rate>
        </year>
        <year year="2005" population_id="412" population="32531853">
          <births_per_1000>17.13</births_per_1000>
          <deaths_per_1000>4.6</deaths_per_1000>
          <natural_increase_percent>1.253</natural_increase_percent>
          <growth_rate>1.216</growth_rate>
        </year>
      </population>
    </country>
  </region>
</demographics>
The result is shown in Figure 13-8.
While custom entities can be referenced within XML data, parameter entities are used only within the DTD itself. Parameter entities allow removal of repetition in DTD declarations. For example, you can change the XML data and then adapt the DTD as shown in the script that follows. This script presents you with a lot of duplication:
<?xml version="1.0"?>
<!DOCTYPE demographics [
  <!ELEMENT demographics (region*)>
  <!ELEMENT region (country*)>
  <!ATTLIST region id CDATA #REQUIRED>
  <!ATTLIST region name CDATA #REQUIRED>
  <!ATTLIST region population CDATA #IMPLIED>
  <!ATTLIST region area CDATA #IMPLIED>
  <!ELEMENT country (city*)>
  <!ATTLIST country id CDATA #REQUIRED>
  <!ATTLIST country name CDATA #REQUIRED>
  <!ATTLIST country population CDATA #IMPLIED>
  <!ATTLIST country area CDATA #IMPLIED>
  <!-- code is used on country in the data below, so it is declared as well -->
  <!ATTLIST country code CDATA #IMPLIED>
  <!ELEMENT city EMPTY>
  <!ATTLIST city id CDATA #REQUIRED>
  <!ATTLIST city name CDATA #REQUIRED>
  <!ATTLIST city population CDATA #IMPLIED>
  <!ATTLIST city area CDATA #IMPLIED>
]>
<demographics>
  <region id="1" name="Africa" population="789548670" area="26780325">
    <country id="1" name="Algeria" population="" area="2381741" code="AG">
      <city id="307" name="Oran" population="1200000" area=""/>
      <city id="130" name="Algiers" population="4100000" area=""/>
    </country>
  </region>
  <region id="6" name="Europe" population="488674441" area="4583335">
    <country id="65" name="Austria" population="" area="82730" code="AU">
      <city id="205" name="Vienna" population="1875000" area=""/>
    </country>
    <country id="66" name="Belgium" population="" area="30230" code="BE">
      <city id="315" name="Antwerp" population="1125000" area=""/>
      <city id="206" name="Brussels" population="1875000" area=""/>
    </country>
  </region>
</demographics>
You can use parameter entities to remove the duplication from the DTD, as in the example that follows:
<?xml version="1.0"?>
<!DOCTYPE demographics SYSTEM "fig1309.dtd">
<demographics>
  <region id="1" name="Africa" population="789548670" area="26780325">
    <country id="1" name="Algeria" population="" area="2381741" code="AG">
      <city id="307" name="Oran" population="1200000" area=""/>
      <city id="130" name="Algiers" population="4100000" area=""/>
    </country>
  </region>
  <region id="6" name="Europe" population="488674441" area="4583335">
    <country id="65" name="Austria" population="" area="82730" code="AU">
      <city id="205" name="Vienna" population="1875000" area=""/>
    </country>
    <country id="66" name="Belgium" population="" area="30230" code="BE">
      <city id="315" name="Antwerp" population="1125000" area=""/>
      <city id="206" name="Brussels" population="1875000" area=""/>
    </country>
  </region>
</demographics>
Additionally, an externally stored DTD file must be used here. Within the internal DTD subset (a DTD placed in the same file as the XML data), a parameter entity reference is not allowed inside another declaration, such as the ATTLIST declarations in this example. Otherwise, I get an error in Internet Explorer 6. The DTD file is shown here:
<!ENTITY % locationAttributes
  "id CDATA #REQUIRED
   name CDATA #REQUIRED
   population CDATA #IMPLIED
   area CDATA #IMPLIED">
<!ELEMENT demographics (region*)>
<!ELEMENT region (country*)>
<!ATTLIST region %locationAttributes;>
<!ELEMENT country (city*)>
<!ATTLIST country %locationAttributes;>
<!-- code is used on country in the XML data, so it is declared as well -->
<!ATTLIST country code CDATA #IMPLIED>
<!ELEMENT city EMPTY>
<!ATTLIST city %locationAttributes;>
The entity declaration must appear within the DTD file before it is actually referenced, as shown in the preceding example. The result of the preceding XML document and DTD file is shown in Figure 13-9.
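Parameter entities are not limited to attribute lists. As a further sketch, the hypothetical external DTD file below uses a parameter entity named textOnly (an assumed name, not part of fig1309.dtd) to stand in for a repeated piece of element content:

```xml
<!-- Hypothetical external DTD file, shown only as an illustration -->
<!-- The parameter entity is declared before its first use -->
<!ENTITY % textOnly "(#PCDATA)">
<!ELEMENT region (name, population, area)>
<!-- Each reference expands to (#PCDATA) when the DTD is read -->
<!ELEMENT name %textOnly;>
<!ELEMENT population %textOnly;>
<!ELEMENT area %textOnly;>
```

Changing the single entity declaration would then change the content allowed in all three elements at once, which is the same benefit the locationAttributes entity gives for attributes.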
That is how you create Document Type Definitions (DTDs) in order to impose structural requirements on XML document data. A more advanced solution to this type of issue is XML Schema.