Beginning XML Databases (Wrox Beginning Guides)
XML Schemas are used to impose a logical structure on XML data, much like defining a mapping between XML data and relational table structures in a relational database. In addition, XML Schemas are used hand in hand with other XML technologies, including XPath 2.0, XQuery, and even things such as SOAP.
You might have noticed that DTD definitions are not written in XML, whereas XSD is. Again, XML is universally understood regardless of platform and environment; at least that is the intention for XML. XSD is also capable of deeper structural description of XML data. On the other hand, DTDs allow DTD definitions to be embedded in an XML document, and DTDs also provide entity functionality. XSDs must be in a file separate from the XML data. DTD entity functionality can be useful, but it's not much to write home about. XSDs also enforce strict typing, which allows for more accurate constraint of data values.
If you remember, Chapter 6 included a brief introduction to small parts of XML Schema and to creating XSD scripts, when I covered the SQL Server database. In Chapter 6, you saw how an XSD script can be used to create a mapping between XML data and one or more related tables in a relational database model. The intention in Chapter 6 was a purely relational database-centric approach, particularly for the SQL Server database. This chapter takes a more generic and perhaps native XML database approach to XSD. Any repetition between this chapter and Chapter 6 is intentional and necessary to facilitate explanation to all levels of expertise.
So, the most basic tag is the schema tag, demonstrated in this example script snippet:
<?xml version="1.0"?> <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> ... </xsd:schema>
Global and Local Types
Before you examine the various parameters for the <element> tag in detail, let's examine the definition of global and local types. As in any programming language, a global type is globally accessible (in this case throughout the XSD script file). A local type applies only to the contents of the element in which it is declared (the elements within the element concerned). A global type is one that is declared as a direct child of the <schema> tag, not as a deeper descendant of the <schema> tag.
A global type can be accessed from anywhere else in the XSD script, regardless of where in the XML hierarchy of the XSD script file the reference occurs. The <region> element in this XSD example is globally accessible and the <country> element is only locally accessible:
<?xml version="1.0"?> <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <xsd:element name="region"> <xsd:complexType> <xsd:sequence> <xsd:element name="country"> ... </xsd:element> </xsd:sequence> </xsd:complexType> </xsd:element> </xsd:schema>
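As a hedged illustration of why global declarations are useful, the following sketch declares <country> globally and then reuses it inside <region> through the ref attribute. The complexType and sequence wrappers are covered later in this chapter, and giving <country> a plain string type is an assumption made only to keep the sketch short.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Global: declared as a direct child of xsd:schema -->
  <xsd:element name="country" type="xsd:string"/>

  <!-- Also global: region reuses the global country declaration by reference -->
  <xsd:element name="region">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="country" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```

A locally declared <country> could not be referenced this way, because ref can only point at a global declaration.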
Basic XSD Structures
The basic XSD structures are the <element> and the <attribute> tags. As shown in the previous example, you can define elements using the <element> tag. The XSD syntax for the <element> tag is as follows:
<element name="" type="" ref="" form="" minOccurs="" maxOccurs="" default="" fixed="">
The name attribute is the name of the tag. The type attribute is the data type of the element (data types are discussed shortly). The ref attribute is a reference to another XSD definition (again to be discussed later on in this chapter). The form attribute determines whether the element name must be namespace-qualified in the XML document (qualified or unqualified). The minOccurs and maxOccurs attributes determine cardinality (how many occurrences are allowed). The default attribute supplies a value when the element is present but empty. The fixed attribute requires that the element, when present, always carries the given value.
The XSD syntax for defining an attribute is as follows:
<attribute name="" type="" ref="" form="" use="" default="" fixed="">
As you can see, there are some differences between element and attribute syntax definitions. The use="" attribute of an attribute definition is either optional, required, or prohibited. The default setting is optional, as most attributes are generally optional in nature. A required attribute must be present and a prohibited attribute is not allowed to be included.
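The following sketch shows the use, default, and fixed settings side by side. The currency and version attributes are hypothetical, invented purely for illustration, and the complexType wrapper that attaches attributes to an element is explained later in this chapter.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="country">
    <xsd:complexType>
      <!-- Must be present on every country element -->
      <xsd:attribute name="code" type="xsd:string" use="required"/>
      <!-- No use setting, so the attribute is optional -->
      <xsd:attribute name="name" type="xsd:string"/>
      <!-- Hypothetical: if omitted, the value is taken to be USD -->
      <xsd:attribute name="currency" type="xsd:string" default="USD"/>
      <!-- Hypothetical: if present, the value must be exactly 1.0 -->
      <xsd:attribute name="version" type="xsd:string" fixed="1.0"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```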
There is nothing difficult to understand with respect to the definitions of XSD elements and attributes. However, a few of the details (some of their attributes) will be covered as you read through this chapter, at appropriate points on the learning curve. For example, basic data types apply to elements and attributes.
XML Schema Data Types
XPath, XQuery, XForms, and XML Schemas all share the same basic data types. Because of this, coverage of these data types has been left until this chapter, where it makes the most sense. Additionally, at this point in this chapter, basic data types are essentially applicable to each element in an XSD script file. These data types are very similar to the data types applied to the fields of tables in a relational database.
There are numerous other basic data types in addition to those mentioned in this section, used to implement innumerable specialized capabilities. These data types include those shown in Figure 10-15, back in Chapter 10. Going into the nitty-gritty details of all the different variations of each data type is a little too detailed and advanced for this book. In this case, I will stick to the basic strings, numbers, dates, times, and miscellaneously categorized basic data types.
XML Schema String Data Types
Obviously, string data types can contain string values. You can define a value to be a string data type, as in the following example:
<xsd:element name="region" type="xsd:string"/>
A string data type (xsd:string) preserves white space characters in the string value, including new lines, tabs, and space characters. So, the following element will be kept exactly as it stands here:
<region> North America </region>
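A normalized string data type (xsd:normalizedString) replaces each tab, new line, and carriage return with a space character, but it does not trim or collapse spaces:
<xsd:element name="region" type="xsd:normalizedString"/>
To also strip leading and trailing spaces and collapse runs of spaces into one, use the xsd:token data type instead. With the region element declared as xsd:token, the preceding example would be processed like this even if extra space characters are included in the XML document:
<region>North America</region>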
XML Schema Numeric Data Types
Standard numeric data types in XSD are more or less the same as in any strictly typed programming language or database engine. Figure 10-15 shows numeric data types of int, long, negativeInteger, nonNegativeInteger, nonPositiveInteger, positiveInteger, short, unsignedInt, unsignedLong, unsignedShort, decimal, double, and float. Mathematically speaking, and from the perspective of computer programming, these data types are all self-explanatory.
For example, the following declares a number to be an integer. An integer is a whole number:
<xsd:element name="population" type="xsd:integer"/>
This would be a valid entry for an integer:
<population>2000000</population>
And this would be an invalid entry for an integer because a whole number does not have decimals:
<population>2000000.54</population>
A more fitting definition for the preceding real number (a number containing a decimal value) is as follows:
<xsd:element name="population" type="xsd:decimal"/>
XML Schema Date and Time Data Types
Date and time data types can be used to restrict values to contain easily understandable date and time values. A simple date data type can be defined as follows:
<xsd:element name="entryDate" type="xsd:date"/>
The default format for a date is of the form YYYY-MM-DD, indicating a four-digit year, a month, and a day. So, the following is a valid date entry:
<entryDate>2006-06-10</entryDate>
You can also specify dates with time zones, times, dates including times, and durations of time in all sorts of forms such as years, months, and so on. For example, a datetime value can be specified as follows:
<xsd:element name="entryDateTime" type="xsd:dateTime"/>
The default format for datetime values is YYYY-MM-DDThh:mm:ss, indicating years, months, days, hours, minutes, and seconds. For example, this value indicates 10:41 a.m. and 1 second, on the morning of June 10, 2006:
<entryDateTime>2006-06-10T10:41:01</entryDateTime>
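As a rough sketch of the other date and time forms mentioned above, the following declarations use the built-in xsd:time, xsd:dateTime, xsd:duration, and xsd:gYear types. The element names entry, entryTime, retention, and censusYear are hypothetical and are not part of the demographics database.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="entry">
    <xsd:complexType>
      <xsd:sequence>
        <!-- A time of day, for example 10:41:01 -->
        <xsd:element name="entryTime" type="xsd:time"/>
        <!-- A time zone can be appended: 2006-06-10T10:41:01Z (UTC)
             or 2006-06-10T10:41:01-05:00 (five hours behind UTC) -->
        <xsd:element name="entryDateTime" type="xsd:dateTime"/>
        <!-- A duration, for example P1Y2M (1 year and 2 months) -->
        <xsd:element name="retention" type="xsd:duration"/>
        <!-- A year on its own, for example 2006 -->
        <xsd:element name="censusYear" type="xsd:gYear"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```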
XML Schema Miscellaneous Data Types
A miscellaneous data type in any computer technical text is often a method of lumping in things that don't really fit too well anywhere else. For example, consider a Boolean data type, which can be true (or 1), or false (or 0):
<xsd:element name="country" type="xsd:string"/> <xsd:attribute name="languages" type="xsd:boolean"/> <xsd:attribute name="occupations" type="xsd:boolean"/>
So, a country in the demographics database containing at least one <languages> element and no <occupations> elements within its <population> element subtree can be represented by the following <country> element:
<country id="1" code="AG" name="Algeria" languages="true" occupations="false">
Another miscellaneous data type is the anyURI data type, representing a Uniform Resource Identifier (URI), such as a web page address on the Internet. So, you can add a URI to the previous country definition like this:
<xsd:attribute name="webpage" type="xsd:anyURI"/>
And then change the country of Algeria, as follows:
<country id="1" code="AG" name="Algeria" webpage="http://www.algeria.com" languages="true" occupations="false"/>
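For the <country> element above to validate, its attributes have to be attached to the element declaration. One way to do that, sketched here under the assumption that <country> is an empty element and that id is an integer, is to nest the attribute declarations inside a complexType (complex types are covered later in this chapter):

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="country">
    <xsd:complexType>
      <!-- The types chosen for id, code, and name are assumptions -->
      <xsd:attribute name="id" type="xsd:integer" use="required"/>
      <xsd:attribute name="code" type="xsd:string"/>
      <xsd:attribute name="name" type="xsd:string"/>
      <xsd:attribute name="webpage" type="xsd:anyURI"/>
      <xsd:attribute name="languages" type="xsd:boolean"/>
      <xsd:attribute name="occupations" type="xsd:boolean"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```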
Cardinality
You may remember the minOccurs and maxOccurs attributes of the XSD <element> tag from the beginning of this section. These attributes determine cardinality. Cardinality determines how many times a specific item can occur, in this case an element. (Whether an attribute may or must appear is controlled by its use setting instead.)
Cardinality cannot be defined for global declarations because global declarations define a type (in the relational world), or a class (in the object world). A type is instantiated as a variable and a class as an object. In other words, there is no such thing as multiple iterations of a type of a class, but there are multiple iterations of a variable (declare many variables of the same data type), or an object (create many objects of the same class). So, in the case of an XSD definition, you can iterate the referenced object, but not the definition of that reference.
The possible values for cardinality are as follows:
- minOccurs: A non-negative integer, which is a value from 0 upward. The default value is 1.
- maxOccurs: A non-negative integer, or unbounded (meaning no upper limit). The default value is 1. maxOccurs must be greater than or equal to minOccurs.
In the following example, between 1 and 20 regions are allowed in a single demographics native XML database document (it can get large). And countries within each region can be none, or any number:
<xsd:element name="region" minOccurs="1" maxOccurs="20"/> <xsd:element name="country" minOccurs="0" maxOccurs="unbounded"/>
That is what cardinality means!
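To see these two declarations in context, here is a minimal sketch of a complete schema. The root element name demographics and the string type given to <country> are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="demographics">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Between 1 and 20 regions per document -->
        <xsd:element name="region" minOccurs="1" maxOccurs="20">
          <xsd:complexType>
            <xsd:sequence>
              <!-- Zero or more countries per region -->
              <xsd:element name="country" type="xsd:string"
                           minOccurs="0" maxOccurs="unbounded"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Under this sketch, a document containing a single <region> with no <country> children would be accepted, while a document with no <region> at all, or with 21 of them, would be rejected.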
Element Ordering Sequence and Choice
Sequencing means that a number of elements, contained within some other collection definition, must appear in a specific order (a sequence). A choice means that one element is selected from a list of alternative elements.
The syntax of the <sequence> element is as follows, implying that sequences have cardinality and that they can be repeated:
<sequence minOccurs="" maxOccurs="">
Here is an example sequence definition:
<xsd:element name="region" minOccurs="1" maxOccurs="20"> <xsd:complexType> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="population" type="xsd:integer"/> <xsd:element name="area" type="xsd:integer"/> <xsd:element name="country" minOccurs="0" maxOccurs="unbounded"> <xsd:complexType> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="population" type="xsd:integer"/> <xsd:element name="area" type="xsd:integer"/> </xsd:sequence> </xsd:complexType> </xsd:element> </xsd:sequence> </xsd:complexType> </xsd:element>
If you use the preceding definition, both regions and countries can contain all of the name, population, and area values; but they must occur in the sequence of name, population, and area. Additionally, the region has one final element for countries within regions. In other words, this is a valid <region> tag:
<region> <name>Africa</name> <population>789548670</population> <area>26780325</area> <country> ... </country> </region>
And this is not a valid <region> tag because the <area> and <population> tags are in the wrong sequence:
<region> <name>Africa</name> <area>26780325</area> <population>789548670</population> <country> ... </country> </region>
There is also an <all> element, which allows its child elements to appear in any order, with each child element appearing at most once.
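As a brief sketch of <all>, the following declaration accepts the <name>, <population>, and <area> children of a region in any order, with each appearing at most once. Making <area> optional is an assumption added here only to show that minOccurs="0" is permitted inside <all>.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="region">
    <xsd:complexType>
      <xsd:all>
        <xsd:element name="name" type="xsd:string"/>
        <xsd:element name="population" type="xsd:integer"/>
        <!-- Optional: may be left out, but may not be repeated -->
        <xsd:element name="area" type="xsd:integer" minOccurs="0"/>
      </xsd:all>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```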
Where the <sequence> element enforces element order, the <choice> element allows a selection from a list of elements. Again, the syntax is similar to that of the <sequence> element for the same reason, cardinality:
<choice minOccurs="" maxOccurs="">
Here is an example choice definition, sensibly allowing a single population entry in various different formats (including all would be a little pointless because they can all be recalculated):
<xsd:element name="region" minOccurs="1" maxOccurs="20"> <xsd:complexType> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="population" type="xsd:integer"/> <xsd:choice> <xsd:element name="populationInHundreds" type="xsd:integer"/> <xsd:element name="populationInThousands" type="xsd:integer"/> <xsd:element name="populationInMillions" type="xsd:integer"/> </xsd:choice> <xsd:element name="area" type="xsd:integer"/> <xsd:element name="country" minOccurs="0" maxOccurs="unbounded"> <xsd:complexType> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="population" type="xsd:integer"/> <xsd:element name="area" type="xsd:integer"/> </xsd:sequence> </xsd:complexType> </xsd:element> </xsd:sequence> </xsd:complexType> </xsd:element>
Custom Data Types
A custom data type allows a programmer to create user-defined data types. Both simple and complex data types in XSD allow for the creation of custom data types. A simple data type is a predefined data type, such as a string or an integer, but with some refinement. A complex data type is a custom-built data type consisting of multiple elements, much like a struct in the C programming language, a new table in a relational database, or even a new class in an object development environment.
Let's begin with simple data types.
Simple Data Types
A simple data type is a predefined data type, such as a string or an integer, but with some added refinement. The syntax for a simple data type is as follows:
<simpleType name="" final="">
A simpleType declaration is always based on a data type that already exists (a built-in XSD data type or another user-defined type). There are three types of refinements that can be made to an existing data type using simpleType : a restriction, a list, or a union.
Restricting Simple Types
Restriction simpleType definitions use the <restriction> element. This option restricts a data type to a narrower definition, such as a more restrictive range of values. Restrictions use what are called simple type facets. A facet is essentially a restriction. Facets can be applied to user-defined simple types as well as to the built-in data types. Let's begin with this XML fragment from the demographics XML document database:
<population> <year year="2003" population_id="410" population="31713719"/> <year year="2004" population_id="411" population="32129324"/> <year year="2005" population_id="412" population="32531853"/> <year year="2006" population_id="413" population="32930091"/> <year year="2010" population_id="417" population="34554588"/> <year year="2020" population_id="427" population="38555436"/> </population>
The basic syntax for a simpleType restriction is as follows:
<simpleType> <restriction base="<datatype type>"> ... application of facets ... </restriction> </simpleType>
The base attribute is the simpleType being derived from, which can be a basic data type, or even another user-defined type. For example, <xsd:restriction base="xsd:string">.
Facets as applied to a simpleType , as a restriction, are what make up the refinement or redefinition of a simpleType definition. The various facets available for use are shown in Figure 13-10. A facet is what in a relational database would be called a constraint.
In the following example, an attribute called year is restricted to integer values anywhere between the years 2003 and 2020, such that the year 2002 is excluded from the permissible range of values. Also, the year value may have at most four digits (totalDigits sets a maximum, so combined with the range every valid year has exactly four digits, a valid Y2K year representation):
<xsd:attribute name="year"> <xsd:simpleType> <xsd:restriction base="xsd:integer"> <xsd:minExclusive value="2002"/> <xsd:maxInclusive value="2020"/> <xsd:totalDigits value="4"/> </xsd:restriction> </xsd:simpleType> </xsd:attribute>
An enumerated list is a list of permissible values. For example, you can further restrict the years by changing the code as shown in the example that follows, where years are restricted to 2003, 2004, 2005, 2006, 2010, and 2020:
<xsd:attribute name="year"> <xsd:simpleType> <xsd:restriction base="xsd:integer"> <xsd:enumeration value="2003"/> <xsd:enumeration value="2004"/> <xsd:enumeration value="2005"/> <xsd:enumeration value="2006"/> <xsd:enumeration value="2010"/> <xsd:enumeration value="2020"/> <xsd:totalDigits value="4"/> </xsd:restriction> </xsd:simpleType> </xsd:attribute>
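The restrictions so far are anonymous, defined inline within the attribute. As a hedged sketch, a restriction can also be given a name and declared globally so that it can be reused through the type attribute. The name yearType is hypothetical, and the types assigned to population_id and population are assumptions based on the XML fragment shown earlier.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Named, global simple type: integers from 2003 to 2020 inclusive -->
  <xsd:simpleType name="yearType">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="2003"/>
      <xsd:maxInclusive value="2020"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- The year attribute refers to the named type instead of repeating it -->
  <xsd:element name="year">
    <xsd:complexType>
      <xsd:attribute name="year" type="yearType" use="required"/>
      <xsd:attribute name="population_id" type="xsd:integer"/>
      <xsd:attribute name="population" type="xsd:nonNegativeInteger"/>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```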
Here is another interesting example that can be used to restrict the entry value for a telephone number, restricting a string to a telephone number format, as in (123) 456-7890:
<xsd:element name="telephone"> <xsd:simpleType> <xsd:restriction base="xsd:string"> <xsd:length value="14"/> <xsd:pattern value="\([0-9]{3}\) [0-9]{3}-[0-9]{4}"/> </xsd:restriction> </xsd:simpleType> </xsd:element>
Simple Type List Declarations
The <list> element is used to create a whitespace-separated list of items. In effect, this approach can be used to create a list of multiple values, or perhaps a list of multiple options. Let's say you wanted to change the demographics data to include a new <yearSummary> element, which listed all the years of 2003, 2004, 2005, 2006, 2010, and 2020:
<xsd:element name="yearSummary"> <xsd:simpleType> <xsd:list itemType="xsd:nonNegativeInteger"/> </xsd:simpleType> </xsd:element>
The result would be something like the following:
<population> <yearSummary>2003 2004 2005 2006 2010 2020</yearSummary> <year year="2003" population_id="410" population="31713719"/> <year year="2004" population_id="411" population="32129324"/> <year year="2005" population_id="412" population="32531853"/> <year year="2006" population_id="413" population="32930091"/> <year year="2010" population_id="417" population="34554588"/> <year year="2020" population_id="427" population="38555436"/> </population>
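A list type can itself be restricted. In the following sketch, which assumes the hypothetical type names yearList and yearSummaryType, the length facets count the items in the list rather than the characters, so a summary must contain between one and six years:

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- A whitespace-separated list of non-negative integers -->
  <xsd:simpleType name="yearList">
    <xsd:list itemType="xsd:nonNegativeInteger"/>
  </xsd:simpleType>

  <!-- The same list, limited to between 1 and 6 items -->
  <xsd:simpleType name="yearSummaryType">
    <xsd:restriction base="yearList">
      <xsd:minLength value="1"/>
      <xsd:maxLength value="6"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="yearSummary" type="yearSummaryType"/>

</xsd:schema>
```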
Union List Declarations
A <union> element combines two or more simple types into one. A value is valid if it conforms to at least one of the member types, so a union widens the set of permitted values rather than applying every restriction at the same time. So, let's say hypothetically that the previous examples were created a little differently, where restrictions were created in two different named simpleType definitions:
<xsd:simpleType name="yearRange"> <xsd:restriction base="xsd:integer"> <xsd:minExclusive value="2002"/> <xsd:maxInclusive value="2020"/> </xsd:restriction> </xsd:simpleType>
And here is the second definition:
<xsd:simpleType name="yearLength"> <xsd:restriction base="xsd:integer"> <xsd:totalDigits value="4"/> </xsd:restriction> </xsd:simpleType>
So, you can merge the preceding two definitions something like this:
<xsd:attribute name="yearRestricted"> <xsd:simpleType> <xsd:union memberTypes="yearRange yearLength"/> </xsd:simpleType> </xsd:attribute>
The result validates any year that satisfies either definition: a value in the range 2003 to 2020, or any integer of at most four digits. To require both conditions at once, place the facets together in a single restriction, as in the earlier year example.
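A union is most useful when its member types accept different kinds of values, because a value only has to match one member type. The following sketch, using the hypothetical type names knownYear, unknownYear, and yearOrUnknown, accepts either a year from 2003 to 2020 or the literal word unknown:

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Member type 1: a year from 2003 to 2020 inclusive -->
  <xsd:simpleType name="knownYear">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="2003"/>
      <xsd:maxInclusive value="2020"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Member type 2: the single keyword "unknown" -->
  <xsd:simpleType name="unknownYear">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="unknown"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- A value is valid if it matches either member type -->
  <xsd:simpleType name="yearOrUnknown">
    <xsd:union memberTypes="knownYear unknownYear"/>
  </xsd:simpleType>

  <xsd:attribute name="year" type="yearOrUnknown"/>

</xsd:schema>
```

With this sketch, year="2010" and year="unknown" would both be accepted, while year="1999" and year="n/a" would not.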
Complex Data Types
A complex data type typically involves a structure consisting of more than one element. Subsequently, that complexType can be accessed as a single item that contains child elements. Here is a simple example, combining elements for each region into a single type structure:
<xsd:element name="region"> <xsd:complexType> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="population" type="xsd:integer"/> <xsd:element name="area" type="xsd:integer"/> </xsd:sequence> </xsd:complexType> </xsd:element>
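Like a simple type, a complex type can be named and declared globally so that several elements share one structure. This is a minimal sketch; the type name placeType is hypothetical, and it assumes regions and countries carry the same three child elements.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Named, global complex type shared by region and country -->
  <xsd:complexType name="placeType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="population" type="xsd:integer"/>
      <xsd:element name="area" type="xsd:integer"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:element name="region" type="placeType"/>
  <xsd:element name="country" type="placeType"/>

</xsd:schema>
```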
Substitution
The <group> element can be used to create a structure, which can then be referred to later on by reference. The result is a form of substitution, as shown in the following example. This is the grouping:
<xsd:group name="regionGroup"> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="population" type="xsd:integer"/> <xsd:element name="area" type="xsd:integer"/> </xsd:sequence> </xsd:group>
And this script snippet refers to the declaration of the <group> element:
<xsd:element name="region"> <xsd:complexType> <xsd:sequence> <xsd:group ref="regionGroup"/> <xsd:element name="malePopulation" type="xsd:integer"/> <xsd:element name="femalePopulation" type="xsd:integer"/> </xsd:sequence> </xsd:complexType> </xsd:element>
The <group> element allows specification and later reference to groups of elements. The <attributeGroup> element does the same but for attributes. So, I can create an attribute group something like this:
<xsd:attributeGroup name="regionAttributes"> <xsd:attribute name="region" type="xsd:string"/> <xsd:attribute name="population" type="xsd:string"/> <xsd:attribute name="area" type="xsd:string"/> </xsd:attributeGroup>
And this is how the group of attributes can be referenced later on:
<xsd:element name="region"> <xsd:complexType> <xsd:sequence> <xsd:element name="region"/> </xsd:sequence> <xsd:attributeGroup ref="regionAttributes"/> </xsd:complexType> </xsd:element>
This chapter has attempted to demonstrate how a mapping between XML document data and a relational database can be achieved. Document Type Definition (DTD) is a little less up-to-date than XML Schema (XSD).