Using XML with Legacy Business Applications

2017-07-07 02:10:07

We're now ready to look at some more nuts-and-bolts logic and code. In this section we'll review the models for our main routines and examine some base classes from which we'll derive the classes used in implementation.

Main Routine Structures

All of our main routines will be similar, but they will fall into two distinct types. One type will convert non-XML formats to XML, and the other will convert XML to non-XML formats. Aside from this, we'll code the main processing logic as callable functions and the main routines that handle command line processing as basic shells. This strategy gives us the option not only to create stand-alone utilities now but also to build an integrated system later that calls all the functions from one main routine.

Main Routine for Legacy Format to XML Conversion

Arguments:
  Input File Name
  Output Directory Name
  File Description Document Name
Options:
  Validate Output
  Help
Verify options and display help
Create new SourceConverter, passing validation option, Output Directory Name, and File Description Document Name
Call SourceConverter's processFile method, passing Input File Name
Display completion message

Main Routine for XML to Legacy Format Conversion

Arguments:
  Input Directory Name
  Output File Name
  File Description Document Name
Options:
  Validate Input Documents
  Help
Verify options and display help
Open Output File
Create new TargetConverter, passing Output Stream and File Description Document Name
Set up implementation specific DOM environment
DO for all Input Documents in Input Directory
  New DOM Document <- Load, parse, and validate input document
  Call TargetConverter's processDocument method, passing DOM Document
ENDDO
Close Output Stream
Display completion message with number of documents processed

This approach gives us the flexibility in an enhanced utility to call the Target Converter's processDocument method on a DOM Document in memory, perhaps after it has been created from another DOM Document via an XSLT transformation.

The main difference between the utilities for converting to and from XML is in the structure of the main shell routine. For converting to XML most of the work occurs in handling the command line arguments. In converting from XML we need to open the input directory, then loop through the list of files in the directory, loading and parsing each and passing the resulting DOM document to the processDocument method.
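As an illustration only, the directory loop described above might look like the following Java sketch. It assumes the standard JAXP parser that ships with Java. The DocumentHandler interface is a hypothetical stand-in for the TargetConverter's processDocument method, and the .xml file name filter is an assumption. Validation is left out to keep the sketch short.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlDirectoryLoopSketch {

    /** Hypothetical stand-in for the TargetConverter's processDocument method. */
    interface DocumentHandler {
        void processDocument(Document inputDocument);
    }

    /**
     * Loads and parses each .xml file in the directory and passes the DOM
     * Document to the handler. Returns the number of documents processed.
     */
    static int processDirectory(File inputDirectory, DocumentHandler handler)
            throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // Assumption: input documents carry a .xml extension.
        File[] files = inputDirectory.listFiles((dir, name) -> name.endsWith(".xml"));
        if (files == null) {
            throw new IOException("Cannot read directory: " + inputDirectory);
        }
        Arrays.sort(files); // listFiles makes no ordering guarantee

        int count = 0;
        for (File file : files) {
            Document document = builder.parse(file);
            handler.processDocument(document);
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java XmlDirectoryLoopSketch <input directory>");
            return;
        }
        int count = processDirectory(new File(args[0]),
                doc -> System.out.println(doc.getDocumentElement().getTagName()));
        System.out.println(count + " documents processed");
    }
}
```

Because the handler receives a DOM Document rather than a file name, the same call could be made on a Document that was built in memory.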

To implement the converters we'll develop a converter base class, with derived source and target converter classes. Each legacy format will derive a source and target class.

NOTE Chapter Conventions

When discussing the classes in this section, the arguments and return values listed for the various functions are logical only and are intended to convey the basic functionality. They will vary for the Java and C++ implementations. For example, if a method returns a string, the Java version may actually return a string whereas the C++ version may take a char * pointer as an argument, passing the output area by reference, and write the results to that area. In addition, where a method is listed as returning no value, the C++ implementation may return a status value, whereas the Java version may throw an exception.

Also, I list here only the primary methods. Various set and get methods may be omitted.

Converter Base Class

Overview

The Converter class is the base class from which the source and target converters are derived. The bullets below summarize the primary attributes and methods.

Attributes:

DOM Document File Description Document

DOM Element Grammar Element

String Root Element Name

Methods:

Constructor

loadFileDescriptionDocument

Methods

Constructor

The base class Converter constructor performs only some very basic setup.

Logic for the Converter Constructor Method

Arguments:
  None
Set up implementation dependent environment for loading and parsing the File Description Document

loadFileDescriptionDocument

This operation is coded as a separate function since the EDI converter may need to load more than one file description document as it processes an EDI interchange.

Logic for the Converter loadFileDescriptionDocument Method

Arguments:
  String File Description Document name
Returns:
  Status value or throws exception
DOM Document File Description Document <- Load, parse, and validate File Description Document from passed file name
Grammar Element <- call File Description Document's getElementsByTagName for "Grammar"
Root Element Name <- call Grammar Element's getAttribute on "ElementName"
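A minimal Java sketch of this logic follows, assuming the standard JAXP and DOM APIs. To stay self-contained it reads from an InputStream instead of a file name and skips validation; both are simplifications. The ConverterSketch class name and the FileDescription root element in the sample are hypothetical; only the Grammar element and its ElementName attribute come from the logic above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ConverterSketch {
    Document fileDescriptionDocument;
    Element grammarElement;
    String rootElementName;

    /** Loads the file description document and caches the Grammar Element. */
    void loadFileDescriptionDocument(InputStream in) {
        try {
            fileDescriptionDocument =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("Cannot load file description document", e);
        }
        // getElementsByTagName returns a NodeList; take the first match.
        NodeList grammars = fileDescriptionDocument.getElementsByTagName("Grammar");
        if (grammars.getLength() == 0) {
            throw new IllegalStateException("No Grammar element found");
        }
        grammarElement = (Element) grammars.item(0);
        rootElementName = grammarElement.getAttribute("ElementName");
    }

    public static void main(String[] args) {
        // Hypothetical file description document, for illustration only.
        String xml = "<FileDescription><Grammar ElementName=\"Invoice\"/></FileDescription>";
        ConverterSketch converter = new ConverterSketch();
        converter.loadFileDescriptionDocument(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(converter.rootElementName); // prints "Invoice"
    }
}
```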

SourceConverter Base Class (Extends Converter)

Overview

The SourceConverter, derived from the Converter class, is the base class for the CSV, flat file, and EDI source converter classes developed in the next three chapters. The lists below summarize the major attributes and methods.

Attributes:

DOM Document Output Document

RecordReader Reader

String Base Output Directory

String Schema Location URL

Boolean Validation Option

Methods:

Constructor

processFile (virtual)

processGroup

saveDocument

The processGroup method is listed here for completeness but will be developed in Chapter 8.

Methods

Constructor

The operations common to all our legacy format source converters include setting up the base output directory name and implementation-dependent DOM setup for output document validation.

Logic for the SourceConverter Constructor Method

Arguments:
  Boolean Validation Option
  String Base Output Directory
Call base class constructor
Validation Option <- Passed validation option
Base Output Directory <- From passed base output directory, appending directory separator character if required
Perform any implementation specific setup for validating output documents before writing them

saveDocument

This operation is coded as a separate function because it may be called both within the while loop that reads records from the input file and, after the loop exits, to save the last document.

Logic for the SourceConverter saveDocument Method

Arguments:
  DOM Document Output Document
  String Output File Name
  Boolean Validate
Returns:
  Status value or throws exception
IF Validate is true
  Validate output DOM Document
ENDIF
Save DOM Output Document to passed File Name, dependent on implementation
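The save step is implementation dependent. One possible Java approach, sketched here as an assumption rather than as the book's implementation, serializes the DOM Document with an identity Transformer from the standard javax.xml.transform package. The validation step is omitted, and the SaveDocumentSketch class name is hypothetical.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

class SaveDocumentSketch {

    /** Writes the DOM Document to the given file as UTF-8 XML. */
    static void saveDocument(Document outputDocument, File outputFile) {
        try (OutputStream out = new FileOutputStream(outputFile)) {
            // A Transformer created without a stylesheet copies input to output.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.transform(new DOMSource(outputDocument), new StreamResult(out));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot save " + outputFile, e);
        }
    }

    public static void main(String[] args) throws Exception {
        Document document =
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        document.appendChild(document.createElement("Invoice"));
        File file = File.createTempFile("invoice", ".xml");
        saveDocument(document, file);
        System.out.println("Saved " + file);
    }
}
```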

An enhanced version of this utility might give the option of looking up and calling an in-memory XSLT transformation on the DOM Document before it is saved to disk. We might even hand off the DOM Document in memory to a message handling service for e-commerce applications. We won't get that advanced in this book (KISS over efficiency), but the options certainly exist.

Derived Class Method

processFile

This method is declared as a virtual method in the base class and implemented separately in each of the derived classes for our legacy formats. The general processing model is to read a single file with multiple logical business documents, create a separate XML instance document from each logical document in the input file, and write a separate file for each of the XML instance documents. In an e-commerce type of situation where some of the documents are intended for different external organizations, we'll write the documents for each organization to a separate directory. The listing below shows the general logic.

Logic for the SourceConverter processFile Method

Arguments:
  String Input File Name
Initialize Sequence Number
Open Input File
Call RecordReader's setInputStream method, passing the input Stream
Record Length <- Call RecordReader's readRecord method
DO while Record Length >= 0
  Record Grammar <- get Grammar for Record Type
  Call RecordReader's parseRecord method, passing Record Grammar
  Call RecordReader's toXMLType method
  IF Break on Partner ID
    Output Directory <- Base Output Directory + Partner ID
    IF New Partner
      Store new Partner in Partner Array
      Make Output Directory for Partner
    ENDIF
  ENDIF
  IF Break on Document
    IF Output DOM Document != null
      Call saveDocument, passing Output DOM Document and Output File Path
      Set Output DOM Document to null
    ENDIF
    Create new Output DOM Document
    Create Root Element and Append to Output DOM Document
    IF Schema Location URL is not null
      Create noNamespaceSchemaLocation Attribute and append to Root Element
    ENDIF
    Increment Sequence Number
    Pad Sequence Number to 3 digits with leading zeroes
    Output File Path <- Output Directory + Sequence Number + ".xml"
    Call RecordReader's setOutputDocument method, passing Output DOM Document
  ENDIF
  Call RecordReader's writeRecord method, passing Parent Element and Grammar Element for Record
  Record Length <- Call RecordReader's readRecord method
ENDDO
IF Output DOM Document != null
  Call saveDocument, passing Output DOM Document and Output File Path
ENDIF
Close input stream
Display completion message with the number of documents processed
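The output file naming in this logic can be sketched in a few lines of Java. The OutputPathSketch class and outputFilePath method are hypothetical names, and appending a separator after the Partner ID is an assumption; the logic above states only that the Partner ID is added to the Base Output Directory.

```java
import java.io.File;
import java.util.Locale;

class OutputPathSketch {

    /**
     * Builds the Output File Path from the base directory, an optional
     * Partner ID, and the Sequence Number padded to three digits.
     */
    static String outputFilePath(String baseOutputDirectory, String partnerId,
                                 int sequenceNumber) {
        String directory = baseOutputDirectory;
        if (!directory.endsWith(File.separator)) {
            directory += File.separator;
        }
        if (partnerId != null) {
            // Assumption: each partner gets its own subdirectory.
            directory += partnerId + File.separator;
        }
        // Locale.ROOT keeps the padded digits ASCII on every platform.
        return directory + String.format(Locale.ROOT, "%03d", sequenceNumber) + ".xml";
    }

    public static void main(String[] args) {
        System.out.println(outputFilePath("out", "ACME", 7));
        System.out.println(outputFilePath("out", null, 12));
    }
}
```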

TargetConverter Base Class (Extends Converter)

Overview

The TargetConverter, derived from the Converter class, is the base class for the CSV, flat file, and EDI target converter classes developed in the next three chapters. The lists below summarize the major attributes and methods.

Attributes:

RecordWriter Object

Methods:

Constructor

processDocument (virtual)

processGroup

The processGroup method is listed here for completeness but will be developed in Chapter 8.

Method

Constructor

Similar to the SourceConverter's constructor, the TargetConverter's constructor is very basic.

Logic for the TargetConverter Constructor Method

Arguments:
  String File Description Document Name
Call base class constructor
Call loadFileDescriptionDocument from passed document

You will note that in this constructor, unlike the SourceConverter constructor, we load the file description document. This is because of differences in how we will process EDI files in Chapter 9. An EDI interchange that we read may have several different documents in it and may need several different file description documents. So, we don't want to load the file description document in the base class constructor. However, when we write an EDI interchange we'll just accept one document type as input. This is the same model we follow for CSV and flat files. We'll load the file description document in the base class target converter.

Derived Class Method

processDocument

As with the processFile method in the SourceConverter, this method is declared as a virtual method in the base class and implemented separately in each of the derived classes for our legacy formats. The general processing model is to read from multiple XML documents in multiple files and create a single output file in the legacy format. We'll also make our life somewhat easier by reading the files from a single directory rather than multiple directories. The listing below shows the general logic.

Logic for the TargetConverter processDocument Method

Arguments:
  DOM Document Input Document
Root Element <- Call Input Document's getDocumentElement method
Root Element Name <- Call Root Element's getTagName
IF (Root Element Name != Root Element Name from Grammar Element)
  Return Error
ENDIF
Child Element <- Get Root Element's firstChild, skipping over non-Element Nodes
DO for all Record Elements, starting with first child of Root
  Record Grammar Element <- Get Record Grammar Element from Document Grammar Element
  Call RecordWriter's parseRecord method, passing Record Element and Record Grammar Element
  Call RecordWriter's writeRecord method
ENDDO
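The root check and the walk over Record Elements might be sketched in Java as follows, using only the standard DOM API. The class and method names are hypothetical. In place of the RecordWriter calls, the sketch simply collects each record's tag name so that the traversal is easy to see.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ChildElementLoopSketch {

    /** Returns the first Element at or after the given node, or null. */
    static Element elementFrom(Node node) {
        while (node != null && node.getNodeType() != Node.ELEMENT_NODE) {
            node = node.getNextSibling(); // skip text, comments, and so on
        }
        return (Element) node;
    }

    /** Checks the root name, then visits every Record Element child in order. */
    static List<String> recordNames(Document inputDocument, String expectedRootName) {
        Element root = inputDocument.getDocumentElement();
        if (!root.getTagName().equals(expectedRootName)) {
            throw new IllegalArgumentException("Unexpected root: " + root.getTagName());
        }
        List<String> names = new ArrayList<>();
        for (Element record = elementFrom(root.getFirstChild());
             record != null;
             record = elementFrom(record.getNextSibling())) {
            // The real converter would call parseRecord and writeRecord here.
            names.add(record.getTagName());
        }
        return names;
    }

    /** Helper that parses a string so the sketch needs no input file. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document document = parse("<Orders>\n  <Header/>\n  <!-- note -->\n  <Detail/>\n</Orders>");
        System.out.println(recordNames(document, "Orders")); // prints [Header, Detail]
    }
}
```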

That's the basic setup. Pretty simple, huh? Simple is good. Simple is robust, maintainable, and extensible.

RecordHandler Base Class

Overview

This class provides basic support for handling records in non-XML-formatted files. It is the ultimate base class from which all of our various reader and writer classes are derived. It has a few basic data structures and utility methods.

Attributes:

Array of Pointers DataCell Array

Integer Highest Cell

DOM Document File Description Document

Byte or Character Array Record Buffer

Integer Record Buffer Length

Byte Record Terminator1

Byte Record Terminator2

Terminator1 is used for all record formats except fixed length flat files. It is used to store the primary record terminator character for variable length files and the segment terminator for EDI. Terminator2 is used to store the secondary record terminator character for variable length files. It usually has a value only if the physical record terminators are a carriage return and line feed pair.

Methods:

The methods defined in this class are very basic utility methods.

Constructor

createDataCell

getElementText

getFieldValue

setDelimiter

setTerminator

Methods

Constructor

The constructor function basically just sets up the DataCell Array and initializes the other class Attributes.

Logic for the RecordHandler Constructor Method

Arguments:
  DOM File Description Document
Create DataCell Array
Create Record Buffer
Set Record Buffer Length to zero
Highest Cell <- -1
Save Passed File Description Document

createDataCell

This method creates the DataCell derived class corresponding to the passed data type, increments Highest Cell, and loads the new cell into the next entry in the CellArray.

Logic for the RecordHandler createDataCell Method

Arguments:
  Integer Field Number
  DOM Element Field Grammar
Returns:
  Pointer to DataCell Object
Cell DataType <- call Field Grammar's getAttribute on DataType
For each supported data type do IF block as follows
  IF (Cell DataType = Constant Code for the type)
    Create new DataCell derived class for the type
  ENDIF
IF new DataCell is null
  Return error
ENDIF
Increment HighestCell
CellArray[HighestCell] <- New Cell
Return new DataCell

A brief word is in order regarding how we fill and use the DataCell Array. There are basically two approaches to how we could use the Array: (1) we could use an indexed approach in which the field number is used as a direct index into the DataCell Array, or (2) we could just fill the array from the bottom up.

In the indexed approach, we can use the field number (which we determine from parsing or from the grammar) as an index into the DataCell Array, loading the nth Array entry with the contents of the nth field. This approach has a certain elegance and makes some of the algorithms a bit simpler than the other approach. In the most efficient implementation we could use an indexed approach not only for accessing the DataCell Array but also for accessing the Grammar Elements that describe the fields in the legacy record format.

The other approach is to just fill the Array from the bottom up, starting at index zero, without trying to maintain any correspondence between the Array index and the field number. This approach means that when we need the field number we have to retrieve it from the DataCell object in the Array, which means a bit more code than the other approach.

Because EDI has subfields (that is, subelements), we can't use the indexed approach with a single DataCell Array for all our formats. The indexed approach requires a one-for-one, static correspondence between field (and subfield) numbers and Array entries. To implement it we would have to set up a secondary Array for each EDI data element that has subfields, with the associated pointers, and so on. This starts to get more complicated and clever than it is worth. So, since in pragmatic terms we are limited to the bottom-up, sequential loading approach for EDI, I'm going to use it for all the legacy formats. One consistent approach is easier to code and maintain in the long run than two or more.

NOTE Adding New Data Types

If you decide to add a new data type to the code, in addition to creating the derived DataCell class you need to edit this routine. Follow the comments in the Java or C++ code to add an IF test and create the DataCell derived class of your new type.

getElementText

This simple utility method gets the Text Node associated with the source Element, returning the Text Node's value. If an Element has only one Text Node as its child, and we can be sure there is content in the Node, we can probably do this with one line of Java or C++ code. However, we can't always be sure of this since some Elements, such as those with the schema language built-in data type of string, may have Comment Nodes as children. I'm providing a utility method with enough code in it to avoid the more common exceptions, such as an empty Element.

Logic for the RecordHandler getElementText Method

Arguments:
  DOM Element Source Element
Returns:
  String Text Content
Child Node <- get firstChild of Source Element
DO while Child Node != null and Child Node Type != Text Node
  Child Node <- Child Node's nextSibling
ENDDO
IF Child Node == null
  Return null
ENDIF
Return Child Node's getNodeValue
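This logic maps almost line for line onto the standard DOM API. The following Java sketch is one way to write it; the ElementTextSketch class and its parse helper are hypothetical and exist only to make the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ElementTextSketch {

    /** Returns the value of the first Text Node child, or null if there is none. */
    static String getElementText(Element sourceElement) {
        Node child = sourceElement.getFirstChild();
        while (child != null && child.getNodeType() != Node.TEXT_NODE) {
            child = child.getNextSibling(); // skip Comment Nodes and the like
        }
        return child == null ? null : child.getNodeValue();
    }

    /** Helper that parses a string so the sketch needs no input file. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element withComment = parse("<Name><!-- note -->ACME</Name>").getDocumentElement();
        Element empty = parse("<Name/>").getDocumentElement();
        System.out.println(getElementText(withComment)); // prints "ACME"
        System.out.println(getElementText(empty));       // prints "null"
    }
}
```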

getFieldValue

This method searches the DataCell Array for the cell corresponding to the passed field number and returns the contents of the field when found. It throws an exception or returns an error status if the requested field is not found in the Array.

Logic for the RecordHandler getFieldValue Method

Arguments:
  Integer Field Number
Returns:
  String Field Contents
DO for all Cells in DataCell Array from 0 through Highest Cell
  IF Field Number = CellArray's getFieldNumber
    Return CellArray's getField
  ENDIF
ENDDO
Return error
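To make the bottom-up loading approach concrete, here is a small Java sketch of a DataCell Array filled sequentially and then searched by field number. The DataCell class shown is a hypothetical, stripped-down stand-in that holds only a field number and its contents. The data type dispatch in createDataCell is omitted, and the array capacity is arbitrary.

```java
import java.util.NoSuchElementException;

class DataCellArraySketch {

    /** Hypothetical, stripped-down DataCell. */
    static class DataCell {
        final int fieldNumber;
        String contents = "";

        DataCell(int fieldNumber) {
            this.fieldNumber = fieldNumber;
        }
    }

    private final DataCell[] cellArray = new DataCell[100]; // arbitrary capacity
    private int highestCell = -1;

    /** Creates a cell and loads it into the next free entry, bottom up. */
    DataCell createDataCell(int fieldNumber) {
        DataCell cell = new DataCell(fieldNumber);
        highestCell++;
        cellArray[highestCell] = cell;
        return cell;
    }

    /** Linear search: the array index says nothing about the field number. */
    String getFieldValue(int fieldNumber) {
        for (int i = 0; i <= highestCell; i++) {
            if (cellArray[i].fieldNumber == fieldNumber) {
                return cellArray[i].contents;
            }
        }
        throw new NoSuchElementException("Field " + fieldNumber + " not found");
    }

    public static void main(String[] args) {
        DataCellArraySketch handler = new DataCellArraySketch();
        // Field 2 was empty in the input, so no cell is created for it.
        handler.createDataCell(1).contents = "ACME";
        handler.createDataCell(3).contents = "42";
        System.out.println(handler.getFieldValue(3)); // prints "42"
    }
}
```

Field 3 sits at index 1 here, which is why the lookup has to ask each cell for its field number instead of indexing the array directly.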

setDelimiter

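This method converts a delimiter passed as a string into a byte: a one-character string is returned as that character, and anything longer is treated as a hexadecimal string. Although this routine involves only a few lines of code, the functions are performed frequently.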

Logic for the RecordHandler setDelimiter Method

Arguments:
  String Delimiter
Returns:
  Byte Delimiter
IF length of Delimiter = 1
  Return Delimiter as byte
ENDIF
Return Delimiter converted to byte from hex string
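In Java the conversion might look like the sketch below. It assumes the hexadecimal form is written as bare hex digits with no prefix (for example "1D"); that format, like the DelimiterSketch class name, is an assumption.

```java
class DelimiterSketch {

    /** Converts a one-character literal or a hex string to a byte. */
    static byte setDelimiter(String delimiter) {
        if (delimiter.length() == 1) {
            return (byte) delimiter.charAt(0);
        }
        // Assumption: hex digits only, no "0x" or "x" prefix.
        return (byte) Integer.parseInt(delimiter, 16);
    }

    public static void main(String[] args) {
        System.out.println(setDelimiter(","));  // prints 44
        System.out.println(setDelimiter("1D")); // prints 29
    }
}
```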

setTerminator

This method sets the Record Terminators used when reading and writing variable length records. It accepts a Record Terminator string that is one of the following: U for UNIX-style line feed (x0A) record termination, W for a Windows-style carriage return and line feed pair (x0D x0A), another single literal character, or a two-character hexadecimal value that is converted to a byte. We don't need to consider any other cases in the code because schema validation ensures we'll get only these values.

Logic for the RecordHandler setTerminator Method

Arguments:
  String Terminator
Returns:
  Nothing
IF length of Terminator = 1
  DO CASE of Terminator
    'U':
      Record Terminator1 = line feed
      BREAK
    'W':
      Record Terminator1 = carriage return
      Record Terminator2 = line feed
      BREAK
    other:
      Record Terminator1 = Terminator
  ENDDO
ELSE
  Record Terminator1 = Terminator converted from hex
ENDIF
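The case logic translates naturally into a Java switch, as in this sketch. The TerminatorSketch class name is hypothetical. Resetting Record Terminator2 to zero at the start of each call is an addition that the logic above does not show; it keeps a stale second terminator from surviving a later call.

```java
class TerminatorSketch {
    byte recordTerminator1;
    byte recordTerminator2; // zero means "not used"

    /** Sets the terminators from U, W, a literal character, or two hex digits. */
    void setTerminator(String terminator) {
        recordTerminator2 = 0; // addition: clear any earlier second terminator
        if (terminator.length() == 1) {
            switch (terminator.charAt(0)) {
                case 'U': // UNIX-style line feed
                    recordTerminator1 = 0x0A;
                    break;
                case 'W': // Windows-style carriage return and line feed pair
                    recordTerminator1 = 0x0D;
                    recordTerminator2 = 0x0A;
                    break;
                default:  // any other literal character
                    recordTerminator1 = (byte) terminator.charAt(0);
            }
        } else {
            recordTerminator1 = (byte) Integer.parseInt(terminator, 16);
        }
    }

    public static void main(String[] args) {
        TerminatorSketch handler = new TerminatorSketch();
        handler.setTerminator("W");
        System.out.println(handler.recordTerminator1 + " " + handler.recordTerminator2);
        // prints "13 10"
    }
}
```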

RecordReader Base Class (Extends RecordHandler)

Overview

This class provides basic support for reading records from non-XML-formatted files and converting them to XML. Many of the methods are used in the derived classes, though some are overridden. We will add methods as we build the various utilities.

Attributes:

Input Stream

DOM Document Output Document

Methods:

Constructor

getRecordType (virtual)

parseRecord (virtual)

readRecord (virtual)

readRecordVariableLength

setInputStream

setOutputDocument

toXMLType

writeRecord

Methods

Constructor

The RecordReader has two very basic constructor functions. The first takes a single argument of file description document name and doesn't do anything but call the base RecordHandler constructor. The second version also saves the Input Stream, which is the second argument to the constructor.

readRecordVariableLength

This method reads a physical record, as a variable length record, from the input file. A bit of discussion on the approach is in order. There are a few options for reading variable length records from input files.

Native Java or C++ readLine method: This approach is suitable for many purposes and is very simple to implement. However, it won't work for files that have a terminator other than a line feed or carriage return and line feed pair, and I have been bitten by it a few times. In cases where a Windows file has line feeds embedded in the record in addition to the carriage return and line feed terminators, the records may be split when they shouldn't be. The other two approaches don't have this problem because we look for the terminators ourselves.

Block read with position index: This approach involves reading a fixed length buffer from the input file and keeping an index of the current position in the buffer. When a new record is requested, we search forward in the buffer from the current position until we encounter the terminator(s), then copy that substring to the returned input record. We then set the current position index to the next position beyond the returned record. This method is very efficient, but buffer and index management can get a bit complicated when logical records span physical blocks, which is usually the case.

Single character read: This approach involves reading one character at a time from the input stream, building the returned input record until we encounter the terminator(s). This is not quite as simple as readLine, but it is simpler than the block read. The main disadvantage is that the single character read is not as efficient as the other methods. However, we can ease the pain quite a bit by doing buffered reads. The Java API gives us a way to do this nicely. For now, the C++ implementation uses the default buffering provided by Visual C++. This can be modified if performance requires it by using the filebuf class and the ifstream setbuf method. Buffering still doesn't help us any with the stack and call overhead of a method call to get each byte from the input file. However, since we're going for KISS over performance, this is the route we'll take.

The code for the readRecordVariableLength method, though a bit tedious, is fairly straightforward. The pseudocode below gives the general idea. The Java and C++ implementations vary a bit in the details due to the language differences.

Logic for the RecordReader readRecordVariableLength Method

Arguments:
  None
Returns:
  Integer Record Length, or -1 at end of file
Clear Record Buffer
Record Buffer Length <- 0
IF Record Terminator2 = 0
  DO until Record Terminator1 is read from file
    Input Byte <- get next byte from Input Stream (language dependent)
    Append Byte to Record Buffer
    Increment Record Buffer Length
  ENDDO
ELSE
  DO until last two characters in Record Buffer are Record Terminator1 and Record Terminator2
    Input Byte <- get next byte from Input Stream (language dependent)
    Append Byte to Record Buffer
    Increment Record Buffer Length
  ENDDO
  Clear last two bytes in Record Buffer
  Subtract 2 from Record Buffer Length
ENDIF
Return Record Buffer Length, or -1 if end of file
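A minimal Java sketch of the single character read follows, using a BufferedInputStream for the buffered reads. It is an illustration under stated assumptions, not the book's implementation. The class name is hypothetical, and for the two-byte case it holds back a possible first terminator byte instead of appending and then clearing the last two bytes. It also returns a final record that has no terminator, which is a design choice of this sketch.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class VariableLengthReaderSketch {
    private final InputStream input;
    private final int terminator1;
    private final int terminator2; // zero means single-byte terminator
    private final ByteArrayOutputStream recordBuffer = new ByteArrayOutputStream();

    VariableLengthReaderSketch(InputStream input, byte terminator1, byte terminator2) {
        this.input = new BufferedInputStream(input);
        this.terminator1 = terminator1 & 0xFF;
        this.terminator2 = terminator2 & 0xFF;
    }

    /** Reads one record; returns its length, or -1 at end of file. */
    int readRecord() {
        recordBuffer.reset();
        boolean pendingTerminator1 = false;
        try {
            int b;
            while ((b = input.read()) != -1) {
                if (terminator2 == 0) {
                    if (b == terminator1) {
                        return recordBuffer.size();
                    }
                    recordBuffer.write(b);
                    continue;
                }
                if (pendingTerminator1) {
                    if (b == terminator2) {
                        return recordBuffer.size();
                    }
                    recordBuffer.write(terminator1); // it was data after all
                    pendingTerminator1 = false;
                }
                if (b == terminator1) {
                    pendingTerminator1 = true;
                } else {
                    recordBuffer.write(b);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (pendingTerminator1) {
            recordBuffer.write(terminator1);
        }
        // An unterminated final record is still returned; then -1 signals EOF.
        return recordBuffer.size() > 0 ? recordBuffer.size() : -1;
    }

    byte[] record() {
        return recordBuffer.toByteArray();
    }

    public static void main(String[] args) {
        // A Windows-style file whose first record has an embedded line feed.
        byte[] data = "AB\nCD\r\nEF\r\n".getBytes(StandardCharsets.US_ASCII);
        VariableLengthReaderSketch reader = new VariableLengthReaderSketch(
                new ByteArrayInputStream(data), (byte) 0x0D, (byte) 0x0A);
        for (int length = reader.readRecord(); length >= 0; length = reader.readRecord()) {
            System.out.println(length); // prints 5, then 2
        }
    }
}
```

The embedded line feed stays inside the first record because only the full carriage return and line feed pair ends a record. An empty record returns a length of zero, so the caller loops while the length is greater than or equal to zero.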

setInputStream

This method sets the class's Input Stream attribute to the passed input stream. We add this utility method to enable future enhancements such as processing all the files in an input directory rather than just a single input file.

setOutputDocument

This method sets the class's Output Document attribute to the passed value.

toXMLType

This method loops through the active entries in the DataCell Array and calls the toXML method of each.

Logic for the RecordReader toXMLType Method

Arguments:
  None
Returns:
  Status or throws exception
DO for all Cells in DataCell Array from 0 through Highest Cell
  Call DataCell Array entry's toXML method
ENDDO

writeRecord

This method writes a parent record Element and child Field Elements with contents from the DataCell Array to the Output Document. We're including it in the base RecordReader class because both the CSVRecordReader and FlatRecordReader use it.

Logic for the RecordReader writeRecord Method

Arguments:
  DOM Element Output Document Parent Element
  DOM Element Record Grammar Element
Returns:
  Status or throws exception
Element Name <- Call Grammar Element's getAttribute for "ElementName"
Record Element <- call Output Document's createElement
Parent Element <- call Parent's appendChild to append Record Element
DO for all DataCells in array up through Highest Cell
  Call toElement on DataCell to create Element, load the text, and attach it to the parent
  Clear Cell Array Entry
ENDDO
Highest Cell <- -1

Derived Class Methods

All the derived classes implement the following methods, which are declared as virtual in the base class. These are the areas in which the processing differs the most among the various derived classes.

getRecordType

This method returns the record tag from the input record.

Logic for the RecordReader getRecordType Method

Arguments:
  None
Returns:
  String Record Tag
Get Record Tag field location information from File Description Document if required
Parse record to locate field with the record tag
Return field value

parseRecord

This method examines the input non-XML record and stores the field contents in DataCell objects of the appropriate derived class in the DataCell Array. Here is the general processing flow. The exact logic for each of the derived classes will, of course, vary since the record formats themselves are quite different.

Logic for the RecordReader parseRecord Method

Arguments:
  DOM Element Record Grammar
Returns:
  Status or throws exception
DO until end of input record
  Increment Field Number
  Get characteristics of next field by getting next Field Grammar child of Record Grammar Element
  Create DataCell derived class corresponding to field data type
  Copy field contents from Input Record to DataCell buffer
  Set Highest Cell to Field Number
ENDDO

readRecord

This is a convenience method that provides a general interface to each derived class's routines for reading physical records. It is called from the base SourceConverter Class's processGroup method.

RecordWriter Base Class (Extends RecordHandler)

Overview

This class provides basic support for writing records from XML documents to non-XML-formatted files. As with the RecordReader, many of the methods are used in the derived classes, though some are overridden. Again, we will add methods as we build the various utilities.

Attributes:

Output Stream

Methods:

Constructor

parseRecord

writeRecord (virtual)

Methods

Constructor

Like the constructor for the RecordReader, the RecordWriter's constructor function doesn't do very much, either. The basic version of the constructor takes a single argument of the file description document, then passes it to the base class RecordHandler constructor. The other version takes a second argument of the Output Stream and saves it after calling the base class constructor.

parseRecord

This method retrieves the child field Element Nodes from a Record Element parent, storing the text contents in DataCell objects of the appropriate derived class in the DataCell Array. This base class method is used in both the CSVRecordWriter and FlatRecordWriter classes. A similar but somewhat more involved method is used in the EDIRecordWriter class.

Logic for the RecordWriter parseRecord Method

Arguments:
  DOM Element Record
  DOM Element Record Grammar
Returns:
  Void
Field Grammar <- Get first Grammar child Element from Record Grammar Element, skipping over non-Element nodes
Grammar Field Name <- call Field Grammar's getAttribute on "ElementName"
DO for all Field Elements that are children of Record Element
  Element Text <- call getElementText on Field Element
  IF Element Text is empty
    Proceed to next field
  ENDIF
  Field Name <- Field Element's tagName attribute
  DO UNTIL Field Name == Grammar Field Name or Field Grammar is null
    Field Grammar <- Get next sibling Field Grammar Element, skipping over non-Element Nodes
    Grammar Field Name <- call Field Grammar's getAttribute on "ElementName"
  ENDDO
  IF Field Grammar is null
    Return error
  ENDIF
  Field Number <- call Field Grammar's getAttribute on "Number"
  NewCell <- call createDataCell, passing Field Number and Field Grammar
  Call NewCell's putField, passing Element Text
ENDDO

Derived Class Method

Again, the following method is declared as virtual in the base class; specific methods are defined in the derived classes. Each has a method that follows the general logic described here.

writeRecord

This method writes the contents of the DataCell Array to the output legacy non-XML record.

Logic for the RecordWriter writeRecord Method

Arguments:
  None
Returns:
  Void
DO for all DataCells in Array
  Call fromXML to convert field from XML data type
  Call prepareOutput to handle justification, filling, etc.
  Copy cell contents to Record Buffer
  Clear Array entry for cell
ENDDO
Append appropriate terminators to Record Buffer
Call language's write routines to do physical write of Record Buffer

DataCell Base Class

Overview

This class is the base for all the classes we use to represent the various data types. With a few minor variations, for each specific data type that we support in a non-XML format we provide a derived class that handles converting that type to and from the corresponding Schema data type.

Attributes:

Byte or Character Array Cell Buffer

DOM Element Field Grammar

Integer Buffer Length

Integer Field Number

Integer Subfield Number

The Subfield Number attribute is used only for EDI, but we include it in the base class to make processing and class derivation a bit easier. This is a pragmatic approach, not a purist one!

Methods:

The most important methods are listed here. Other various minor methods set and get single class attributes or Attributes of the Grammar Element.

Constructor

fromXML (overridden in derived classes)

getField

prepareOutput (overridden in derived classes)

putByte

putField

toElement

toXML (overridden in derived classes)

trim (C++ only)

Methods

Constructor

The constructor method is very basic. It takes two arguments: the Field Number and the Field Grammar Element. It stores these passed arguments in the appropriate attributes of the class and initializes the other class attributes.

getField

This method has no arguments and returns the contents of the Cell Buffer and the value of the Buffer Length.

putByte

This method has a single argument of a byte of data. It appends the passed byte to the Cell Buffer.

putField

This method takes two arguments: a byte or char array of the contents of a legacy format field and the length of the field passed as an integer. It stores the passed field contents in the Cell Buffer.

toElement

This method creates a new Element named by the ElementName attribute of the cell's Field Grammar Element, adds text from the Cell Buffer, and attaches the Element to the passed parent Element. If the Cell Buffer is empty, no Element is created.

Logic for the DataCell toElement Method

Arguments:
  DOM Element Parent
  DOM Document Output Document
Returns:
  Void
IF Buffer Length = 0
  Return
ENDIF
Element Name <- Get "ElementName" Attribute from Grammar Element
New Element <- call Output Document's createElement method
Parent <- append New Element to Parent
Text Node <- call Output Document's createTextNode method, passing Cell Buffer contents
New Element <- append Text Node to New Element
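With the standard DOM API, the steps above might be written as in the Java sketch below. As a simplification, it takes the Element name and cell contents as arguments instead of reading them from the Grammar Element and the Cell Buffer. The ToElementSketch class and its newDocument helper are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class ToElementSketch {

    /** Appends <elementName>contents</elementName> to parent, unless contents is empty. */
    static void toElement(Element parent, Document outputDocument,
                          String elementName, String cellContents) {
        if (cellContents.isEmpty()) {
            return; // no Element is created for an empty cell
        }
        Element newElement = outputDocument.createElement(elementName);
        parent.appendChild(newElement);
        newElement.appendChild(outputDocument.createTextNode(cellContents));
    }

    /** Helper that creates an empty DOM Document. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("Cannot create DOM Document", e);
        }
    }

    public static void main(String[] args) {
        Document document = newDocument();
        Element record = document.createElement("Detail");
        document.appendChild(record);
        toElement(record, document, "Quantity", "12");
        toElement(record, document, "Note", ""); // skipped
        System.out.println(record.getChildNodes().getLength()); // prints 1
    }
}
```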

trim

This is a C++ function only since Java provides a trim method for its String class. It trims leading and trailing whitespace (all characters with an integer value less than or equal to the space character) from the Cell Buffer.

Derived Class Methods

As with the RecordHandler classes, these methods are overridden in the derived classes. Base versions with minimal functionality are provided here. None of these methods takes any arguments, and all either return a completion status or throw an exception.

fromXML

This method converts the buffer contents from the schema language data type format to the non-XML format. The base class version returns with no action and is appropriate for data types that require no conversion.

prepareOutput

After format conversion with fromXML, this routine performs additional formatting tasks that are required before the data can be written to the output. Specific operations are dependent on the data type. Typical tasks include space filling to the minimum field length, adding or removing leading zeroes, and truncating.
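As a rough illustration of these tasks, the Java sketch below space fills and truncates a text field and zero fills an unsigned numeric field. The method names, the choice of left justification for text and right justification for numbers, and the lack of sign handling are all assumptions; the derived classes decide the actual rules for each data type.

```java
class PrepareOutputSketch {

    /** Truncates to maxLength, then space fills on the right to minLength. */
    static String prepareAlpha(String value, int minLength, int maxLength) {
        String result = value.length() > maxLength ? value.substring(0, maxLength) : value;
        StringBuilder padded = new StringBuilder(result);
        while (padded.length() < minLength) {
            padded.append(' ');
        }
        return padded.toString();
    }

    /** Adds leading zeroes until the digits reach minLength (unsigned values only). */
    static String prepareNumeric(String digits, int minLength) {
        StringBuilder padded = new StringBuilder(digits);
        while (padded.length() < minLength) {
            padded.insert(0, '0');
        }
        return padded.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + prepareAlpha("ACME", 6, 10) + "]");         // prints "[ACME  ]"
        System.out.println("[" + prepareAlpha("ABCDEFGHIJKL", 6, 10) + "]"); // prints "[ABCDEFGHIJ]"
        System.out.println("[" + prepareNumeric("42", 5) + "]");             // prints "[00042]"
    }
}
```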

toXML

This method converts the buffer contents from the non-XML format to the format required by the corresponding schema language data type. This base class version returns with no action. It is appropriate for data types such as alphanumeric text or decimal numbers that require no conversion.