Oracle Development Unleashed (3rd Edition)


Page 1348

The result of the data-extraction process is usually a set of flat files that is transferred to the data warehouse system. I have found that OLTP and legacy systems usually deal best with a column-delimited flat file format, in which fixed column positions in the file correspond to the various pieces of information. The alternative is a character-delimited format, in which a character such as a semicolon separates the pieces of information. Note that on some computer systems, you might have to convert the data storage format of the flat file to transfer it to your data warehouse system. One of the more common conversions is translating the IBM mainframe EBCDIC character set to the ASCII format used on UNIX systems. The following is an example of a column-delimited output file:
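A column-delimited file such as this is typically loaded into Oracle with SQL*Loader, assigning each field a fixed byte position. The following is a minimal control-file sketch; the table name, column names, and byte positions are assumptions chosen to match the sample record, not values from the text:

```sql
-- SQL*Loader control file (sketch) for the fixed-width expense file.
-- Table dept_expenses and the POSITION ranges are illustrative assumptions.
LOAD DATA
INFILE 'expenses.dat'
INTO TABLE dept_expenses
( department    POSITION(01:12) CHAR,             -- for example, PURCHASING
  unit_cost     POSITION(13:18) DECIMAL EXTERNAL, -- per-unit cost
  cost_code     POSITION(19:21) INTEGER EXTERNAL, -- runs together with neighbors
  burdened_cost POSITION(22:27) DECIMAL EXTERNAL  -- burdened cost
)
```

Because the fields are located by byte position, the cost-code column can be parsed correctly even though it runs together with the numbers on either side of it in the raw file.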

PURCHASING 123.4534246.90 OPERATIONS 33.2056 66.40 . . .

This example simulates a download of expenditures by various departments. The departments occupy the first column of data (PURCHASING and OPERATIONS). The next item is some form of per-unit cost. The third item is some form of cost code; notice that this column runs together with the numbers on either side of it. This is common in column-delimited flat files; therefore, you have to map out the column assignments in writing with the people performing the data extraction and the people importing the data. The final column is some form of burdened cost. Notice also that column-delimited formats produce a number of blank spaces, which can be significant in some cases (for example, a large text field that is typically only 20 percent full). These blank spaces are valid characters that take up transmission time and data storage space during the transfer to the data warehouse. You might want to consider some form of character-delimited format if these spaces are a problem on your system.
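If the padding overhead is a problem, the same data can be extracted as a character-delimited file and loaded with a simpler control file. A sketch, again with hypothetical table and column names:

```sql
-- SQL*Loader control file (sketch) for a semicolon-delimited version
-- of the same expense data; names are illustrative assumptions.
LOAD DATA
INFILE 'expenses.csv'
INTO TABLE dept_expenses
FIELDS TERMINATED BY ';'
( department, unit_cost, cost_code, burdened_cost )
```

The trade-off is that a stray delimiter character inside a data value will corrupt the load, so the extraction process must guarantee the delimiter never appears in the data itself.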

The aggregation and summarization processes are similar, because both move data from one set of tables to another within your Oracle database. This is a nice, controlled environment over which you typically have complete design authority. As always, though, you'll need to deal with a few design considerations.

The following statement shows the basic format of the SQL used in an aggregation process. This example takes weekly expense information for a department and uses a cross-reference table (week_month) to aggregate the expenses into fiscal months:

Insert into monthly_expenses
       (month_number, department, expense_amount)
Select week_month.month_number,
       weekly_expenses.department,
       sum(weekly_expenses.expense_amount)
  from weekly_expenses, week_month
 where weekly_expenses.week_number = week_month.week_number
 group by week_month.month_number, weekly_expenses.department;

Purge Processes

I already covered some of the basics of the purge processes. You will need to consider the possibility of archiving before executing the purge processes (tapes or even writable CDs are relatively cheap these days). The purge process is another example of a process that is well suited to stored procedures: you typically execute a series of SQL statements that delete records based on values in the primary key.
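As a sketch of such a stored procedure (the table name and cutoff parameter are hypothetical, chosen to match the earlier aggregation example), a purge driven by a primary-key value might look like this:

```sql
-- Sketch of a purge stored procedure; weekly_expenses and the
-- week_number cutoff are illustrative assumptions.
CREATE OR REPLACE PROCEDURE purge_weekly_expenses
  (p_cutoff_week IN NUMBER)
AS
BEGIN
  -- Delete all detail rows older than the cutoff week,
  -- presumably after they have been archived.
  DELETE FROM weekly_expenses
   WHERE week_number < p_cutoff_week;
  COMMIT;
END purge_weekly_expenses;
/
```

Keeping the cutoff as a parameter lets the same procedure serve scheduled purges at whatever retention interval the warehouse requires.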


Star Join Schema

The star join schema is very popular in the data warehousing world. An example of this schema is a basic data table (referred to as a fact table) linked to the basic information items that users are likely to form their queries against (dimension tables for store and time, for example). This schema makes strong use of primary and foreign keys to ensure that your fact table contains no rows that cannot be referenced through the associated dimension tables.
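A minimal sketch of such a schema follows; the sales, store, and time table names and columns are assumptions for illustration:

```sql
-- Dimension tables: small, descriptive lookup tables.
CREATE TABLE store_dim (
  store_id   NUMBER       PRIMARY KEY,
  store_name VARCHAR2(40),
  region     VARCHAR2(20)
);

CREATE TABLE time_dim (
  time_id      NUMBER PRIMARY KEY,
  fiscal_month NUMBER,
  fiscal_year  NUMBER
);

-- Fact table: the foreign keys guarantee that every fact row
-- can be referenced through its dimension tables.
CREATE TABLE sales_fact (
  store_id     NUMBER NOT NULL REFERENCES store_dim (store_id),
  time_id      NUMBER NOT NULL REFERENCES time_dim (time_id),
  sales_amount NUMBER(12,2),
  PRIMARY KEY (store_id, time_id)
);
```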

STAR JOIN SCHEMA TIPS
The key to making the star join schema work is ensuring that you always run your queries against the dimension table first and then the fact table. This makes sense, because a dimension table may have only a few dozen rows, whereas the fact table may contain millions. Also, because the dimension tables make up part of the primary key or alternate indexes of the fact table, you will be performing an indexed search against the fact table if you have processed the dimension table records first. I have seen this method make the difference between queries that take hours to complete and those that take less than 10 seconds. You might have to use the hint option on your SQL statements to get the query to behave as you want it to, but it is worth the effort.
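One way to express this ordering is the ORDERED hint, which asks the optimizer to join tables in the sequence they appear in the FROM clause. A sketch against hypothetical store_dim, time_dim, and sales_fact tables:

```sql
-- Sketch: ORDERED joins in FROM-clause order, so the small dimension
-- tables are processed before the millions-of-rows fact table, and the
-- fact table is reached through an indexed search on its key columns.
-- Table and column names are illustrative assumptions.
SELECT /*+ ORDERED */
       t.fiscal_month, s.region, SUM(f.sales_amount)
  FROM time_dim   t,
       store_dim  s,
       sales_fact f          -- fact table listed last
 WHERE f.time_id  = t.time_id
   AND f.store_id = s.store_id
   AND t.fiscal_year = 1998
 GROUP BY t.fiscal_month, s.region;
```

If the optimizer already chooses this plan on its own, the hint is unnecessary; it is insurance for the cases where the default join order drives off the fact table first.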

Another trick to make the star join schema perform well is to cache your dimension tables into memory whenever possible. Memory access is much faster than disk access. Because most of the queries in a star join schema database involve one or more of the dimension tables, you will be using these tables frequently. Because you also process a large number of rows of information, though, these tables may rotate off the most recently used list in your SGA. You have the option, however, of designating certain tables to be cached in memory, which keeps them at the top of the list in the SGA.
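Designating a table for caching is a one-line change. A sketch, with hypothetical dimension-table names:

```sql
-- Ask Oracle to keep this small dimension table's blocks at the
-- most recently used end of the buffer cache after full table scans.
ALTER TABLE store_dim CACHE;

-- The attribute can also be set when the table is created.
CREATE TABLE time_dim (
  time_id      NUMBER PRIMARY KEY,
  fiscal_month NUMBER
) CACHE;
```

Reserve this treatment for genuinely small tables; caching a large table simply crowds everything else out of the SGA.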

A final trick for working with star join schemas is to take advantage of multithreaded queries on multiprocessor computers when possible. Because most of the queries in the star join schema involve finding rows in one or more dimension tables and then finding corresponding rows in a fact table, you really have a series of processes executing. The Oracle Parallel Query option can take advantage of this feature merely by designating the degree of parallelism in your query (or on the table itself or using a default value for the database). This trick helps you take advantage of all that money you spent on a large multiprocessor server.
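The degree of parallelism can be requested per statement or set as a default on the table. A sketch, again with hypothetical names:

```sql
-- Per-query: ask for four parallel query slaves on the fact-table scan.
SELECT /*+ PARALLEL(f, 4) */
       SUM(f.sales_amount)
  FROM sales_fact f;

-- Per-table default, used when a query supplies no hint.
ALTER TABLE sales_fact PARALLEL (DEGREE 4);
```

A reasonable starting point is a degree of parallelism no greater than the number of processors (or disk spindles) available, then tuning from there.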

