Oracle Development Unleashed (3rd Edition)


Page 1348

The result of the data-extraction process is usually a set of flat files that is transferred to the data warehouse system. I have found that OLTP and legacy systems usually deal best with a column-delimited flat file format, in which fixed column positions in the file correspond to the various pieces of information. The alternative is a character-delimited format, in which a character such as a semicolon separates the pieces of information. Note that on some computer systems, you might have to convert the data storage format of the flat file to transfer it to your data warehouse system. One of the more common conversions is translating the IBM mainframe EBCDIC character set to the ASCII format used on UNIX systems. The following is an example of a column-delimited output file:
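A column-delimited file such as this is typically loaded into Oracle with SQL*Loader, assigning each field a fixed byte position. The following is a minimal control-file sketch; the table name, column names, and byte positions are assumptions chosen to match the sample record, not values from the text:

```sql
-- SQL*Loader control file (sketch) for the fixed-width expense file.
-- Table dept_expenses and the POSITION ranges are illustrative assumptions.
LOAD DATA
INFILE 'expenses.dat'
INTO TABLE dept_expenses
( department    POSITION(01:12) CHAR,             -- for example, PURCHASING
  unit_cost     POSITION(13:18) DECIMAL EXTERNAL, -- per-unit cost
  cost_code     POSITION(19:21) INTEGER EXTERNAL, -- runs together with neighbors
  burdened_cost POSITION(22:27) DECIMAL EXTERNAL  -- burdened cost
)
```

Because the fields are located by byte position, the cost-code column can be parsed correctly even though it runs together with the numbers on either side of it in the raw file.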

PURCHASING 123.4534246.90 OPERATIONS 33.2056 66.40 . . .

This example simulates a download of expenditures by various departments. The departments occupy the first column of data (PURCHASING and OPERATIONS). The next item is some form of per-unit cost. The third item is some form of cost code; notice that this column runs together with the numbers on either side of it. This is common in column-delimited flat files; therefore, you have to map out the column assignments in writing with the people performing the data extraction and the people importing the data. The final column is some form of burdened cost. Notice also that column-delimited formats produce a number of blank spaces, which can be significant in some cases (for example, a large text field that is typically only 20 percent full). These blank spaces are valid characters that take up transmission time and data storage space during the transfer to the data warehouse. You might want to consider some form of character-delimited format if these spaces are a problem on your system.
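If the padding overhead is a problem, the same data can be extracted as a character-delimited file and loaded with a simpler control file. A sketch, again with hypothetical table and column names:

```sql
-- SQL*Loader control file (sketch) for a semicolon-delimited version
-- of the same expense data; names are illustrative assumptions.
LOAD DATA
INFILE 'expenses.csv'
INTO TABLE dept_expenses
FIELDS TERMINATED BY ';'
( department, unit_cost, cost_code, burdened_cost )
```

The trade-off is that a stray delimiter character inside a data value will corrupt the load, so the extraction process must guarantee the delimiter never appears in the data itself.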

The aggregation and summarization processes are similar, because both move data from one set of tables to another within your Oracle database. This is a nice, controlled environment over which you typically have complete design authority. As always, though, you'll need to deal with a few design considerations.

The following statement shows the basic format of the SQL used in an aggregation process. This example takes weekly expense information for a department and uses a cross-reference table (week_month) to aggregate the expenses into fiscal months:

Insert into monthly_expenses
       (month_number, department, expense_amount)
Select week_month.month_number,
       weekly_expenses.department,
       sum(weekly_expenses.expense_amount)
  from weekly_expenses, week_month
 where weekly_expenses.week_number = week_month.week_number
 group by week_month.month_number, weekly_expenses.department;

Purge Processes

I already covered some of the basics of the purge processes. You will need to consider the possibility of archiving before executing the purge processes (tapes or even writable CDs are relatively cheap these days). The purge process is another example of a process that is well suited to stored procedures: you typically execute a series of SQL statements that delete records based on values in the primary key.
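As a sketch of such a stored procedure (the table name and cutoff parameter are hypothetical, chosen to match the earlier aggregation example), a purge driven by a primary-key value might look like this:

```sql
-- Sketch of a purge stored procedure; weekly_expenses and the
-- week_number cutoff are illustrative assumptions.
CREATE OR REPLACE PROCEDURE purge_weekly_expenses
  (p_cutoff_week IN NUMBER)
AS
BEGIN
  -- Delete all detail rows older than the cutoff week,
  -- presumably after they have been archived.
  DELETE FROM weekly_expenses
   WHERE week_number < p_cutoff_week;
  COMMIT;
END purge_weekly_expenses;
/
```

Keeping the cutoff as a parameter lets the same procedure serve scheduled purges at whatever retention interval the warehouse requires.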


Star Join Schema

The star join schema is very popular in the data warehousing world. An example of this schema is a basic data table (referred to as a fact table) linked to the basic information items that users are likely to form their queries against (dimension tables for store and time, for example). This schema makes strong use of primary and foreign keys to ensure that your fact table contains no rows that cannot be referenced through the associated dimension tables.
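A minimal sketch of such a schema follows; the sales, store, and time table names and columns are assumptions for illustration:

```sql
-- Dimension tables: small, descriptive lookup tables.
CREATE TABLE store_dim (
  store_id   NUMBER       PRIMARY KEY,
  store_name VARCHAR2(40),
  region     VARCHAR2(20)
);

CREATE TABLE time_dim (
  time_id      NUMBER PRIMARY KEY,
  fiscal_month NUMBER,
  fiscal_year  NUMBER
);

-- Fact table: the foreign keys guarantee that every fact row
-- can be referenced through its dimension tables.
CREATE TABLE sales_fact (
  store_id     NUMBER NOT NULL REFERENCES store_dim (store_id),
  time_id      NUMBER NOT NULL REFERENCES time_dim (time_id),
  sales_amount NUMBER(12,2),
  PRIMARY KEY (store_id, time_id)
);
```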

STAR JOIN SCHEMA TIPS
The key to making the star join schema work is ensuring that you always run your queries against the dimension table first and then the fact table. This makes sense, because a dimension table may have only a few dozen rows, whereas the fact table may contain millions. Also, because the dimension tables make up part of the primary key or alternate indexes of the fact table, you will be performing an indexed search against the fact table if you have processed the dimension table records first. I have seen this method make the difference between queries that take hours to complete and those that take less than 10 seconds. You might have to use the hint option on your SQL statements to get the query to behave as you want it to, but it is worth the effort.
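One way to express this ordering is the ORDERED hint, which asks the optimizer to join tables in the sequence they appear in the FROM clause. A sketch against hypothetical store_dim, time_dim, and sales_fact tables:

```sql
-- Sketch: ORDERED joins in FROM-clause order, so the small dimension
-- tables are processed before the millions-of-rows fact table, and the
-- fact table is reached through an indexed search on its key columns.
-- Table and column names are illustrative assumptions.
SELECT /*+ ORDERED */
       t.fiscal_month, s.region, SUM(f.sales_amount)
  FROM time_dim   t,
       store_dim  s,
       sales_fact f          -- fact table listed last
 WHERE f.time_id  = t.time_id
   AND f.store_id = s.store_id
   AND t.fiscal_year = 1998
 GROUP BY t.fiscal_month, s.region;
```

If the optimizer already chooses this plan on its own, the hint is unnecessary; it is insurance for the cases where the default join order drives off the fact table first.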

Another trick to make the star join schema perform well is to cache your dimension tables into memory whenever possible. Memory access is much faster than disk access. Because most of the queries in a star join schema database involve one or more of the dimension tables, you will be using these tables frequently. Because you also process a large number of rows of information, though, these tables may rotate off the most recently used list in your SGA. You have the option, however, of designating certain tables to be cached in memory, which keeps them at the top of the list in the SGA.
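Designating a table for caching is a one-line change. A sketch, with hypothetical dimension-table names:

```sql
-- Ask Oracle to keep this small dimension table's blocks at the
-- most recently used end of the buffer cache after full table scans.
ALTER TABLE store_dim CACHE;

-- The attribute can also be set when the table is created.
CREATE TABLE time_dim (
  time_id      NUMBER PRIMARY KEY,
  fiscal_month NUMBER
) CACHE;
```

Reserve this treatment for genuinely small tables; caching a large table simply crowds everything else out of the SGA.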

A final trick for working with star join schemas is to take advantage of multithreaded queries on multiprocessor computers when possible. Because most of the queries in the star join schema involve finding rows in one or more dimension tables and then finding corresponding rows in a fact table, you really have a series of processes executing. The Oracle Parallel Query option can take advantage of this feature merely by designating the degree of parallelism in your query (or on the table itself or using a default value for the database). This trick helps you take advantage of all that money you spent on a large multiprocessor server.
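The degree of parallelism can be requested per statement or set as a default on the table. A sketch, again with hypothetical names:

```sql
-- Per-query: ask for four parallel query slaves on the fact-table scan.
SELECT /*+ PARALLEL(f, 4) */
       SUM(f.sales_amount)
  FROM sales_fact f;

-- Per-table default, used when a query supplies no hint.
ALTER TABLE sales_fact PARALLEL (DEGREE 4);
```

A reasonable starting point is a degree of parallelism no greater than the number of processors (or disk spindles) available, then tuning from there.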

