The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset

Loading the DW/BI System in Real Time

It's time to discuss how to load the DW/BI system in real time. The problems with loading the relational data warehouse database are design issues, not technology issues. There's no technical reason that you couldn't trickle-feed your relational data warehouse database every hour, even every minute, possibly even faster. Integration Services has lots of features that make this possible, even easy. The problem, as we've already discussed, is meeting a diverse set of requirements with a single integrated system.

The Integrated Approach

The most intellectually appealing solution is to integrate the real-time data into the DW/BI system. This means processing dimension changes in real time: updating Type 1 attributes in place for all conformed dimensions, and correctly handling all Type 2 attribute changes for all conformed dimensions.
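To make the distinction concrete, here is a minimal T-SQL sketch of the two kinds of dimension maintenance. The dw.DimCustomer table, the staging.CustomerChanges table, and all column names are our own illustrative assumptions, not schemas from this book; the pattern, not the names, is the point.

    -- Type 1: overwrite the attribute in place, on every row for the member
    -- (history rows included), so all versions show the corrected value
    UPDATE d
    SET    d.EmailAddress = s.EmailAddress          -- a Type 1 attribute
    FROM   dw.DimCustomer AS d
    JOIN   staging.CustomerChanges AS s
           ON d.CustomerBKey = s.CustomerBKey;

    -- Type 2: insert a new version row for members whose tracked attribute
    -- changed...
    INSERT INTO dw.DimCustomer
           (CustomerBKey, EmailAddress, CustomerSegment,
            RowStartDate, RowEndDate, IsRowCurrent)
    SELECT s.CustomerBKey, s.EmailAddress, s.CustomerSegment,
           GETDATE(), '9999-12-31', 1
    FROM   staging.CustomerChanges AS s
    JOIN   dw.DimCustomer AS d
           ON  d.CustomerBKey = s.CustomerBKey
           AND d.IsRowCurrent = 1
    WHERE  d.CustomerSegment <> s.CustomerSegment;  -- a Type 2 attribute

    -- ...then expire the old current row; the new row's segment matches the
    -- staging row, so only the old version satisfies this predicate
    UPDATE d
    SET    d.RowEndDate = GETDATE(),
           d.IsRowCurrent = 0
    FROM   dw.DimCustomer AS d
    JOIN   staging.CustomerChanges AS s
           ON d.CustomerBKey = s.CustomerBKey
    WHERE  d.IsRowCurrent = 1
      AND  d.CustomerSegment <> s.CustomerSegment;

In an Integration Services package this logic is typically handled by the Slowly Changing Dimension transform rather than hand-written SQL; the SQL simply shows what the transform must accomplish, for every conformed dimension, every time the packages run.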

In this approach, the Integration Services packages are very similar to what you may already have designed for daily processing. In the simplest case, you'd simply run those packages more often. That over-simplifies the situation, but the main point is that the real-time packages are not wildly different from the packages for daily processing. Here are some differences and issues to watch out for:

The Real-Time Layer

If you need to integrate and transform data in real time, you should use Integration Services. We recommend that you set up a separate database to hold the real-time layer. Use Integration Services to populate this real-time database intraday. The database should store data for the current day only.
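As a minimal sketch of that setup (the RealTimeDB database name and the rt schema are our own inventions, reused in the examples later in this section):

    -- Separate database for the real-time layer; it holds the current
    -- day's data only and is emptied as part of the nightly processing
    CREATE DATABASE RealTimeDB;
    GO
    USE RealTimeDB;
    GO
    -- A dedicated schema keeps the real-time tables clearly labeled
    CREATE SCHEMA rt;
    GO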

You can call this real-time database whatever you want, except please don't call it a data warehouse! Operational data store may be the best term, although this term (like data mart) has been used and misused so often that it triggers a violent reaction in some people. We'll stick with the less controversial term real-time layer.

Even if you don't have a compelling need to query integrated data in real time, you may still develop a real-time layer. The database we describe in this section is very similar to the relational data warehouse database. By populating it in real time, you can spread much of the ETL burden over 24 hours and reduce the time required to perform the DW/BI system's daily update.

WHY USE BUSINESS KEYS IN THE REAL-TIME LAYER?

The reason we recommend that the real-time layer use business keys instead of surrogate keys is to avoid updating the enterprise conformed dimensions throughout the day. As we described previously, there are two reasons you would want to avoid updating the conformed dimensions:

 

Data Model for the Real-Time Layer

The real-time layer should contain tables that are structured similarly to the relational data warehouse database, except that these tables do not contain surrogate keys. Start with the fact tables that you plan to track in real time. These fact tables should look just like the corresponding fact tables in the data warehouse database, with the same columns in the same order. The exceptions are vital:

Use a similar approach for the design of the dimension tables in the real-time database. First, you may want to include only those dimensions that are used by the facts you're tracking in real time. Create those dimension tables so they look just like the main dimension tables, but without the surrogate primary key or SCD-2 tracking columns like RowStartDate. Make sure you keep a datetime stamp for when the dimension member was added or changed.
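As a concrete illustration of these rules, here is a minimal DDL sketch for one real-time fact table and its companion dimension. The orders subject area, and every table and column name, are our own hypothetical examples:

    -- Real-time fact table: same columns as the warehouse fact table, but
    -- business keys (OrderNumber, CustomerBKey) stand in for surrogate keys
    CREATE TABLE rt.FactOrders (
        OrderNumber    int           NOT NULL,  -- business key from the source
        CustomerBKey   nvarchar(20)  NOT NULL,  -- business key, not CustomerKey
        OrderDate      datetime      NOT NULL,
        OrderAmount    money         NOT NULL,
        InsertDatetime datetime      NOT NULL DEFAULT GETDATE()
    );

    -- Real-time dimension: no surrogate primary key and no SCD-2 tracking
    -- columns, but a datetime stamp for when the member was added or changed
    CREATE TABLE rt.DimCustomer (
        CustomerBKey        nvarchar(20)  NOT NULL PRIMARY KEY,
        CustomerName        nvarchar(100) NOT NULL,
        CustomerSegment     nvarchar(30)  NOT NULL,
        LastChangedDatetime datetime      NOT NULL DEFAULT GETDATE()
    );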

Tip 

Almost all real-time fact tables are transaction-grain facts. Snapshot facts, and particularly accumulating snapshot fact tables, are nearly impossible to maintain in real time. Stick to the underlying transaction grain.

Processing the Real-Time Data

In this section we briefly describe how to process the real-time data. There are four steps:

  1. Start the day right. Start each day by moving the real-time data you collected yesterday into the data warehouse database and emptying out the tables in the real-time database. The easiest way to do this is to have two copies of the real-time database structure, and flip-flop between them at midnight.

  2. Collect new rows and updates. As transactions occur in the source system, you need to collect them, run them through Integration Services packages, and write them to the real-time database. Use the techniques we described earlier in this chapter for running a package frequently or continuously. Collect all new facts, new dimension members, and changes to dimension members. For dimensions, continuously update all attributes during the day in the real-time database, without worrying about whether those attributes are Type 1 or Type 2.

    Tip 

    If business users are querying the real-time database, you need to maintain referential integrity between facts and dimensions. This means you'd need to institute complex logic for grabbing a dimension member from the data warehouse database if you haven't seen a change for that dimension member yet today. You would do better to copy the entire start-of-day dimension to the real-time database before you begin working on today's transactions. Make sure the real-time copy of the dimension has enough information that you can easily find any rows you've touched today.

  3. Perform end-of-day processing. At the end of the day, the real-time database contains all the new dimension members that were added today, as well as the current end-of-day state of any dimension member that experienced an update to an attribute. All new facts are stored in the fact table. All these tables have their business keys, but other than that they're clean and happy. In most cases, end-of-day processing means processing the dimensions to integrate the changes into the data warehouse dimension table. The end-of-day processing package for a dimension consists of little more than the slowly changing dimension transform that we described in Chapter 6. Similarly, the end-of-day processing package for a fact table consists of the surrogate key lookup pipeline from Chapter 6. This processing should be fast; a sketch of the fact table's lookup step appears just after this list.

  4. Clean up. It's a good idea to back up all these transactions. The easiest way to do this is to back up the entire database, which after all contains only today's data. Then clear out all the data, because you'll be using this database tomorrow.
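To ground step 3, here is a minimal T-SQL sketch of the fact table's end-of-day surrogate key lookup, continuing the hypothetical orders tables from earlier. In a real system this is the lookup-based Integration Services pipeline from Chapter 6; the warehouse column names and the date-key derivation shown here are our assumptions:

    -- Dimension changes have already been integrated, so every business key
    -- in today's facts should resolve to a current surrogate key
    INSERT INTO dw.FactOrders
           (OrderNumber, CustomerKey, OrderDateKey, OrderAmount)
    SELECT f.OrderNumber,
           d.CustomerKey,                                   -- surrogate key
           CAST(CONVERT(char(8), f.OrderDate, 112) AS int), -- yyyymmdd key
           f.OrderAmount
    FROM   rt.FactOrders AS f
    JOIN   dw.DimCustomer AS d
           ON  d.CustomerBKey = f.CustomerBKey
           AND d.IsRowCurrent = 1;

A business key that fails this join indicates the dimension processing missed a member, so in practice you would route unmatched rows to an error table rather than let them drop silently.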

Querying the Real-Time Layer

You've populated your real-time database with today's data. As the day goes on, the database is updated, perhaps hourly, perhaps more often. How will users query that data?

The vast majority of real-time data usage comes through reports and applications rather than ad hoc queries. Most analysts aren't very interested in real-time data. Most people who are excited about real-time data are in operational roles, which means they are report consumers.

A report on only today's data is straightforward. The problem comes when you need to combine today's data with some historical data. This is a little challenging because the real-time database has business keys, while the data warehouse database uses surrogate keys.

Your team needs to define reports and predefined queries that stitch together the real-time database and the data warehouse database. This may be as simple as a query that UNIONs today's data with history, or you may need to join the result sets. Today's data is always queried with the current set of attributes. Depending on the report's business requirements, the historical data will include either the current attributes from the real-time partition or the historical attributes from the data warehouse. Given the nature of operational information needs, it's more likely that you'll use the current image of the dimension for the entire report.
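Here is a sketch of such a stitched query, continuing the hypothetical orders tables: history joins on the surrogate key, today joins on the business key, and both halves report against the current image of the customer dimension. The report (or an outer query) would sum the two rows per segment together:

    -- History: surrogate key join, rolled up to the CURRENT customer
    -- attributes by re-joining the dimension on the business key
    SELECT d.CustomerSegment, SUM(f.OrderAmount) AS OrderAmount
    FROM   dw.FactOrders AS f
    JOIN   dw.DimCustomer AS hist ON hist.CustomerKey = f.CustomerKey
    JOIN   dw.DimCustomer AS d
           ON  d.CustomerBKey = hist.CustomerBKey
           AND d.IsRowCurrent = 1
    GROUP BY d.CustomerSegment

    UNION ALL

    -- Today: business key join against the real-time layer
    SELECT d.CustomerSegment, SUM(f.OrderAmount) AS OrderAmount
    FROM   rt.FactOrders AS f
    JOIN   rt.DimCustomer AS d ON d.CustomerBKey = f.CustomerBKey
    GROUP BY d.CustomerSegment;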

If the real-time database is on the same server as the data warehouse database, a SQL expert can usually write a SQL statement to perform this integration. If the databases are on different servers, you could use a SQL Server distributed query. Or, try sourcing the report from an Integration Services package, as we discussed previously in this chapter.
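For the different-servers case, a distributed query can reach the real-time layer through a linked server with four-part names. A minimal sketch, assuming a linked server named RTSERVER pointing at the real-time machine has already been configured:

    -- Today's slice pulled across a linked server (distributed query);
    -- four-part names are server.database.schema.table
    SELECT d.CustomerSegment, SUM(f.OrderAmount) AS OrderAmount
    FROM   RTSERVER.RealTimeDB.rt.FactOrders AS f
    JOIN   RTSERVER.RealTimeDB.rt.DimCustomer AS d
           ON d.CustomerBKey = f.CustomerBKey
    GROUP BY d.CustomerSegment;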
