The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset

Loading the DW/BI System in Real Time

It's time to discuss how to load the DW/BI system in real time. The problems with loading the relational data warehouse database are design issues, not technology issues. There's no technical reason that you couldn't trickle-feed your relational data warehouse database every hour, even every minute, possibly even faster. Integration Services has lots of features that make this possible, even easy. The problem, as we've already discussed, is meeting a diverse set of requirements with a single integrated system.

The Integrated Approach

The most intellectually appealing solution is to integrate the real-time data into the DW/BI system. This means processing dimension changes in real time: updating Type 1 attributes in place for all conformed dimensions, and correctly handling all Type 2 attribute changes for all conformed dimensions.
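To make the distinction concrete, here is a minimal T-SQL sketch of the two kinds of dimension maintenance. The dw.DimCustomer table, the staging.CustomerChanges table, and all column names are our own illustrative assumptions, not schemas from this book; the pattern, not the names, is the point.

    -- Type 1: overwrite the attribute in place, on every row for the member
    -- (history rows included), so all versions show the corrected value
    UPDATE d
    SET    d.EmailAddress = s.EmailAddress          -- a Type 1 attribute
    FROM   dw.DimCustomer AS d
    JOIN   staging.CustomerChanges AS s
           ON d.CustomerBKey = s.CustomerBKey;

    -- Type 2: insert a new version row for members whose tracked attribute
    -- changed...
    INSERT INTO dw.DimCustomer
           (CustomerBKey, EmailAddress, CustomerSegment,
            RowStartDate, RowEndDate, IsRowCurrent)
    SELECT s.CustomerBKey, s.EmailAddress, s.CustomerSegment,
           GETDATE(), '9999-12-31', 1
    FROM   staging.CustomerChanges AS s
    JOIN   dw.DimCustomer AS d
           ON  d.CustomerBKey = s.CustomerBKey
           AND d.IsRowCurrent = 1
    WHERE  d.CustomerSegment <> s.CustomerSegment;  -- a Type 2 attribute

    -- ...then expire the old current row; the new row's segment matches the
    -- staging row, so only the old version satisfies this predicate
    UPDATE d
    SET    d.RowEndDate = GETDATE(),
           d.IsRowCurrent = 0
    FROM   dw.DimCustomer AS d
    JOIN   staging.CustomerChanges AS s
           ON d.CustomerBKey = s.CustomerBKey
    WHERE  d.IsRowCurrent = 1
      AND  d.CustomerSegment <> s.CustomerSegment;

In an Integration Services package this logic is typically handled by the Slowly Changing Dimension transform rather than hand-written SQL; the SQL simply shows what the transform must accomplish, for every conformed dimension, every time the packages run.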

In this approach, the Integration Services packages are very similar to what you may already have designed for daily processing. In the simplest case, you'd simply run those packages more often. That over-simplifies the situation, but the main point is that the real-time packages are not wildly different from the packages for daily processing. Here are some differences and issues to watch out for:

The Real-Time Layer

If you need to integrate and transform data in real time, you should use Integration Services. We recommend that you set up a separate database to hold the real-time layer. Use Integration Services to populate this real-time database intraday. The database should store data for the current day only.
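As a minimal sketch of that setup (the RealTimeDB database name and the rt schema are our own inventions, reused in the examples later in this section):

    -- Separate database for the real-time layer; it holds the current
    -- day's data only and is emptied as part of the nightly processing
    CREATE DATABASE RealTimeDB;
    GO
    USE RealTimeDB;
    GO
    -- A dedicated schema keeps the real-time tables clearly labeled
    CREATE SCHEMA rt;
    GO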

You can call this real-time database whatever you want, except please don't call it a data warehouse! Operational data store may be the best term, although this term (like data mart) has been used and misused so often that it triggers a violent reaction in some people. We'll stick with the less controversial term real-time layer.

Even if you don't have a compelling need to query integrated data in real time, you may still develop a real-time layer. The database we describe in this section is very similar to the relational data warehouse database. By populating it in real time, you can spread much of the ETL burden over 24 hours and reduce the time required to perform the DW/BI system's daily update.

WHY USE BUSINESS KEYS IN THE REAL-TIME LAYER?

The reason we recommend that the real-time layer use business keys instead of surrogate keys is to avoid updating the enterprise conformed dimensions throughout the day. As we described previously, there are two reasons you would want to avoid updating the conformed dimensions:

 

Data Model for the Real-Time Layer

The real-time layer should contain tables that are structured similarly to the relational data warehouse database, except that these tables do not contain surrogate keys. Start with the fact tables that you plan to track in real time. These fact tables should look just like the corresponding fact tables in the data warehouse database, with the same columns in the same order. The exceptions are vital:

Use a similar approach for the design of the dimension tables in the real-time database. First, you may want to include only those dimensions that are used by the facts you're tracking in real time. Create those dimension tables so they look just like the main dimension tables, but without the surrogate primary key or SCD-2 tracking columns like RowStartDate. Make sure you keep a datetime stamp for when the dimension member was added or changed.
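As a concrete illustration of these rules, here is a minimal DDL sketch for one real-time fact table and its companion dimension. The orders subject area, and every table and column name, are our own hypothetical examples:

    -- Real-time fact table: same columns as the warehouse fact table, but
    -- business keys (OrderNumber, CustomerBKey) stand in for surrogate keys
    CREATE TABLE rt.FactOrders (
        OrderNumber    int           NOT NULL,  -- business key from the source
        CustomerBKey   nvarchar(20)  NOT NULL,  -- business key, not CustomerKey
        OrderDate      datetime      NOT NULL,
        OrderAmount    money         NOT NULL,
        InsertDatetime datetime      NOT NULL DEFAULT GETDATE()
    );

    -- Real-time dimension: no surrogate primary key and no SCD-2 tracking
    -- columns, but a datetime stamp for when the member was added or changed
    CREATE TABLE rt.DimCustomer (
        CustomerBKey        nvarchar(20)  NOT NULL PRIMARY KEY,
        CustomerName        nvarchar(100) NOT NULL,
        CustomerSegment     nvarchar(30)  NOT NULL,
        LastChangedDatetime datetime      NOT NULL DEFAULT GETDATE()
    );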

Tip 

Almost all real-time fact tables are transaction-grain facts. Snapshot facts, and particularly accumulating snapshot fact tables, are nearly impossible to maintain in real time. Stick to the underlying transaction grain.

Processing the Real-Time Data

In this section we briefly describe how to process the real-time data. There are four steps:

  1. Start the day right. Start each day by moving the real-time data you collected yesterday into the data warehouse database and emptying out the tables in the real-time database. The easiest way to do this is to have two copies of the real-time database structure, and flip-flop between them at midnight.

  2. Collect new rows and updates. As transactions occur in the source system, you need to collect them, run them through Integration Services packages, and write them to the real-time database. Use the techniques we described earlier in this chapter for running a package frequently or continuously. Collect all new facts, new dimension members, and changes to dimension members. For dimensions, continuously update all attributes during the day in the real-time database, without worrying about whether those attributes are Type 1 or Type 2.

    Tip 

    If business users are querying the real-time database, you need to maintain referential integrity between facts and dimensions. This means you'd need to institute complex logic for grabbing a dimension member from the data warehouse database if you haven't seen a change for that dimension member yet today. You would do better to copy the entire start-of-day dimension to the real-time database before you begin working on today's transactions. Make sure the real-time copy of the dimension has enough information that you can easily find any rows you've touched today.

  3. Perform end-of-day processing. At the end of the day, the real-time database contains all the new dimension members that were added today, as well as the current end-of-day state of any dimension member that experienced an update to an attribute. All new facts are stored in the fact table. All these tables have their business keys, but other than that they're clean and happy. In most cases, end-of-day processing means processing the dimensions to integrate the changes into the data warehouse dimension table. The end-of-day processing package for a dimension consists of little more than the slowly changing dimension transform that we described in Chapter 6. Similarly, the end-of-day processing package for a fact table consists of the surrogate key lookup pipeline from Chapter 6. This processing should be fast; a sketch of the fact table's lookup step appears just after this list.

  4. Clean up. It's a good idea to back up all these transactions. The easiest way to do this is to back up the entire database, which after all contains only today's data. Then clear out all the data, because you'll be using this database tomorrow.
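To ground step 3, here is a minimal T-SQL sketch of the fact table's end-of-day surrogate key lookup, continuing the hypothetical orders tables from earlier. In a real system this is the lookup-based Integration Services pipeline from Chapter 6; the warehouse column names and the date-key derivation shown here are our assumptions:

    -- Dimension changes have already been integrated, so every business key
    -- in today's facts should resolve to a current surrogate key
    INSERT INTO dw.FactOrders
           (OrderNumber, CustomerKey, OrderDateKey, OrderAmount)
    SELECT f.OrderNumber,
           d.CustomerKey,                                   -- surrogate key
           CAST(CONVERT(char(8), f.OrderDate, 112) AS int), -- yyyymmdd key
           f.OrderAmount
    FROM   rt.FactOrders AS f
    JOIN   dw.DimCustomer AS d
           ON  d.CustomerBKey = f.CustomerBKey
           AND d.IsRowCurrent = 1;

A business key that fails this join indicates the dimension processing missed a member, so in practice you would route unmatched rows to an error table rather than let them drop silently.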

Querying the Real-Time Layer

You've populated your real-time database with today's data. As the day goes on, the database is updated, perhaps hourly, perhaps more often. How will users query that data?

The vast majority of real-time data usage comes through reports and applications rather than ad hoc queries. Most analysts aren't very interested in real-time data. Most people who are excited about real-time data are in operational roles, which means they are report consumers.

A report on only today's data is straightforward. The problem comes when you need to combine today's data with some historical data. This is a little challenging because the real-time database has business keys, while the data warehouse database uses surrogate keys.

Your team needs to define reports and predefined queries that stitch together the real-time database and the data warehouse database. This may be as simple as a query that UNIONs today's data with history, or you may need to join the result sets. Today's data is always queried with the current set of attributes. Depending on the report's business requirements, the historical data will include either the current attributes from the real-time partition or the historical attributes from the data warehouse. Given the nature of operational information needs, it's more likely that you'll use the current image of the dimension for the entire report.
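Here is a sketch of such a stitched query, continuing the hypothetical orders tables: history joins on the surrogate key, today joins on the business key, and both halves report against the current image of the customer dimension. The report (or an outer query) would sum the two rows per segment together:

    -- History: surrogate key join, rolled up to the CURRENT customer
    -- attributes by re-joining the dimension on the business key
    SELECT d.CustomerSegment, SUM(f.OrderAmount) AS OrderAmount
    FROM   dw.FactOrders AS f
    JOIN   dw.DimCustomer AS hist ON hist.CustomerKey = f.CustomerKey
    JOIN   dw.DimCustomer AS d
           ON  d.CustomerBKey = hist.CustomerBKey
           AND d.IsRowCurrent = 1
    GROUP BY d.CustomerSegment

    UNION ALL

    -- Today: business key join against the real-time layer
    SELECT d.CustomerSegment, SUM(f.OrderAmount) AS OrderAmount
    FROM   rt.FactOrders AS f
    JOIN   rt.DimCustomer AS d ON d.CustomerBKey = f.CustomerBKey
    GROUP BY d.CustomerSegment;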

If the real-time database is on the same server as the data warehouse database, a SQL expert can usually write a SQL statement to perform this integration. If the databases are on different servers, you could use a SQL Server distributed query. Or, try sourcing the report from an Integration Services package, as we discussed previously in this chapter.
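For the different-servers case, a distributed query can reach the real-time layer through a linked server with four-part names. A minimal sketch, assuming a linked server named RTSERVER pointing at the real-time machine has already been configured:

    -- Today's slice pulled across a linked server (distributed query);
    -- four-part names are server.database.schema.table
    SELECT d.CustomerSegment, SUM(f.OrderAmount) AS OrderAmount
    FROM   RTSERVER.RealTimeDB.rt.FactOrders AS f
    JOIN   RTSERVER.RealTimeDB.rt.DimCustomer AS d
           ON d.CustomerBKey = f.CustomerBKey
    GROUP BY d.CustomerSegment;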
