Microsoft SQL Server 2005 Integration Services
The Slowly Changing Dimension transform makes it easy to quickly generate complex Slowly Changing Dimension transform data flow. Table 22.1 contains the transform profile.
A full discussion about the nature of dimensions and slowly changing dimensions is beyond the scope of this book, but a brief discussion here will help establish a context for discussing the Slowly Changing Dimension Wizard and transform. A Typical Slowly Changing Dimension
Generally speaking, Business Intelligence is about capturing, analyzing, and understanding the trends, patterns, and history of a business so that companies can make more informed decisions. Facts record transactions and dimensions contain information about the transactions, such as customer name, time, weather, product, and advertising promotions. Slowly changing dimensions (SCD) result from the fact that over time, the world changes and dimension values must also be updated to reflect those changes. Not every dimension is a slowly changing dimension, and not every slowly changing dimension should be handled in the same way. One example of a dimension might be time. Time doesn't change. October 3, 1963, will always be October 3, 1963. However, marriage status, sales district, last names, and, nowadays, even gender dimension attributes do. Even then, you don't necessarily want to update every change in the same way. You might want to maintain the history of some changes, overwrite others, or simply detect when a change is attempted. For example, if the sales districts for a company are reorganized, to ensure that sales reports and queries remain accurate, you might want to keep a record of the previous sales districts for all salespersons. But, because customer name changes happen so rarely and because names are simply identifiers for the same entity, you might choose to overwrite last name attributes of customer dimensions. This is a fundamental problem with dimensions, and the Slowly Changing Dimension Wizard provides a way to implement the logic to handle these changes. "Type 1" Slowly Changing Dimensions
Consider a simple example of an existing customer, Tom Brokenjaw, who is moving from Washington to New York on July 21st. If you don't really need to track the fact that, prior to moving to New York, Tom once lived in Washington, you can simply overwrite the state dimension. Table 22.2 shows how the dimension would appear before updating the dimension and Table 22.3 shows the change.
In this case, the state value is overwritten. This is known as a Type 1 dimension, and in Integration Services is referred to as a Changing Attribute Change. "Type 2" Slowly Changing Dimensions
Now, consider a scenario in which it is important to track the fact that Tom lived in Washington before moving to New York on July 21st. To accomplish this, you would create a new record for Tom in the customer dimension table and mark the old record as follows. Table 22.4 shows the record before updating. Table 22.5 shows the additional record.
Notice that there are two more attributes in the dimension for Type 1 records because Type 2 changes preserve history, providing a way to determine when a particular change took effect. Notice also that the old record now has the current record flag set to FALSE and the new record is flagged as the current record indicating that Tom now resides in New York, thus preserving the history of Tom's state of residence. Slowly changing dimensions using this technique are referred to as Type 2 dimensions and are, by far, the most common type of slowly changing dimension. In Integration Services, this type of change is called a Historical Attribute Change. Additional Slowly Changing Dimension Types
Type 1 or Type 2 are the most common change types, but occasionally you might have late-arriving dimension information or you might want to detect when a dimension attribute changes. Although the SCD does not directly support Type 3 changes, it is possible to detect when a Type 3 change might be appropriate, as described in the following section. Inferred Member
Traditionally, late-arriving dimensions have been exceptional cases, such as data-entry errors or back-dated, real-world events. For example, occasionally, customers don't alert the company of address changes until months after they occurred or product manufacturing processes or parts change but are not correctly updated. With the emergence of new technologies and business models, late-arriving dimensions are becoming more common. For example, grocery stores offer membership cards that you can use for discounts on items in the store. The membership card has a unique ID that tracks the buying habits of the card holder even if the card holder hasn't provided any information about herself to the grocery store. The customer might purchase groceries, which represent facts, but might not provide any information about herself at the time she makes her initial purchase with the membership card. Even without the customer information, the grocery store still needs to capture the transaction information as fact records and the fact records need to reference the membership card dimension record. However, the card holder might later provide her personal information, which then needs to be added to the dimension record as the late-arriving dimension. Another example is furnaces and other home appliances that carry warranties. Large builders sometimes purchase such appliances in bulk to get discounted pricing. When a builder purchases the appliance, they often do not know who the ultimate owner will be or even where the utility will be installed. Only when the owner purchases the house, condo, or other building and registers the appliance will that information become available. When that happens, they will provide new information that needs to be inserted into the existing dimension record for that appliance warranty. Inferred member changes are dimension records created in anticipation of such late-arriving dimension data. Fixed Attribute
A Fixed Attribute change type enforces no changes. This change type detects when a record has changed so that it can be recorded and handled in the downstream data flow. "Type 3" Slowly Changing Dimensions
Type 3 Slowly Changing Dimensions (SCD) are very rare and occur when a business must track the original and current values of only certain specific attributes of a dimension record. Because Type 3 Slowly Changing Dimensions are uncommon and require that the dimension table be modified, the Slowly Changing Dimension Wizard doesn't support them. Tip If you want to implement a Type 3 SCD, you can mark the column you want to update with the Fixed Attribute change type and then capture the Fixed Attribute output rows to discover Type 3 attribute candidates.
The Slowly Changing Dimension Wizard
OK, let's put the Slowly Changing Dimension Wizard through its paces. The package you're about to build is the SCD.dtsx package in the book samples in the S21-Advanced solution. The wizard takes you through six dialog boxes and a summary dialog box, as described in the following sections. The Slowly Changing Dimension Wizard Steps
The following are the general steps the Slowly Changing Dimension Wizard walks you through for defining the additional data flow and the runtime behavior of the Slowly Changing Dimension transform.
Running the Slowly Changing Dimension Wizard
The preceding steps show the dialog boxes the wizard walks you through to generate the data flow. Let's run through the wizard and see what it produces. Make sure you have the AdventureWorks and AdventureWorksDW sample databases installed. To run the Slowly Changing Dimension Wizard, open the designer and perform the following steps:
When you finish the wizard, it takes all the settings you've input and creates a data flow. If you've followed along with the preceding steps, the Slowly Changing Dimension Wizard generates a data flow similar to what is shown in Figure 22.7. Figure 22.7. The data flow resulting from the Slowly Changing Dimension Wizard
The Generated Data Flow
When you complete the Slowly Changing Dimension Wizard, you end up with a data flow that contains a Slowly Changing Dimension transform, with varying numbers of outputs going to various transforms and destination adapters. At execution time, the Slowly Changing Dimension transform plays the role of row diverter. By applying the rules established in the wizard, the SCD transform decides which output the source rows should flow down, letting the downstream data flow process the rows appropriately. Let's take a look at each of the resulting data flows that result from the settings you entered into the wizard. The New Output Data Flow
The New Output data flow handles all new dimension rows. Depending upon whether your dimension table supports historical attribute change types, this output might vary. This is basically a straight insert into the dimension table after ensuring that the Current column is set correctly. In the previous example, the DimEmployee table does support the historical attribute change type, so the resulting data flow for the New Output has a Derived Column transformation to set the CurrentFlag column to trUE. The Historical Attribute Output Data Flow
The Historical Output data flow is similar to the New Output data flow because it is essentially doing the same thinginserting a new record. The difference is that the historical attribute must change the last record current column to indicate that it is now expired. So, the additional data flow has an OLE DB Command transform that sets the current record to be expired with the following command: UPDATE [dbo].[DimEmployee] SET [CurrentFlag] = ? WHERE [EmployeeNationalIDAlternateKey] = ?
This sets the current dimension's CurrentFlag to FALSE and the key is the business key. If you selected the option to set all historical attribute records to be expired in the wizard, the command has the following condition added: UPDATE [dbo].[DimEmployee] SET [CurrentFlag] = ? WHERE [EmployeeNationalIDAlternateKey] = ? AND [CurrentFlag] = '1'
This ensures that when the new dimension record is inserted, only the new one will be considered current. The rest of the data flow for the historical attribute is identical to the New Output and so is merged into the New Output data flow with a Union All. Changing Attribute Updates Data Flow
The Changing Attribute Updates data flow is quite simple because all it is required to do is update the existing attribute in the given dimension record. Recall that you set the MaritalStatus attribute as the Changing Attribute change type. The resulting data flow is simply an OLE DB Command transform with the following update command: UPDATE [dbo].[DimEmployee] SET [MaritalStatus] = ? WHERE [EmployeeNationalIDAlternateKey] = ?
The MaritalStatus value is the new column value from the source and the key is the current business key. Inferred Member Updates Data Flow
The records flowing down this output are bound for existing dimension records, so an update is in order. The OLE DB Command transform on this output performs the following UPDATE command: UPDATE [dbo].[DimEmployee] SET [BirthDate] = ?, [CurrentFlag] = ?, [Gender] = ?, [HireDate] = ?, [LoginID] = ?, [MaritalStatus] = ?, [SalariedFlag] = ?, [SickLeaveHours] = ?, [Title] = ?, [VacationHours] = ? WHERE [EmployeeNationalIDAlternateKey] = ? The attribute values are the new values from the source and the key is the current business key. Fixed Attribute Output Data Flow
The Slowly Changing Dimension Wizard does not automatically generate any data flow for this output, but it might be sent to a destination or other transform to detect possible Type 3 change candidates or as a way to audit columns that should not be changing. Unchanged Output Data Flow
The Slowly Changing Dimension Wizard does not automatically generate any data flow for this output because the SCD transform doesn't do any work for these rows. You can connect this output and the SCD transform pushes all the rows that were current to the output. How the SCD Processes Rows
When you first run the Slowly Changing Dimension Wizard, you might be a bit overwhelmed when it generates up to six outputs, each with their own data flow. Previously, you examined the various data flows that the Slowly Changing Dimension Wizard automatically generated. Now let's take a look at how the SCD transform processes each incoming row. Figure 22.8 graphically shows the process the SCD transform uses for diverting rows down its various outputs and the expected processing that should occur for each output. Figure 22.8. Slowly Changing Dimension transformation processing steps
Performance Considerations
The primary design goals for the Slowly Changing Dimension Wizard and transform were to make the process of building robust SCD handling data flow quick and easy. As you can see, the Slowly Changing Dimension Wizard makes it easy to quickly build data flow for handling slowly changing dimensions and is useful for those wanting to quickly generate such packages. However, after you start dealing with larger dimensional datasets, you'll probably want to modify the automatically generated data flow to improve performance. The following are some optimizations you might consider for optimizing the generated SCD data flow. Limiting the Dimension Recordset
When processing Type 2 slowly changing dimensions, the SCD transform performs a lookup for every incoming row to determine if there is an existing dimension. You can speed up the lookup by eliminating the number of records in the lookup. In fact, you can have the lookup performed on a different table, perhaps a temporary table with only one indexed column containing the subset of the alternate or surrogate keys that are currently active. Staging Outputs
The Slowly Changing Dimension Wizard automatically generated data flow makes heavy use of the OLE DB Command transform. Although powerful and useful, this transform performs row-by-row operations, which can be very expensive. To eliminate the OLE DB Command transform, replace it with an OLE DB Destination and stage the data for both update rows and insert rows. Then, later in the workflow, execute separate INSERT and UPDATE commands using the staged data as the source. Because both INSERT and UPDATE statements allow you to join and compare two or more tables, you can conditionally update and insert rows as necessary. Ensure that you run the updates before the inserts. Running them concurrently can create locking issues. Use a Lookup Transform Before the SCD Transform
Consider using a stock Lookup transform with partial caching turned on right before the SCD to quickly identify rows that don't exist in the dimension table. Those rows that do not exist can be directly inserted. This effectively eliminates the rows flowing through the New Output from the SCD transform and increases the efficiency of the SCD transform lookup step. New Dimension Tables
When building a new dimension table, the SCD transform has a property named IncomingRowChangeType with the default value Detect. If the dimension table is empty, you can set this value to AllNew the first time you run the package and the SCD redirects all rows to the New Output. The Slowly Changing Dimension transform provides a way to quickly and easily build complex dimension update data flow logic that is robust and customizable. The SCD transform supports four dimension change types, including changing, historic, fixed, and inferred member. For many applications, the Slowly Changing Dimension Wizard generated data flow is sufficient. However, if needed for increasingly large data loads, the SCD transform also lends itself well to optimizations. |