We will be able to use Integration Services to accomplish the entire process of data extraction, error detection, data routing, and loading into the data warehouse. Independent Integration Services packages will be used to perform one-time merging of the customer lists, loading and cleansing of the facts, and updating of the dimension tables. The special features of Integration Services that are important to these processes are fuzzy lookup of rows using a similarity measure, and the ability of most transforms to direct rows to different data flows based on results of the transform.

Merging and De-Duplicating Customer Data
We have a problem with our customer list that is preventing a complete view of our customers' shopping patterns. The customer lists from the two companies are in good shape in the context of the individual companies, but they were never merged, so we don't have a single consistent list of who our customers are. Creating the single list isn't easy because the information is essentially unreliable: it is entered by customers, either on a hand-written form or over the Web in free-form text. Exact matching, which requires a customer on one list to be identical to a customer on the other, doesn't produce accurate results. If you attempt to match by name alone, you will likely get many false matches, which will combine some distinct customers into one. Including the city in the match likely won't solve the problem. Using the street address might be tempting, but matching on street address usually is not effective because of the variety of ways people enter street data. Street, St., and St all mean the same thing to us, but will obviously not be considered equal in any comparison in the SQL language. What we really want is a matching process that is a bit more forgiving.

Fuzzy Matching
Fuzzy matching looks for the similarities between strings, but doesn't require a perfect match. Integration Services has two transforms that employ fuzzy matching. The Fuzzy Grouping transform helps you detect duplicates within a single table or data stream. The Fuzzy Lookup transform helps you find duplicates between two separate tables or between a data stream and a table. Rather than returning true or false as a match indicator, both transforms return a similarity rank between zero and one to indicate how good a match two rows are.

We have two overlapping customer lists, so we'll use the Fuzzy Lookup transform to provide us with a list of probable matches between them. The transform requires a reference list and an input list. You can think of the reference list as the current "master" and the input list as a new set of customers. Because we are starting from scratch, we will arbitrarily choose one list to be the reference list and check the customers in the other list against it. The reference list will be added directly to the data warehouse as the initial customer list. We will then insert new customers from our second customer list. The new customers are the ones that don't match any customers in the reference table. Our business requirement says we want to keep the latest customer information. That means for customers where we do find a match, we want to update customer rows that have a later LastUpdate date than those already in the reference table.

The most challenging part of fuzzy matching is deciding what fields to use for comparison, and then deciding what level of similarity you will accept as a match. Name is a good criterion, but isn't sufficient by itself because there will be many different customers with the same (or similar) name. Street address and city are good candidates to resolve this problem. Zip code would be good, too, but "close" in numbers has two meanings: 98109 and 18109 differ in only one digit, but are on different sides of the continent.

The concept of "similarity" between strings is probably easy to grasp. It is simply a measure of how close we (or in this case, Integration Services) think the two rows are to representing the same thing. Another measure provided by the Fuzzy Lookup transform, confidence, indicates how reliable Integration Services thinks the similarity measure is. You've probably heard of political pollsters reporting a politician's approval rating as "48 percent, 19 times out of 20." The 19 times out of 20 is a confidence rating: their prediction will, on average, be wrong one time in 20. You might need to experiment with the level of similarity and the confidence level to get the highest number of correct positive matches. It is important to realize that there almost certainly will be some false positives (that is, two customer records that matched within your constraints but are in fact two distinct customers). In an advertising campaign, this might not be too serious; if you are dealing with data that has privacy considerations, however, you might want to review all customer records that were merged to assure yourself that the correct decision was made.
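If you land the Fuzzy Lookup output in a staging table before acting on it, that review is easy to do with a query against the _Similarity and _Confidence columns the transform adds to its output. A minimal sketch, assuming an illustrative staging table named CustomerMatchStaging with the input and matched name columns shown:

SELECT InputName, InputCity, MatchedName, MatchedCity,
       _Similarity, _Confidence
FROM CustomerMatchStaging
WHERE _Similarity BETWEEN 0.70 AND 0.85   -- borderline matches worth a manual review
ORDER BY _Similarity DESC;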
We have created three tables: one for customers who were only in the Bricks and Mortar customer list and need to be added to the master list, one for customers who were on both companies' lists but where the information in the Bricks and Mortar list is more current, and one where the Online store record is more current. The reason we created these tables, rather than immediately doing something with the data, is that we wanted an audit trail of how the data was handled. Your remaining task is to append the "CustomersToBeAdded" rows to the master customer list in the data warehouse, and to update the rows in the master customer list using the data in the "CustomersToUpdate" table. We won't drag you through this step by step, because you already know how to build this kind of transform. You would use a simple copy data flow to move the new customers to the master list, and an Execute SQL task to execute a T-SQL Update statement to update the existing customers with the more current information.
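For reference, the Execute SQL task's Update statement might look something like the following sketch. CustomersToUpdate is the audit table described above; the dimension table name (DimCustomer) and the attribute columns are illustrative and should be adjusted to your customer dimension.

UPDATE c
SET c.CustomerName = u.CustomerName,
    c.StreetAddress = u.StreetAddress,
    c.City = u.City,
    c.LastUpdate = u.LastUpdate
FROM DimCustomer AS c
JOIN CustomersToUpdate AS u
  ON u.CustomerBusinessKey = c.CustomerBusinessKey;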
Data Truncation Errors

When you are merging data from a number of sources, it is not uncommon to find that each source specifies a different length for some attribute that really should be of a consistent length. Often this is a result of laziness on the part of the original designer, who simply accepted a 50-character or 255-character default for a postal code; you will likely choose to define a more appropriate length in the data warehouse. Other times, it is a legitimate difference, and you must decide the maximum length you want to use in the data warehouse. You may choose a length that is less than the maximum of the largest source. In either case, there is a possibility that the data in one of the sources is longer than the destination allows. This will result in a truncation error being raised during the execution of the data flow transform. Truncation errors can also be raised if there will be a loss of significance in a numeric transform, such as assigning a 4-byte integer value greater than 32K to a 2-byte integer. In Integration Services, the default behavior in response to a truncation error is to raise an error and halt the process. If this is not the behavior you want, you can choose either to ignore the row that is causing the error or to have the row redirected to another data flow. If you redirect the row, the column causing the error will be added to the data flow. You can use a Conditional Split transform to further refine the data flow for each column if necessary. In our example, we have set the target lengths to be large enough to accommodate the data from each source, and so have left the truncation error handling set to Fail Component.

Dealing with Missing Dimension Members
Missing dimension members can cause fact rows to be omitted from the data warehouse. If this occurs, your results will be wrong, and there will be no indication of a problem unless you are proactive in detecting and correcting it. We have deliberately encouraged you to use Lookup transforms rather than joins for a simple reason: Lookups raise a trappable error if a member is missing, whereas a join just silently ignores the row. A Lookup gives you a chance to do something about the problem when it occurs. In this section, we show how to configure a Lookup transform to redirect failing rows to a different data flow for remediation.

When you load facts into your data warehouse, they are expected to have a context provided by well-defined dimensions. For example, you expect to have a valid date, store, product, and customer for every sale fact, because you want to analyze your data that way. The dimensions contain members (the known set of possible values) for every day, store, product, and customer. If one of those is missing from a fact row, you can't analyze the data the way you would like, because you don't know how to completely categorize the sale. Or, if we don't handle the problem well, the fact will not be included in the warehouse, and your sales totals will be wrong. Needless to say, most of our customers find this effect undesirable! Regardless of why they are missing, your first task is to determine which incoming fact rows, if any, refer to dimension members that you don't have in your data warehouse.
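To see the difference in relational terms, consider what happens when you resolve surrogate keys with a join instead of a Lookup. The table and column names in this sketch are illustrative, not the sample schema used elsewhere in the book:

-- An inner join silently drops any fact row whose business key has no match.
SELECT f.SalesAmount, c.CustomerKey
FROM StagingSales AS f
JOIN DimCustomer AS c ON c.CustomerBusinessKey = f.CustomerBusinessKey;

-- A left join at least exposes the problem rows as NULL surrogate keys.
SELECT f.*
FROM StagingSales AS f
LEFT JOIN DimCustomer AS c ON c.CustomerBusinessKey = f.CustomerBusinessKey
WHERE c.CustomerKey IS NULL;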
Detecting Missing Members

Remember that when you first receive a fact row, it contains business keys. Part of the work of building a data warehouse is transforming the business keys into surrogate keys. The surrogate keys uniquely identify a member within a dimension. It is during this transformation that you are first going to know whether a member is missing. In Integration Services, you use a Lookup transform to translate a business key into a surrogate key. You use one Lookup transform for each dimension related to the fact. You can control how each Lookup transform behaves when it doesn't find a matching business key. By default, an error is raised, and the Integration Services package stops processing any more data. Although this is effective in preventing bad data from reaching the data warehouse, it doesn't help you in identifying the problem, fixing it, or getting the good data through the process. We want to send the good data on and send the bad data down a different path. The Lookup transform can be configured to do exactly that.
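The reference data for each Lookup is typically nothing more than the business key and surrogate key columns of the dimension table; keeping the reference query that narrow keeps the Lookup's cache small. A sketch, assuming the customer dimension table is named DimCustomer:

SELECT CustomerKey, CustomerBusinessKey
FROM DimCustomer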
Now we have separated the data that has valid business keys from the data with business keys that we don't know about. Let's see what our options are for fixing the initial problem and maximizing the quality of the data in the data warehouse. We could do one of the following: throw the facts away, keep track of the incomplete facts, park the facts and repair them later, or infer a dimension member.
Throwing Away Facts
This is a simple and direct approach, but it doesn't have much place in most data warehouse applications. Your totals are pretty much guaranteed to be wrong. The row's measures are still correct; you just can't assign them a complete context because of the missing dimension member, so finding a way to retain the fact is worthwhile. If you can afford to ignore these facts, all you need to do in your Integration Services data flow is configure the Lookup to redirect the failing rows, but not connect the error output to an OLE DB Destination.

Keeping Track of Incomplete Facts
If you have facts that are missing one or more business keys, it is a good idea to keep track of how many there are and which ones they are. That is what we did in the previous Quick Start: every failing fact row is written to a table, which you can analyze for row counts, or use to evaluate the effect of not including these facts. However, you are still not taking full advantage of the available data.
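A quick query against that table tells you both how many facts were rejected and how much of a measure, such as sales, is currently unallocated. A sketch with illustrative table and column names:

SELECT COUNT(*) AS RejectedFactRows,
       SUM(SalesAmount) AS UnallocatedSalesAmount
FROM RejectedSalesFacts;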
Park and Repair the Facts Later

This approach is good because all the data will eventually make its way to the data warehouse. It can be an essential step when the error is caused by misspelled business keys, which need some human intelligence to decipher and fix. In the meantime, the reports will show lower than actual values. You will need to create a simple application that presents each row to a user for repair and then inserts the row in a table or updates the source so the rows can be put through the ETL process again.

Inferring a Dimension Member
Remember our business problem where a customer fills out a loyalty card at the register at the time of a purchase? We now know the loyalty number of the customer but nothing else about them (because their information has not yet been entered into the system). This sales fact is a candidate for rejection because it refers to a customer who is not in the customer dimension. What if you want the correct totals in your reports, even if the details aren't complete? What if we don't care so much about a member's properties? At least we could include the fact in our analysis right now. If you are sure that the business key is valid, and not a misspelling, you can create a temporary placeholder record in the dimension table, or what Integration Services calls an "inferred" member.

We should only create inferred members for dimensions where we have a high expectation that the business key is correct. In our loyalty card example, the key is scanned in, so there is much less risk that it will be wrong than if it were keyed in by an operator. Why does this matter? Because we want to be able to come back and provide the detailed information when we finally get it. If there's a misspelling, we won't be able to match the business keys, and we'll have an extra member that has no real meaning.

What properties do we really need to create an inferred member? The business key is important; that's what we use to find the surrogate key. The good news is that we have a business key given to us by the source fact row. The surrogate key is simply assigned automatically by SQL Server. Unless the source fact row contains some denormalized information about the member, we are unlikely to be able to supply a value for any other attributes. For example, if the customer is inferred, we will not know their name, address, or anything else about them, because that information is not collected at the point of sale.

Now for a little housekeeping. It is a good idea to mark your inferred members with a simple Boolean flag so that you know that they are incomplete. This flag will also prove useful in working with slowly changing dimensions, which are discussed in Chapter 8, "Managing Changing Data." When we receive the real dimension data for this member, if the flag tells us that the current row is an inferred member, we will just update the current row. This is particularly important for dimensions where we are tracking history, because we don't want to create a new member if we just need to fill in the missing information for the current row. If the flag indicates the current row is not an inferred member, you simply handle the new dimension data as you normally would.
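When the real dimension record finally arrives, the flag drives a simple update rather than the creation of a new member. A minimal sketch, assuming an IsInferred flag column and illustrative attribute and parameter names:

UPDATE DimCustomer
SET CustomerName = @CustomerName,
    StreetAddress = @StreetAddress,
    City = @City,
    IsInferred = 0
WHERE CustomerBusinessKey = @CustomerBusinessKey
  AND IsInferred = 1;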
Data Flow for Inferred Members

We'll modify the missing-member data flow we created previously. Instead of sending the facts to a table for later processing, we'll first route them to an OLE DB Command transform. This will call a stored procedure to insert the business key into the Customer table and return the surrogate key. The stored procedure will also check that we haven't already added this customer since we started this package; if we have, the surrogate key is simply returned. Finally, the two data flows are merged with a Union All task. You might now be wondering why the customer would not be found by the lookup if we have just added it to the customer table. The reason is that the customer table is cached by the Lookup transform, and the new customer doesn't appear in the cache.

Tip: Continuing the Data Flow After an SQL Operation
You can use an OLE DB Command transform rather than an OLE DB Destination to continue the data flow after executing an SQL command, such as an Insert statement. We need this capability here to allow the inferred member to be inserted and then continue the data flow to read the new surrogate key and rejoin the main data flow.
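Here is a minimal sketch of the kind of stored procedure the OLE DB Command transform could call. It assumes DimCustomer has an identity surrogate key (CustomerKey), the business key column (CustomerBusinessKey), and the IsInferred flag described earlier; all names are illustrative.

CREATE PROCEDURE dbo.InsertInferredCustomer
    @CustomerBusinessKey nvarchar(20),
    @CustomerKey int OUTPUT
AS
BEGIN
    -- The Lookup cache will not contain members added during this run,
    -- so check whether an earlier row has already created this customer.
    SELECT @CustomerKey = CustomerKey
    FROM DimCustomer
    WHERE CustomerBusinessKey = @CustomerBusinessKey;

    IF @CustomerKey IS NULL
    BEGIN
        INSERT INTO DimCustomer (CustomerBusinessKey, IsInferred)
        VALUES (@CustomerBusinessKey, 1);

        SET @CustomerKey = SCOPE_IDENTITY();
    END
END

In the OLE DB Command transform, the call would use parameter markers (for example, EXEC dbo.InsertInferredCustomer ?, ? OUTPUT), mapped to the business key column and to the column that carries the new surrogate key down the data flow.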
Legitimately Missing Data
The customer business key can be null in the fact records because not everyone has a loyalty card, so we couldn't capture the customer information when the transaction took place. The difference between this condition and a late-arriving customer member is that the customer business key is null, as opposed to a non-null value that isn't in the customer dimension table yet. We will never know who this customer was, and we have no way of distinguishing one missing customer from another. One thing is certain: We made a sale. If we ignore those records with null customer keys, however, the sales won't show up in the analysis. So, we have a new kind of subset of the facts to handle: legitimately missing data.

Define Special Missing Members
To keep these sales in the data warehouse, we could substitute the missing customer business key with the key to a pre-defined member such as Walk-in, Public, or simply Unknown. We could also just set the surrogate key for this customer to a standard value, such as zero, to indicate that it is missing (you will need to create a member with the surrogate key value of 0). Establishing standard members for the different types of missing members is considered a best practice. It demonstrates that we saw the missing data and consciously categorized it. We didn't just ignore it. You create standard members during the initialization of the dimension table by inserting them directly into the dimension table before loading the real dimension data.
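For example, the Unknown customer with surrogate key 0 could be created during dimension initialization like this (a sketch that assumes DimCustomer uses an identity column for the surrogate key; the name column is illustrative):

SET IDENTITY_INSERT dbo.DimCustomer ON;

INSERT INTO dbo.DimCustomer (CustomerKey, CustomerBusinessKey, CustomerName)
VALUES (0, N'UNKNOWN', N'Unknown customer');

SET IDENTITY_INSERT dbo.DimCustomer OFF;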
Letting Analysis Services Handle Missing Members

You could also leave the surrogate key as null in the data warehouse. Analysis Services 2005 has added support for missing members. All you have to do is enable it in the properties of those dimensions that need it. Sounds easy. But before you reach for your mouse to enable this option, you need to ask whether users will access data only through Analysis Services, or whether they might use the relational data warehouse directly. Enabling support for unknown members does not change the null in the data warehouse, so inner joins will still fail, and this could continue to be a source of error in relational reporting.

Here's our take on this. The data warehouse may be used directly by some end-user tools for a while. Without any extra work, it should provide the same answers as you would get through Analysis Services. There also may be more than one reason that a member is missing, but you only have one kind of null to work with. On that basis, you should define members in a dimension for all the different reasons a member could be missing and always use a non-null surrogate key in the fact tables. Now Analysis Services will work correctly for you, and so will any query against the relational data warehouse. There is now only one possible answer for the same question presented to the relational data warehouse and the cubes. We do appreciate that you could make the data warehouse behave just like Analysis Services using views to deal with the necessary outer joins and substitutions, but now you have two places to maintain business logic, and that is a maintenance headache we prefer to avoid.

Data Flow for Legitimately Missing Members
Our approach for the customer dimension is to augment the data flow to handle fact rows where the business key is null, not just missing from the dimension table. We will add a Conditional Split task to the data flow that handles the inferred members. We'll route the fact rows with null business keys to another path, where we will use a Derived Column task to insert the Missing Member surrogate key into the data flow. Then we route this data flow back into the main flow with a Union All task and carry on with normal processing. As part of the initialization of our dimension, we inserted an Unknown member with a known surrogate key value; for our example, we set the surrogate key value to zero. Figure 7-10 shows the data flow.

Figure 7-10. Data flow for handling legitimately missing members
Adding Row Counts for Auditing
One of the simplest and most effective audit controls you can have on the success of an ETL operation is to compare the number of rows input with the number of rows output. To do this effectively, you must have a simple query for the input, to avoid the possibility that a join omits a row from the result set; instead of joins, you would use a Lookup transform. You use a Row Count transform to assign the number of rows that flowed between two transforms to a package variable. By using two or more Row Count transforms at the beginning and end of the data flow task, you can determine whether the task has dropped any rows, or how many rows have been redirected for error handling, as shown in Figure 7-11. To track this information, you can insert a task into the control flow, immediately after the data flow task, that writes the package variables along with the package name and execution date to an audit table.

Figure 7-11. Auditing using the Row Count transform
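The audit step can be as simple as an Execute SQL task running a parameterized Insert; with an OLE DB connection, the ? markers are mapped to the package variables. A sketch, with an illustrative audit table:

CREATE TABLE dbo.ETLAudit (
    PackageName nvarchar(100),
    ExecutionDate datetime,
    RowsExtracted int,
    RowsLoaded int,
    RowsRedirected int
);

INSERT INTO dbo.ETLAudit
    (PackageName, ExecutionDate, RowsExtracted, RowsLoaded, RowsRedirected)
VALUES (?, ?, ?, ?, ?);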
Preventing Bad Data at the Source
One of the main sources of bad data is manually entered data. The business keys are rarely entered correctly, and why should a user have to remember the exact spelling of a business key in the first place? If you are expecting users to enter data such as budgets into a spreadsheet, you can reduce their frustration and increase data quality by supplying them with a spreadsheet that already has the business keys entered, in protected columns. You can create the spreadsheet either by using MS Query from within Excel to populate the spreadsheet directly from a dimension table, or by using Reporting Services to create the empty report (except for the keys) and then saving it as an Excel spreadsheet. You can also consider populating drop-down lists from the dimension table to assist the user in choosing the correct key.

In our example, sales forecasts are prepared at the Category level. To create a spreadsheet with the business key, the category name, and a column for the forecast amount, we created a report in Reporting Services based on a simple query of the DimProductCategory table, as shown here:

SELECT ProductCategoryBusinessKey AS CategoryID,
       EnglishProductCategoryName AS Name,
       CAST(0 AS money) AS [Q1 Sales Estimate],
       CAST(0 AS money) AS [Q2 Sales Estimate],
       CAST(0 AS money) AS [Q3 Sales Estimate],
       CAST(0 AS money) AS [Q4 Sales Estimate]
FROM DimProductCategory

In the Report Wizard, all columns were added to the Detail level of the report. We used a Table style report, and no subtotals were requested. The name of the report will be used for the sheet name, which you need to supply to the Excel data source when you import the completed spreadsheet. Figure 7-12 shows an example of the report we will export to Excel. You might want to add additional dimension members to the report (such as year, territory, or salesperson), as necessary, for your application.

Figure 7-12. Sample Excel budget template report