We will be able to use Integration Services to accomplish the entire process of data extraction, error detection, data routing, and loading into the data warehouse. Independent Integration Services packages will be used to perform one-time merging of the customer lists, loading and cleansing of the facts, and updating of the dimension tables. The special features of Integration Services that are important to these processes are fuzzy lookup of rows using a similarity measure, and the ability of most transforms to direct rows to different data flows based on results of the transform.

Merging and De-Duplicating Customer Data
We have a problem with our customer list that is preventing a complete view of our customers' shopping patterns. The customer lists from the two companies are in good shape in the context of the individual companies, but they were never merged, so we don't have a single consistent list of who our customers are. Creating the single list isn't easy because the information is essentially unreliable: it is entered by customers, either on a hand-written form or over the Web in free-form text. Exact matching, which requires a customer on one list to be identical to a customer on the other, doesn't produce accurate results. If you attempt to match by name alone, you will likely get many false matches, which will combine some distinct customers into one. Including the city in the match likely won't solve the problem. Using the street address might be tempting, but matching on street address usually is not effective because of the variety of ways people enter street data. Street, St., and St all mean the same thing to us, but will obviously not be considered equal in any comparison in the SQL language. What we really want is a matching process that is a bit more forgiving.

Fuzzy Matching
Fuzzy matching looks for the similarities between strings, but doesn't require a perfect match. Integration Services has two transforms that employ fuzzy matching. The Fuzzy Grouping transform helps you detect duplicates within a single table or data stream. The Fuzzy Lookup transform helps you find duplicates between two separate tables or between a data stream and a table. Rather than returning true or false as a match indicator, both transforms return a similarity rank between zero and one to indicate how good a match two rows are.

We have two overlapping customer lists, so we'll use the Fuzzy Lookup transform to provide us with a list of probable matches between them. The transform requires a reference list and an input list. You can think of the reference list as the current "master" and the input list as a new set of customers. Because we are starting from scratch, we will arbitrarily choose one list to be the reference list and check the customers in the other list against it. The reference list will be added directly to the data warehouse as the initial customer list. We will then insert new customers from our second customer list. The new customers are the ones that don't match any customers in the reference table. Our business requirement says we want to keep the latest customer information. That means for customers where we do find a match, we want to update customer rows that have a later LastUpdate date than those already in the reference table.

The most challenging part of fuzzy matching is deciding what fields to use for comparison, and then deciding what level of similarity you will accept as a match. Name is a good criterion, but isn't sufficient by itself because there will be many different customers with the same (or similar) name. Street address and city are good candidates to resolve this problem. Zip code would be good, too, but "close" in numbers has two meanings: 98109 and 18109 differ in only one digit, but are on different sides of the continent.

The concept of "similarity" between strings is probably easy to grasp. It is simply a measure of how close we (or in this case, Integration Services) think the two rows are to representing the same thing. Another measure provided by the Fuzzy Lookup transform, confidence, indicates how reliable Integration Services thinks the similarity measure is. You've probably heard of political pollsters reporting a politician's approval rating as "48 percent, 19 times out of 20." The 19 times out of 20 is a confidence rating: their prediction will, on average, be wrong one time in 20. You might need to experiment with the level of similarity and the confidence level to get the highest number of correct positive matches. It is important to realize that there almost certainly will be some false positives (that is, two customer records that matched within your constraints but are in fact two distinct customers). In an advertising campaign, this might not be too serious; if you are dealing with data that has privacy considerations, however, you might want to review all customer records that were merged to assure yourself that the correct decision was made.
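If you land the Fuzzy Lookup output in a staging table before acting on it, that review is easy to do with a query against the _Similarity and _Confidence columns the transform adds to its output. A minimal sketch, assuming an illustrative staging table named CustomerMatchStaging with the input and matched name columns shown:

SELECT InputName, InputCity, MatchedName, MatchedCity,
       _Similarity, _Confidence
FROM CustomerMatchStaging
WHERE _Similarity BETWEEN 0.70 AND 0.85   -- borderline matches worth a manual review
ORDER BY _Similarity DESC;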
We have created three tables: one for customers who were only in the Bricks and Mortar customer list and need to be added to the master list, one for customers who were on both companies' lists but where the information in the Bricks and Mortar list is more current, and one where the Online store record is more current. The reason we created these tables, rather than immediately doing something with the data, is that we wanted an audit trail of how the data was handled. Your remaining task is to append the "CustomersToBeAdded" rows to the master customer list in the data warehouse, and to update the rows in the master customer list using the data in the "CustomersToUpdate" table. We won't drag you through this step by step, because you already know how to build this kind of transform. You would use a simple copy data flow to move the new customers to the master list, and an Execute SQL task to execute a T-SQL Update statement to update the existing customers with the more current information.
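For reference, the Execute SQL task's Update statement might look something like the following sketch. CustomersToUpdate is the audit table described above; the dimension table name (DimCustomer) and the attribute columns are illustrative and should be adjusted to your customer dimension.

UPDATE c
SET c.CustomerName = u.CustomerName,
    c.StreetAddress = u.StreetAddress,
    c.City = u.City,
    c.LastUpdate = u.LastUpdate
FROM DimCustomer AS c
JOIN CustomersToUpdate AS u
  ON u.CustomerBusinessKey = c.CustomerBusinessKey;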
Data Truncation Errors

When you are merging data from a number of sources, it is not uncommon to find that each source specifies a different length for some attribute that really should be of a consistent length. Often this is a result of laziness on the part of the original designer, who simply accepted a 50-character or 255-character default for a postal code; you will likely choose to define a more appropriate length in the data warehouse. Other times, it is a legitimate difference, and you must decide the maximum length you want to use in the data warehouse. You may choose a length that is less than the maximum of the largest source. In either case, there is a possibility that the data in one of the sources is longer than the destination allows. This will result in a truncation error being raised during the execution of the data flow transform. Truncation errors can also be raised if there will be a loss of significance in a numeric transform, such as assigning a 4-byte integer value greater than 32K to a 2-byte integer. In Integration Services, the default behavior in response to a truncation error is to raise an error and halt the process. If this is not the behavior you want, you can choose either to ignore the row that is causing the error or to have the row redirected to another data flow. If you redirect the row, the column causing the error will be added to the data flow. You can use a Conditional Split transform to further refine the data flow for each column if necessary. In our example, we have set the target lengths to be large enough to accommodate the data from each source, and so have left the truncation error handling set to Fail Component.

Dealing with Missing Dimension Members
Missing dimension members can cause fact rows to be omitted from the data warehouse. If this occurs, your results will be wrong, and there will be no indication of a problem unless you are proactive in detecting and correcting it. We have deliberately encouraged you to use Lookup transforms rather than joins for a simple reason: Lookups raise a trappable error if a member is missing, whereas a join just silently ignores the row. A Lookup gives you a chance to do something about the problem when it occurs. In this section, we show how to configure a Lookup transform to redirect failing rows to a different data flow for remediation.

When you load facts into your data warehouse, they are expected to have a context provided by well-defined dimensions. For example, you expect to have a valid date, store, product, and customer for every sale fact, because you want to analyze your data that way. The dimensions contain members (the known set of possible values) for every day, store, product, and customer. If one of those is missing from a fact row, you can't analyze the data the way you would like, because you don't know how to completely categorize the sale. Or, if we don't handle the problem well, the fact will not be included in the warehouse, and your sales totals will be wrong. Needless to say, most of our customers find this effect undesirable! Regardless of why they are missing, your first task is to determine which incoming fact rows, if any, refer to dimension members that you don't have in your data warehouse.
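To see the difference in relational terms, consider what happens when you resolve surrogate keys with a join instead of a Lookup. The table and column names in this sketch are illustrative, not the sample schema used elsewhere in the book:

-- An inner join silently drops any fact row whose business key has no match.
SELECT f.SalesAmount, c.CustomerKey
FROM StagingSales AS f
JOIN DimCustomer AS c ON c.CustomerBusinessKey = f.CustomerBusinessKey;

-- A left join at least exposes the problem rows as NULL surrogate keys.
SELECT f.*
FROM StagingSales AS f
LEFT JOIN DimCustomer AS c ON c.CustomerBusinessKey = f.CustomerBusinessKey
WHERE c.CustomerKey IS NULL;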
Detecting Missing Members

Remember that when you first receive a fact row, it contains business keys. Part of the work of building a data warehouse is transforming the business keys into surrogate keys. The surrogate keys uniquely identify a member within a dimension. It is during this transformation that you are first going to know whether a member is missing. In Integration Services, you use a Lookup transform to translate a business key into a surrogate key. You use one Lookup transform for each dimension related to the fact. You can control how each Lookup transform behaves when it doesn't find a matching business key. By default, an error is raised, and the Integration Services package stops processing any more data. Although this is effective in preventing bad data from reaching the data warehouse, it doesn't help you in identifying the problem, fixing it, or getting the good data through the process. We want to send the good data on and send the bad data down a different path. The Lookup transform can be configured to do exactly that.
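The reference data for each Lookup is typically nothing more than the business key and surrogate key columns of the dimension table; keeping the reference query that narrow keeps the Lookup's cache small. A sketch, assuming the customer dimension table is named DimCustomer:

SELECT CustomerKey, CustomerBusinessKey
FROM DimCustomer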
Now we have separated the data that has valid business keys from the data with business keys that we don't know about. Let's see what our options are for fixing the initial problem and maximizing the quality of the data in the data warehouse. We could do one of the following: throw the facts away, keep track of the incomplete facts, park the facts and repair them later, or infer a dimension member.
Throwing Away Facts
This is a simple and direct approach, but it doesn't have much place in most data warehouse applications. Your totals are pretty much guaranteed to be wrong. The row's measures are still correct; you just can't assign them a complete context because of the missing dimension member, so finding a way to retain the fact is worthwhile. If you can afford to ignore these facts, all you need to do in your Integration Services data flow is configure the Lookup to redirect the failing rows, but not connect the error output to an OLE DB Destination.

Keeping Track of Incomplete Facts
If you have facts that are missing one or more business keys, it is a good idea to keep track of how many there are and which ones they are. That is what we did in the previous Quick Start: every failing fact row is written to a table, which you can analyze for row counts, or use to evaluate the effect of not including these facts. However, you are still not taking full advantage of the available data.
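A quick query against that table tells you both how many facts were rejected and how much of a measure, such as sales, is currently unallocated. A sketch with illustrative table and column names:

SELECT COUNT(*) AS RejectedFactRows,
       SUM(SalesAmount) AS UnallocatedSalesAmount
FROM RejectedSalesFacts;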
Park and Repair the Facts Later

This approach is good because all the data will eventually make its way to the data warehouse. It can be an essential step when the error is caused by misspelled business keys, which need some human intelligence to decipher and fix. In the meantime, the reports will show lower than actual values. You will need to create a simple application that presents each row to a user for repair and then inserts the row in a table or updates the source so the rows can be put through the ETL process again.

Inferring a Dimension Member
Remember our business problem where a customer fills out a loyalty card at the register at the time of a purchase? We now know the loyalty number of the customer but nothing else about them (because their information has not yet been entered into the system). This sales fact is a candidate for rejection because it refers to a customer who is not in the customer dimension. What if you want the correct totals in your reports, even if the details aren't complete? What if we don't care so much about a member's properties? At least we could include the fact in our analysis right now. If you are sure that the business key is valid, and not a misspelling, you can create a temporary placeholder record in the dimension table, or what Integration Services calls an "inferred" member.

We should only create inferred members for dimensions where we have a high expectation that the business key is correct. In our loyalty card example, the key is scanned in, so there is much less risk that it will be wrong than if it were keyed in by an operator. Why does this matter? Because we want to be able to come back and provide the detailed information when we finally get it. If there's a misspelling, we won't be able to match the business keys, and we'll have an extra member that has no real meaning.

What properties do we really need to create an inferred member? The business key is important; that's what we use to find the surrogate key. The good news is that we have a business key given to us by the source fact row. The surrogate key is simply assigned automatically by SQL Server. Unless the source fact row contains some denormalized information about the member, we are unlikely to be able to supply a value for any other attributes. For example, if the customer is inferred, we will not know their name, address, or anything else about them, because that information is not collected at the point of sale.

Now for a little housekeeping. It is a good idea to mark your inferred members with a simple Boolean flag so that you know that they are incomplete. This flag will also prove useful in working with slowly changing dimensions, which are discussed in Chapter 8, "Managing Changing Data." When we receive the real dimension data for this member, if the flag tells us that the current row is an inferred member, we will just update the current row. This is particularly important for dimensions where we are tracking history, because we don't want to create a new member if we just need to fill in the missing information for the current row. If the flag indicates the current row is not an inferred member, you simply handle the new dimension data as you normally would.
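When the real dimension record finally arrives, the flag drives a simple update rather than the creation of a new member. A minimal sketch, assuming an IsInferred flag column and illustrative attribute and parameter names:

UPDATE DimCustomer
SET CustomerName = @CustomerName,
    StreetAddress = @StreetAddress,
    City = @City,
    IsInferred = 0
WHERE CustomerBusinessKey = @CustomerBusinessKey
  AND IsInferred = 1;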
Data Flow for Inferred Members

We'll modify the missing-member data flow we created previously. Instead of sending the facts to a table for later processing, we'll first route them to an OLE DB Command transform. This will call a stored procedure to insert the business key into the Customer table and return the surrogate key. The stored procedure will also check that we haven't already added this customer since we started this package; if we have, the surrogate key is simply returned. Finally, the two data flows are merged with a Union All task. You might now be wondering why the customer would not be found by the lookup if we have just added it to the customer table. The reason is that the customer table is cached by the Lookup transform, and the new customer doesn't appear in the cache.

Tip: Continuing the Data Flow After an SQL Operation
You can use an OLE DB Command transform rather than an OLE DB Destination to continue the data flow after executing an SQL command, such as an Insert statement. We need this capability here to allow the inferred member to be inserted and then continue the data flow to read the new surrogate key and rejoin the main data flow.
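Here is a minimal sketch of the kind of stored procedure the OLE DB Command transform could call. It assumes DimCustomer has an identity surrogate key (CustomerKey), the business key column (CustomerBusinessKey), and the IsInferred flag described earlier; all names are illustrative.

CREATE PROCEDURE dbo.InsertInferredCustomer
    @CustomerBusinessKey nvarchar(20),
    @CustomerKey int OUTPUT
AS
BEGIN
    -- The Lookup cache will not contain members added during this run,
    -- so check whether an earlier row has already created this customer.
    SELECT @CustomerKey = CustomerKey
    FROM DimCustomer
    WHERE CustomerBusinessKey = @CustomerBusinessKey;

    IF @CustomerKey IS NULL
    BEGIN
        INSERT INTO DimCustomer (CustomerBusinessKey, IsInferred)
        VALUES (@CustomerBusinessKey, 1);

        SET @CustomerKey = SCOPE_IDENTITY();
    END
END

In the OLE DB Command transform, the call would use parameter markers (for example, EXEC dbo.InsertInferredCustomer ?, ? OUTPUT), mapped to the business key column and to the column that carries the new surrogate key down the data flow.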
Legitimately Missing Data
The customer business key can be null in the fact records because not everyone has a loyalty card, so we couldn't capture the customer information when the transaction took place. The difference between this condition and a late-arriving customer member is that the customer business key is null, as opposed to a non-null value that isn't in the customer dimension table yet. We will never know who this customer was, and we have no way of distinguishing one missing customer from another. One thing is certain: We made a sale. If we ignore those records with null customer keys, however, the sales won't show up in the analysis. So, we have a new kind of subset of the facts to handle: legitimately missing data.

Define Special Missing Members
To keep these sales in the data warehouse, we could substitute the missing customer business key with the key to a pre-defined member such as Walk-in, Public, or simply Unknown. We could also just set the surrogate key for this customer to a standard value, such as zero, to indicate that it is missing (you will need to create a member with the surrogate key value of 0). Establishing standard members for the different types of missing members is considered a best practice. It demonstrates that we saw the missing data and consciously categorized it. We didn't just ignore it. You create standard members during the initialization of the dimension table by inserting them directly into the dimension table before loading the real dimension data.
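For example, the Unknown customer with surrogate key 0 could be created during dimension initialization like this (a sketch that assumes DimCustomer uses an identity column for the surrogate key; the name column is illustrative):

SET IDENTITY_INSERT dbo.DimCustomer ON;

INSERT INTO dbo.DimCustomer (CustomerKey, CustomerBusinessKey, CustomerName)
VALUES (0, N'UNKNOWN', N'Unknown customer');

SET IDENTITY_INSERT dbo.DimCustomer OFF;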
Letting Analysis Services Handle Missing Members

You could also leave the surrogate key as null in the data warehouse. Analysis Services 2005 has added support for missing members. All you have to do is enable it in the properties of those dimensions that need it. Sounds easy. But before you reach for your mouse to enable this option, you need to ask whether users will access data only through Analysis Services, or whether they might use the relational data warehouse directly. Enabling support for unknown members does not change the null in the data warehouse, so inner joins will still fail, and this could continue to be a source of error in relational reporting.

Here's our take on this. The data warehouse may be used directly by some end-user tools for a while. Without any extra work, it should provide the same answers as you would get through Analysis Services. There also may be more than one reason that a member is missing, but you only have one kind of null to work with. On that basis, you should define members in a dimension for all the different reasons a member could be missing and always use a non-null surrogate key in the fact tables. Now Analysis Services will work correctly for you, and so will any query against the relational data warehouse. There is now only one possible answer for the same question presented to the relational data warehouse and the cubes. We do appreciate that you could make the data warehouse behave just like Analysis Services using views to deal with the necessary outer joins and substitutions, but now you have two places to maintain business logic, and that is a maintenance headache we prefer to avoid.

Data Flow for Legitimately Missing Members
Our approach for the customer dimension is to augment the data flow to handle fact rows where the business key is null, not just missing from the dimension table. We will add a Conditional Split task to the data flow that handles the inferred members. We'll route the fact rows with null business keys to another path, where we will use a Derived Column task to insert the Missing Member surrogate key into the data flow. Then we route this data flow back into the main flow with a Union All task and carry on with normal processing. As part of the initialization of our dimension, we inserted an Unknown member with a known surrogate key value; for our example, we set the surrogate key value to zero. Figure 7-10 shows the data flow.

Figure 7-10. Data flow for handling legitimately missing members
Adding Row Counts for Auditing
One of the simplest and most effective audit controls you can have on the success of an ETL operation is to compare the number of rows input with the number of rows output. To do this effectively, you must have a simple query for the input, to avoid the possibility that a join omits a row from the result set; instead of joins, you would use a Lookup transform. You use a Row Count transform to assign the number of rows that flowed between two transforms to a package variable. By using two or more Row Count transforms at the beginning and end of the data flow task, you can determine whether the task has dropped any rows, or how many rows have been redirected for error handling, as shown in Figure 7-11. To track this information, you can insert a task into the control flow, immediately after the data flow task, that writes the package variables along with the package name and execution date to an audit table.

Figure 7-11. Auditing using the Row Count transform
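The audit step can be as simple as an Execute SQL task running a parameterized Insert; with an OLE DB connection, the ? markers are mapped to the package variables. A sketch, with an illustrative audit table:

CREATE TABLE dbo.ETLAudit (
    PackageName nvarchar(100),
    ExecutionDate datetime,
    RowsExtracted int,
    RowsLoaded int,
    RowsRedirected int
);

INSERT INTO dbo.ETLAudit
    (PackageName, ExecutionDate, RowsExtracted, RowsLoaded, RowsRedirected)
VALUES (?, ?, ?, ?, ?);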
Preventing Bad Data at the Source
One of the main sources of bad data is manually entered data. The business keys are rarely entered correctly, and why should a user have to remember the exact spelling of a business key in the first place? If you are expecting users to enter data such as budgets into a spreadsheet, you can reduce their frustration and increase data quality by supplying them with a spreadsheet that already has the business keys entered, in protected columns. You can create the spreadsheet either by using MS Query from within Excel to populate the spreadsheet directly from a dimension table, or by using Reporting Services to create the empty report (except for the keys) and then saving it as an Excel spreadsheet. You can also consider populating drop-down lists from the dimension table to assist the user in choosing the correct key.

In our example, sales forecasts are prepared at the Category level. To create a spreadsheet with the business key, the category name, and a column for the forecast amount, we created a report in Reporting Services based on a simple query of the DimProductCategory table, as shown here:

SELECT ProductCategoryBusinessKey AS CategoryID,
       EnglishProductCategoryName AS Name,
       CAST(0 AS money) AS [Q1 Sales Estimate],
       CAST(0 AS money) AS [Q2 Sales Estimate],
       CAST(0 AS money) AS [Q3 Sales Estimate],
       CAST(0 AS money) AS [Q4 Sales Estimate]
FROM DimProductCategory

In the Report Wizard, all columns were added to the Detail level of the report. We used a Table style report, and no subtotals were requested. The name of the report will be used for the sheet name, which you need to supply to the Excel data source when you import the completed spreadsheet. Figure 7-12 shows an example of the report we will export to Excel. You might want to add additional dimension members to the report (such as year, territory, or salesperson), as necessary, for your application.

Figure 7-12. Sample Excel budget template report