Practical Business Intelligence with SQL Server 2005
SQL Server Integration Services is the tool we will use to bring in our dimension data and our sales and budget facts from the source systems. Integration Services will also be used to transform the data into the shape we want before it is loaded into our data warehouse. At a high level, all you want to do is copy some data from one or more data sources to the data warehouse. However, we are at the point now where you need to specify in detail just how to do that, and there is more to this than simply copying the data. The sales and budget facts depend on the dimension data, so we must load the dimensions first. For simplicity, we have elected to reload all the data in our dimension and fact tables rather than trying to determine what has changed. Our data volume is small enough that we can get away with this. We address how you can load only the changes to the dimensions and facts in Chapter 8. We need to accomplish several things in our ETL process. We need to load the dimensions, and then the sales facts can be loaded. Annual forecasts are prepared toward the end of each year, and these facts are loaded independently of the other dimensions or facts. When the data loading is completed, we need to process the dimensions and cubes in Analysis Services. Getting Started with Integration Services
We are going to create an Integration Services project to load our dimensions and facts into the data warehouse. The sequence of tasks and transforms you define in Integration Services are saved in packages, which can reside in SQL Server or the file system. We like to create modular packages that have a single purpose, instead of creating very large packages that do everything. Later, we will add a package for loading the sales data, and another to load the budgets. We strongly encourage the use of integrated source control, such as Visual Source Safe 2005 or Visual Studio Team System, to prevent multiple developers from inadvertently modifying the same package. Data Sources and Destinations
An Integration Services project usually needs one or more sources of data and a destination for the transformed data. In our project, we want to populate our data warehouse with data from a number of data sources: the dimension data and the sales from our on-line transaction processing (OLTP) systems, and the budgets from spreadsheets. Integration Services uses connection managers to connect data with tasks and transforms in Integration Services. Connection managers specify a database in an instance of SQL Server or other relational database such as Oracle or DB2, a file, a spreadsheet, or one of several other data sources. We find it useful to define these as soon as we create a package. We'd like to clarify a potential area of confusion about Data Source objects. These are not connection managers, but you can create a connection manager from a data source object. The advantage of a Data Source object is that it is visible to all packages in a project, and any connection managers based on a Data Source object are synchronized. If you change where a Data Source object points, all connection managers based on that data source will change to the new location. Note that this only applies during design. It is the connection manager that is exposed for configuration at runtime. This is discussed later in the section on deployment. You don't need to have all your data in one source, or even one type of sourceit can come from any number of different sources. If you don't see a data provider for your particular data source, there is usually one available from third parties. A quick search for "ole db providers" using your favorite search engine should locate what you need. For our customer, we'll need a connection to the data warehouse on SQL Server, a connection to each of the sales databases, and a connection to the file location where the Excel budget data is kept. If you are looking down the road to deploying your solution to another environment, rest easy. You can dynamically change where a connection manager points to at runtime by assigning a variable to the appropriate property of the connection manager, such as the ServerName property. You can also set up an external runtime configuration that will set any property of a connection manager. We'll look at how to do that in a later section on deployment.
Tip: Be Consistent in Naming Connection Managers When it comes time to deploy your solution from development to other environments, you will want to redirect your data sources and destinations. The configuration for each connection manager is usually stored by name in a SQL Server table or an XML file. If you use the same name in each Integration Services package for connection managers that refer to the same database, you will only have to change a few rows in a configuration table or an attribute in a few XML configuration files. Note that connection manager names are case sensitive.
Loading the Dimensions
We are going to create one package for each dimension we want to load. All these packages will be part of the project we just created. Integration Services explicitly separates the design of the process flow from the design of the data flow. This separation simplifies the creation of ETL processes, and makes the process you implement much more understandable to others who might have to maintain the application. You first define the sequence of events that need to take place in order to populate the data warehouse. This is called the Control Flow. Separately, you define how each table in the data warehouse is populated from the data sources and what transformations the data must go through before it is ready for the data warehouse. This is called the Data Flow. Each package has one control flow. You can have multiple data flows in a control flow, and you will have a data flow for each entity you are populating in the data warehouse. Defining the Control Flow
We'll start by loading the Sales Territory dimension table. In this section, we create an Integration Services package you can use as a model for loading dimension tables from any source. We hasten to point out that we will deal with what are called slowly changing dimensions (SCDs) in a separate chapter because of the extra complexity needed to handle them. Right now, we're just going to ease into loading what we will assume to be static dimensions. This is appropriate for simple dimensions such as status, condition, or gender. It is also appropriate for situations where you don't care about the historical values of a dimension and will be reloading all your data when you update the data warehouse, or for when you are just working on a proof of concept. In our experience, most applications have an implicit requirement to maintain history, so you should plan on accommodating this in your designs. Because our control flow for loading a dimension consists of just copying data from a source to a destination, simply drag a Data Flow task from the Toolbox onto the surface of the Control Flow tab, as shown in Figure 4-4. Click the label to rename the Data Flow task to Load Sales Territories. Figure 4-4. Initial control flow
Finally, you are ready to build the piece of the puzzle that you originally set out to do: moving some data into the data warehouse. Defining the Data Flow
Data flow processing is initiated when a Data Flow task is executed in a control flow. In a Data Flow task, for one entity in the data warehouse, you define the flow of data from source to destination and the transforms needed along the way. The other streams required to populate other tables in the data warehouse will be handled by other Data Flow tasks. You can apply as many transformations as needed to prepare the data for the data warehouse. In this initial example, we just copy a subset of the columns from the source to the data warehouse.
Tip: SQL Server Destinations Only Work on a Local Server We did not choose the SQL Server Destination in the Quick Start because it will only work when the database is on the same server that the package is running on. If you know you will always be running on the same server, you should definitely choose the SQL Server Destination, because it is faster than the OLE DB Destination. Testing Your Package
To try out your package, simply click the green Play button (or press F5, or choose Debug/Start). The data flow between the source and destination will show you how many rows were sent to the destination. You should see the task turn green when it has completed. The package is still in debug mode. To continue to work with the package, you need to stop debugging by pressing Shift+F5 or choosing Debug/Stop debugging. You're so excited that it actually worked, you call the boss over to see it go, and press Play again. After what seems to be an interminable delay, the source task turns green, but the destination turns red. It didn't work the second time. You claim it is just the demonstration effect, but deep down you know something is missing. Where do you look to see what went wrong? Click the tab labeled Progress to see a detailed list of the results of each step, as shown in Figure 4-6. Scroll down to find the row with an exclamation mark and hover the mouse over the line to view the entire error message. The message autohides after a short time. You can right-click it, copy it to the clipboard, and paste it into Notepad to make it easier to read the message. Figure 4-6. Examining the progress log
In the message, you'll see that there is a violation of a unique key constraint. The business key is required to be unique, and that's a good thing. We just tried to load the same data twice. One of the constraints prevented an accidental duplication of the dimensional data, but we need to be able to run this task more than once! We need to change the control flow so that the table is emptied before we reload.
Making Your Package Repeatable
The problem with the task we just created is that the data in the destination is not overwritten. Instead, we are inserting additional data even if it's the same data. A simple fix for this is to delete the data at the destination just before we insert the new data. All we need to do is issue a Truncate Table SQL statement. We can do this by modifying the control flow, adding a simple Execute SQL task before the Load Sales Territory task.
Now you can bring back the boss and click Play with confidence. The table will be emptied first, and then repopulated with the current contents of the source, so there will be no problem rerunning this task. Inspecting the Data Flow
Sometimes, your transforms don't work as you expected, and the problem is more complex that forgetting to empty a table. If you want to see what data is flowing between two data flow components, you can add a Data Viewer to monitor the output of a task. Simply right-click the flow line you want to examine, and click Add. Choose Grid to show a table of the rows being passed between the data flow components. The Data Viewer shows you a batch of rows, not a continuous stream of data. The size of the batch depends on the size of the rows. The data flow pauses at the end of a batch, allowing you to inspect the data in that batch. Press the Play button (green arrow) to continue. Completing the Dimensions
You've now got a general idea how a simple data flow is built. You can now go ahead and build packages for loading the rest of the dimensions. Use the same Integration Services project for these dimensions; just create a new package by right-clicking the Packages node in the Solution Explorer pane. After you have the dimensions in place, you can load the facts. Loading the Fact Table
Unlike our dimension tables, the source records for the Shipments facts come from more than one table. The sales data is in two major tables: the SalesHeader and the SalesDetail tables. We also need to include the ProductCostHistory table so that we'll know how much the items cost when they were shipped. The header contains the business keys for sales-person, territory, the dates, and the customer. The detail table contains the business keys for product and special offer, as well as the measures for quantity shipped, unit price, and discount. To answer questions such as "Where do we ship each of our products?" and "What was our cost for the products we shipped?," we need to have columns from all three source tables in our Shipments fact table. The other transformation that we need to perform is to translate the business keys in the incoming source records to the corresponding surrogate keys used by the dimensions. Working with Multiple Source Tables
You have several choices about how to create this denormalized view of your shipments that joins the three source tables together. You could create a view in the original data source and directly query this view through the data source in your data flow. But often, you can't make any changes or additions to the source database, so you need to have an alternative method for loading this data. You can either load each table independently and denormalize later with a view, or you use a SQL query in your data source that joins the tables you need. For demonstration purposes, we're going to load from multiple tables using a SQL query into a denormalized table. In our previous example where we loaded a dimension, the OLE DB data source Data Access Mode property was set to Table or view. To specify a query that we will build ourselves, you need to set the Data Access Mode to SQL Command. If you click the Query Builder button, you will see a fairly familiar graphical query designer. Use the Add Table icon (plus sign over a grid) to select the three tables we need (SalesOrderHeader, SalesOrderDetail, and ProductCostHistory) and create the joins by dragging SalesOrderID from the Header to the Detail table, and ProductID from the Detail to the CostHistory table. Finally, qualify which time frame in CostHistory we are referring to by adding a where clause: WHERE OrderDate >= ProductCostHistory.StartDate and OrderDate < ProductCostHistory.EndDate
Check off the columns we want to use. (We won't talk you through checking off each one; you can decide what you need.) Now you can click OK twice, and you have a multi-table data stream in your data flow (see Figure 4-8). Figure 4-8. Creating a multiple-table query in a data source
Looking Up the Dimension Keys
The next step is to translate the business keys in the data stream from our data source into the surrogate keys used by the dimension tables in the data warehouse. The process is simple: Using the business key, look up the corresponding row in the dimension table. Sounds like a join, so why not use a join? We want to avoid hardcoding any database names in a query. Another reason is to ensure data quality by ensuring that we have a matching dimension member. We discuss this in Chapter 7. Another reason is to avoid hardcoding any references to databases on other servers. You would not be able to change the server name to use the package in another environment without editing the package. This only invites maintenance headaches. Integration Services has a Lookup transform that can translate business keys into surrogate keys, and also meets both criteria of allowing the dimension table to be in a separate database from the source fact table, and it will assist us with data quality. You use one Lookup transform for each business key you are translating, so you will end up with a series of lookups in the data flows of the packages where you are populating a fact table. For each lookup transform, we need to identify which business key we want to translate into a surrogate key, which database and table contains the dimension table that maps these two keys, and finally which column in the dimension table is the surrogate key we want to return to the data stream. To configure a Lookup transform to translate business keys into surrogate keys, drag a Lookup transform from the Toolbox onto the data flow surface, and connect the source data stream to it. Rename it to something descriptive such as Lookup Product. Double-click the transform to edit it. You need to specify a connection manager that points to the database containing the reference table, which is one of our dimension tables in the data warehouse. Next, select the reference table from the Use a table or view list. On the Columns tab, specify which columns in the data flow are to be used to uniquely specify a row in the reference table. Just drag one or more columns from the available input columns onto the Available Lookup Columns in the reference table. This is shown in Figure 4-9. Figure 4-9. Configuring a Lookup transform
Finally, check one ore more column from the Available Lookup Columns that you want to have joining the data flow. This is usually the surrogate key column in the dimension table. A word of caution applies to Lookup transforms. The matching columns or join criteria is case sensitive, unlike a join in SQL. You might need to create a derived column to set the case of the business keys to a known statefor example, all uppercase. Add a Lookup transform for each of the other dimensions using the same pattern we used for the Product dimension. To complete the data flow, drag an OLE DB Data Destination from the Toolbox onto the dataflow surface. Rename this destination to Sales Data. Reuse the Data Warehouse connection manager and set the table to your sales data table. (If you haven't created this table in the data warehouse yet, you can easily do so by clicking the New button beside the table field.) You should end up with a data flow that looks like Figure 4-10. Figure 4-10. Fact table data flow
Loading the Budget Information from Excel
The previous examples loaded data from relational tables. Now, let's look at loading data from an Excel spreadsheet. We want to be able to import the budget figures into the data warehouse so that we can compare them with the actual results. The company creates sales forecasts for each of the four quarters in a year, for each region and product. These forecasts are at the product level because they are used for plant capacity planning as well as revenue forecasts. Budgeting will need its own package. You add a new package to a project by right-clicking on Packages in the solution explorer, and choosing New Package. Rename the package to Load Budgets. You will follow the same pattern of adding a data flow task to the control flow, along with any other tasks you need, such as one to empty the budget table in the data warehouse. For the data flow, you start with an Excel data source. When you edit the data source and click New to create the connection manager, you will see that the dialogue is different than for a database source. You browse to the Excel file containing your budget, and specify which sheet in the workbook you want to use.
Matching Source and Destination Data Types
One important thing you need to do is set the true data types of each column. Excel chooses a type for each column based on an examination of the first eight rows. That choice doesn't always match the data warehouse requirements. For example, numeric columns come in as float, whereas general comes in as nvarchar(255). We recommend using a Data Conversion transform right after the Excel data source to set the data types to what you really need. Reshaping the Data
Fact tables, such as the budget forecasts we are working with, need to have exactly one occurrence of each measure per row, along with one member of each dimension. This isn't always the way people like to see the information presented to them on a spreadsheet. Looking back at the spreadsheet we are importing, shown in Figure 4-2, you see that the rows aren't normalized. Four forecasts for each region and product category repeat on each row; and the time dimension is not a value in the row, it is a column name. We really need the forecast on one row for each quarter. This will allow us to create a cube to compare the forecasts with the actual revenue. Fortunately, we can use an Unpivot transform to transpose these repeating fields onto a row of their own. In our case, one row from the spreadsheet will become four rows after the Unpivot transform.
In the Unpivot transformation, we created a new column for Quarter. Whenever a transform creates a new column, check that the column has a data type compatible with the column it is mapped to in the next step in the data flow. You can check and set the data type for a new column created in a transform by using the Input and Output Properties tab on the Advanced Editor for that transform. Start the Advanced Editor by right-clicking the transform. The Unpivot transform configuration should look like Figure 4-11. Figure 4-11. Transposing column values to rows
We have a couple of small tasks remaining before we can save the data into the warehouse. We need to use the Lookup task to find the Territory surrogate key for the region in the budget, and another Lookup task to find the product record to give us the product surrogate key, unit price, and unit cost. Finally, we want to compute the total price and cost for this product for this region and quarter. Adding New Columns for Total Cost and Price
To create the two new columns for the total cost and total price, we'll use a Derived Column transform. This transform enables you to specify a name for the new column, define an expression for the value of the column, and to set the data type. Drag a derived column transform onto the data flow panel and give it a name like Compute cost and price. Double-click the Derived Column transform to open it for editing. Enter a new column name for the derived column, such as TotalCost. Provide the formula for the column. In our example, drag Budget Units from the available columns into the Expression field, enter an asterisk (*), and then drag in Product Cost. Check that the data type is correct. It should be Currency. You can create multiple new columns in one Derived Column transform, so just repeat this procedure for BudgetRevenue. Figure 4-12 shows how you would configure a derived column transform to create the new columns. Figure 4-12. Adding new columns with a Derived Column transform
Saving the Budget into the Data Warehouse
The output of the Compute cost and price transform is now ready to be saved to the data warehouse. This output contains all the columns we've defined. You have done something similar when you saved the SalesTerritory dimension into the data warehouse. All you need to do is drop an OLE DB Destination onto the data flow, drag the green arrow from the Derived transform onto the destination, and map the columns from the stream onto the destination table. Your data flow for budget loading should now look like Figure 4-13. Figure 4-13. Final budget data flow
Loading Multiple Sets of Data
In our example, we have individual budget spreadsheets submitted from each region, so there will many spreadsheets and we don't want to have to change our package to specify the name of each spreadsheet. You can use the For Each Loop Container task to enumerate the file names in a directory that match a specific pattern. If we establish a convention that all our budgets will have filenames that start with Budget and have .XLS extensions, and we put them in a fixed directory, we can easily load every budget regardless of any other distinguishing parts of the name. For example, the name for the Canadian budget for fiscal year 2006 would be Budget-CAD-2006.XLS. This fits the pattern and distinguishes it from the Southwest budget. Using Variables in a Package
The For Each loop gives us back a list of filenames; but how do we work with each filename, one at a time? We need a way to take one filename off the list and pass it to a data source so that it works on a file from the list, not the one we coded into the package at design time. This is what variables are for. Variables are objects you can store values in for use in calculations, expressions, and passing information between tasks and other packages. Variables have scope. Scope defines how widely known the variable will be. Package scope means that the variable will be known to every task in the package, and also inside every data flow invoked by the package. If the Data Flow tab had been selected rather than the Control Flow tab, the variable would have had the scope only of the Data Flow task and would not be visible to the For Each task. This prevents accidental name collisions, but can also cause you a bit of confusion when you try to find the variable if you have accidentally created a variable with the wrong scope.
|