
SQL Server Integration Services is the tool we will use to bring in our dimension data and our sales and budget facts from the source systems. Integration Services will also be used to transform the data into the shape we want before it is loaded into our data warehouse.

At a high level, all you want to do is copy some data from one or more data sources to the data warehouse. However, we are at the point now where you need to specify in detail just how to do that, and there is more to this than simply copying the data.

The sales and budget facts depend on the dimension data, so we must load the dimensions first. For simplicity, we have elected to reload all the data in our dimension and fact tables rather than trying to determine what has changed. Our data volume is small enough that we can get away with this. We address how you can load only the changes to the dimensions and facts in Chapter 8.

We need to accomplish several things in our ETL process. We need to load the dimensions, and then the sales facts can be loaded. Annual forecasts are prepared toward the end of each year, and these facts are loaded independently of the other dimensions or facts. When the data loading is completed, we need to process the dimensions and cubes in Analysis Services.

Getting Started with Integration Services

We are going to create an Integration Services project to load our dimensions and facts into the data warehouse. The sequence of tasks and transforms you define in Integration Services is saved in a package, which can reside in SQL Server or on the file system. We like to create modular packages that each have a single purpose, instead of creating very large packages that do everything. Later, we will add a package for loading the sales data, and another to load the budgets. We strongly encourage the use of integrated source control, such as Visual SourceSafe 2005 or Visual Studio Team System, to prevent multiple developers from inadvertently modifying the same package.

Data Sources and Destinations

An Integration Services project usually needs one or more sources of data and a destination for the transformed data. In our project, we want to populate our data warehouse from a number of data sources: the dimension data and the sales from our online transaction processing (OLTP) systems, and the budgets from spreadsheets. Integration Services uses connection managers to connect data sources and destinations to the tasks and transforms in a package. A connection manager specifies a database in an instance of SQL Server or another relational database such as Oracle or DB2, a file, a spreadsheet, or one of several other data sources. We find it useful to define these as soon as we create a package.

We'd also like to clarify a potential area of confusion about Data Source objects. These are not connection managers, but you can create a connection manager from a Data Source object. The advantage of a Data Source object is that it is visible to all packages in a project, and any connection managers based on it are synchronized: if you change where the Data Source object points, all connection managers based on it will change to the new location. Note that this applies only at design time. It is the connection manager that is exposed for configuration at runtime, as discussed later in the section on deployment.

You don't need to have all your data in one source, or even one type of source; it can come from any number of different sources. If you don't see a data provider for your particular data source, there is usually one available from third parties. A quick search for "ole db providers" using your favorite search engine should locate what you need.

For our customer, we'll need a connection to the data warehouse on SQL Server, a connection to each of the sales databases, and a connection to the file location where the Excel budget data is kept.

If you are looking down the road to deploying your solution to another environment, rest easy. You can dynamically change where a connection manager points to at runtime by assigning a variable to the appropriate property of the connection manager, such as the ServerName property. You can also set up an external runtime configuration that will set any property of a connection manager. We'll look at how to do that in a later section on deployment.

Quick Start: Setting Up Connection Managers

We're assuming at this point that you have set up your data warehouse in the previous chapter and that your source data is accessible via the network from the machine running SQL Server 2005. Our example uses data from a SQL Server database:

1.

In the BI Development Studio, select New Project from the File menu, and choose Integration Services Project from the Business Intelligence Projects subfolder.

2.

Name the project Manufacturing and click OK.

3.

Create a new data source by right-clicking Data Sources in the Solution Explorer and selecting New Data Source.

4.

Skip over the welcome page. Click New on the How to Define the Connection page.

5.

On the Connection Manager form, choose the provider (Native OLE DB\SQL Native Client for our example).

6.

Type or select the server name. Remember to specify the \Instance name if you are referring to a named instance of SQL Server.

7.

Choose the method to log on to the server. (We prefer Windows authentication.)

8.

Select a database to connect to that contains one or more of the source tables for the dimensions and click OK. Complete the wizard, naming the data source SalesTracking.

9.

Repeat this process to create a data source for the data warehouse. Note that in Step 8, you will choose the data warehouse database, rather than the source database, and rename it to DataWarehouse.

10.

At the bottom of the Control Flow tab, right-click in the Connection Managers area, select New Connection from Data Source, and choose the Data Sources you just created, so that the package has both a SalesTracking and a DataWarehouse connection manager.

Now we are ready to start defining our ETL process.

Tip: Be Consistent in Naming Connection Managers

When it comes time to deploy your solution from development to other environments, you will want to redirect your data sources and destinations. The configuration for each connection manager is usually stored by name in a SQL Server table or an XML file. If you use the same name in each Integration Services package for connection managers that refer to the same database, you will only have to change a few rows in a configuration table or an attribute in a few XML configuration files. Note that connection manager names are case sensitive.
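For example, suppose you store your package configurations in SQL Server and every package uses a connection manager named SalesTracking. Redirecting all of those packages at a new server can then be a single update. The following is only a sketch: it assumes the default [SSIS Configurations] table name, a configuration filter named SalesTracking, and a hypothetical server name, so adjust the names to match your own configuration setup.

UPDATE dbo.[SSIS Configurations]
SET ConfiguredValue = 'NewSalesServer'          -- hypothetical new server name
WHERE ConfigurationFilter = 'SalesTracking'     -- assumes the filter is named after the connection manager
  AND PackagePath LIKE '%ServerName%';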

Loading the Dimensions

We are going to create one package for each dimension we want to load. All these packages will be part of the project we just created.

Integration Services explicitly separates the design of the process flow from the design of the data flow. This separation simplifies the creation of ETL processes, and makes the process you implement much more understandable to others who might have to maintain the application. You first define the sequence of events that need to take place in order to populate the data warehouse. This is called the Control Flow. Separately, you define how each table in the data warehouse is populated from the data sources and what transformations the data must go through before it is ready for the data warehouse. This is called the Data Flow. Each package has one control flow. You can have multiple data flows in a control flow, and you will have a data flow for each entity you are populating in the data warehouse.

Defining the Control Flow

We'll start by loading the Sales Territory dimension table. In this section, we create an Integration Services package you can use as a model for loading dimension tables from any source. We hasten to point out that we will deal with what are called slowly changing dimensions (SCDs) in a separate chapter because of the extra complexity needed to handle them. Right now, we're just going to ease into loading what we will assume to be static dimensions. This is appropriate for simple dimensions such as status, condition, or gender. It is also appropriate for situations where you don't care about the historical values of a dimension and will be reloading all your data when you update the data warehouse, or for when you are just working on a proof of concept. In our experience, most applications have an implicit requirement to maintain history, so you should plan on accommodating this in your designs.

Because our control flow for loading a dimension consists of just copying data from a source to a destination, simply drag a Data Flow task from the Toolbox onto the surface of the Control Flow tab, as shown in Figure 4-4. Click the label to rename the Data Flow task to Load Sales Territories.

Figure 4-4. Initial control flow

Finally, you are ready to build the piece of the puzzle that you originally set out to do: moving some data into the data warehouse.

Defining the Data Flow

Data flow processing is initiated when a Data Flow task is executed in a control flow. In a Data Flow task, for one entity in the data warehouse, you define the flow of data from source to destination and the transforms needed along the way. The other streams required to populate other tables in the data warehouse will be handled by other Data Flow tasks. You can apply as many transformations as needed to prepare the data for the data warehouse. In this initial example, we just copy a subset of the columns from the source to the data warehouse.

Quick Start: Configuring a Simple Data Flow

Now we'll continue on with designing a package to load a dimension. In this example, we copy a subset of the columns of the Sales Territory table from our OLTP source system into a table in our data warehouse.

1.

Drag a Data Flow task from the Toolbox onto the surface of the Control Flow tab, as shown in Figure 4-4. Click the label to rename the Data Flow task to Load Sales Territories.

2.

Double-click the new Data Flow task to open the data flow design surface. Drag an OLE DB Source from the Toolbox onto the pane and change the name to Sales Territory.

3.

Drag an OLE DB Destination from the Toolbox onto the pane and change the name to DW Sales Territory.

4.

Click Sales Territory and drag the green arrow onto DW Sales Territory. You can't configure the data destination properly until it has data flowing into it. You will notice in the Sales Territory data flow source that there is a small red circle with an X in it. This is because it is missing some information. You can hover the mouse over the X to see what the problem is. We need to tell it where to find the source data, which table it is in, and which columns we want to copy.

5.

Double-click the Sales Territory OLE DB Source to open the OLE DB Source Editor.

6.

Set the OLE DB Connection Manager setting to SalesTracking.

7.

Set the data access mode to Table or view, and select [Sales].[SalesTerritory] from the list of tables. Click Columns.

8.

We only want to copy TerritoryID, Name, CountryRegionCode, and Group, so check those columns. Uncheck the others.

9.

Double-click the DW Sales Territory OLE DB Destination to open the OLE DB Destination Editor.

10.

Set the Connection Manager to DataWarehouse.

11.

Select [dbo].[DimSalesTerritory] from the list of tables.

12.

Click Mappings.

13.

Drag and drop each input column onto the corresponding destination column. The TerritoryID input column is mapped to the business key, not to the SalesTerritoryKey. SalesTerritoryKey is an internally generated surrogate key. (The editor does try to map the columns based on the name, but none of our names match. If there are any automatic mappings, it is a good idea to review them.)

14.

Click OK to save the configuration. This would be a good time to save the whole solution, too.

Your data flow should now look like Figure 4-5.

Figure 4-5. Basic data flow

Tip: SQL Server Destinations Only Work on a Local Server

We did not choose the SQL Server Destination in the Quick Start because it will only work when the database is on the same server that the package is running on. If you know you will always be running on the same server, you should definitely choose the SQL Server Destination, because it is faster than the OLE DB Destination.

Testing Your Package

To try out your package, simply click the green Play button (or press F5, or choose Debug/Start). The data flow between the source and destination will show you how many rows were sent to the destination. You should see the task turn green when it has completed. The package is still in debug mode. To continue to work with the package, you need to stop debugging by pressing Shift+F5 or choosing Debug/Stop debugging.

You're so excited that it actually worked, you call the boss over to see it go, and press Play again. After what seems to be an interminable delay, the source task turns green, but the destination turns red. It didn't work the second time. You claim it is just the demonstration effect, but deep down you know something is missing.

Where do you look to see what went wrong? Click the tab labeled Progress to see a detailed list of the results of each step, as shown in Figure 4-6. Scroll down to find the row with an exclamation mark and hover the mouse over the line to view the entire error message. The message autohides after a short time. You can right-click it, copy it to the clipboard, and paste it into Notepad to make it easier to read the message.

Figure 4-6. Examining the progress log

In the message, you'll see that there is a violation of a unique key constraint. The business key is required to be unique, and that's a good thing. We just tried to load the same data twice. One of the constraints prevented an accidental duplication of the dimensional data, but we need to be able to run this task more than once! We need to change the control flow so that the table is emptied before we reload.

Constraints and the Data Warehouse

We strongly encourage you to implement constraints (such as uniqueness) at the beginning of the project. This is when the system is most exposed to data errors, so why not catch them early? A constraint detected a problem very early in the example we just walked through. Without constraints, you can spend a considerable amount of time wondering why the answers are wrong when it's really the data in your warehouse that's wrong. How to validate the data is discussed further in Chapter 7, "Data Quality."
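As a minimal sketch, a uniqueness constraint on the business key of the Sales Territory dimension might look like the following. The column name SalesTerritoryBusinessKey is an assumption; use whatever column in DimSalesTerritory you map TerritoryID to.

ALTER TABLE dbo.DimSalesTerritory
ADD CONSTRAINT UQ_DimSalesTerritory_BusinessKey UNIQUE (SalesTerritoryBusinessKey);   -- assumed business key column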

Making Your Package Repeatable

The problem with the task we just created is that the data in the destination is not overwritten. Instead, we are inserting additional data, even if it's the same data. A simple fix is to delete the data at the destination just before we insert the new data. All we need to do is issue a TRUNCATE TABLE statement, which we can do by modifying the control flow to add a simple Execute SQL task before the Load Sales Territories task.

Quick Start: Working with Execute SQL Tasks

In this example, we truncate the destination table by adding a SQL statement to the control flow. This will prevent the same rows from being added twice to the table:

1.

Drag an Execute SQL task onto the Control Flow tab, above the Data Flow task.

2.

Rename this task Empty Sales Territories.

3.

Double-click the Empty Sales Territories task on the Control Flow tab to bring up the Execute SQL Task Editor.

4.

On the General tab of the editor, verify that the ResultSet property is set to None, the ConnectionType is OLE DB, and the SQLSourceType is Direct input.

5.

Click in the Connection property and choose the DataWarehouse connection manager from the list.

6.

In the SQL Statement property, enter the following SQL command: TRUNCATE TABLE dbo.DimSalesTerritory

7.

Click Parse Query to check for syntax errors. Note that this does not check to see whether the tables and columns exist! Your query may still fail at runtime if you made a spelling mistake in the table name.

8.

Click OK to save the definition of this task.

9.

Set the execution order of the two tasks by dragging the green arrow from the Empty Sales Territories task onto the Load Sales Territories task. This precedence constraint causes the Load Sales Territories task to wait for the successful completion of the Empty Sales Territories task before beginning execution. If you need different behavior, you can right-click the green arrow and change the condition so that the next task starts on failure of the previous task, or on any completion status.

Your new control flow will now look like Figure 4-7.

Figure 4-7. Final control flow including Execute SQL task

Now you can bring back the boss and click Play with confidence. The table will be emptied first, and then repopulated with the current contents of the source, so there will be no problem rerunning this task.

Inspecting the Data Flow

Sometimes your transforms don't work as you expected, and the problem is more complex than forgetting to empty a table. If you want to see what data is flowing between two data flow components, you can add a Data Viewer to monitor the output of a task. Simply right-click the flow line you want to examine, and click Add. Choose Grid to show a table of the rows being passed between the data flow components.

The Data Viewer shows you a batch of rows, not a continuous stream of data. The size of the batch depends on the size of the rows. The data flow pauses at the end of a batch, allowing you to inspect the data in that batch. Press the Play button (green arrow) to continue.

Completing the Dimensions

You've now got a general idea of how a simple data flow is built, so you can go ahead and build packages for loading the rest of the dimensions. Use the same Integration Services project for these dimensions; just create a new package by right-clicking the Packages node in the Solution Explorer pane. After you have the dimensions in place, you can load the facts.

Loading the Fact Table

Unlike our dimension tables, the source records for the Shipments facts come from more than one table. The sales data is in two major tables: SalesOrderHeader and SalesOrderDetail. We also need to include the ProductCostHistory table so that we know how much the items cost when they were shipped. The header contains the business keys for salesperson, territory, the dates, and the customer. The detail table contains the business keys for product and special offer, as well as the measures for quantity shipped, unit price, and discount. To answer questions such as "Where do we ship each of our products?" and "What was our cost for the products we shipped?", we need columns from all three source tables in our Shipments fact table.

The other transformation that we need to perform is to translate the business keys in the incoming source records to the corresponding surrogate keys used by the dimensions.

Working with Multiple Source Tables

You have several choices about how to create this denormalized view of your shipments that joins the three source tables together. You could create a view in the original data source and query that view directly through the data source in your data flow. Often, however, you can't make any changes or additions to the source database, so you need an alternative method for loading this data. You can either load each table independently and denormalize later with a view, or you can use a SQL query in your data source that joins the tables you need. For demonstration purposes, we're going to use a SQL query over multiple tables to load a denormalized table.

In our previous example, where we loaded a dimension, the OLE DB data source's Data Access Mode property was set to Table or view. To specify a query that we build ourselves, you need to set the Data Access Mode to SQL Command. If you click the Query Builder button, you will see a fairly familiar graphical query designer. Use the Add Table icon (a plus sign over a grid) to select the three tables we need (SalesOrderHeader, SalesOrderDetail, and ProductCostHistory), and create the joins by dragging SalesOrderID from the header to the detail table, and ProductID from the detail to the cost history table. Finally, qualify which time frame in ProductCostHistory we are referring to by adding a WHERE clause:

WHERE OrderDate >= ProductCostHistory.StartDate and OrderDate < ProductCostHistory.EndDate

Check off the columns we want to use. (We won't talk you through checking off each one; you can decide what you need.) Now you can click OK twice, and you have a multi-table data stream in your data flow (see Figure 4-8).

Figure 4-8. Creating a multiple-table query in a data source
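The query the designer builds will look roughly like the following sketch. The column list and schema names are assumptions based on AdventureWorks-style sources; substitute whichever columns you checked off and wherever ProductCostHistory lives in your database.

SELECT h.OrderDate, h.SalesPersonID, h.TerritoryID, h.CustomerID,
       d.ProductID, d.SpecialOfferID, d.OrderQty, d.UnitPrice, d.UnitPriceDiscount,
       pch.StandardCost
FROM Sales.SalesOrderHeader AS h
JOIN Sales.SalesOrderDetail AS d ON d.SalesOrderID = h.SalesOrderID
JOIN Production.ProductCostHistory AS pch ON pch.ProductID = d.ProductID
WHERE h.OrderDate >= pch.StartDate
  AND h.OrderDate < pch.EndDate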

Looking Up the Dimension Keys

The next step is to translate the business keys in the data stream from our data source into the surrogate keys used by the dimension tables in the data warehouse. The process is simple: using the business key, look up the corresponding row in the dimension table. That sounds like a join, so why not use a join? One reason is that we want to avoid hardcoding references to databases, possibly on other servers, into a query; you would not be able to redirect the package to another environment without editing it, which only invites maintenance headaches. Another reason is to ensure data quality by confirming that each incoming fact has a matching dimension member, which we discuss in Chapter 7.

Integration Services has a Lookup transform that translates business keys into surrogate keys and meets both criteria: it allows the dimension table to be in a separate database from the source fact table, and it assists us with data quality. You use one Lookup transform for each business key you are translating, so you will end up with a series of lookups in the data flow of each package that populates a fact table. For each Lookup transform, we need to identify which business key we want to translate into a surrogate key, which database and table contain the dimension table that maps these two keys, and finally which column in the dimension table is the surrogate key we want to return to the data stream.
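Conceptually, for each incoming row a Lookup transform does the equivalent of the following query against the dimension table, returning the surrogate key into the data flow. The business key column name here is an assumption; SalesTerritoryKey is the surrogate key mentioned earlier.

SELECT SalesTerritoryKey                      -- surrogate key added to the data flow
FROM dbo.DimSalesTerritory
WHERE SalesTerritoryBusinessKey = ?           -- business key value from the incoming row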

To configure a Lookup transform to translate business keys into surrogate keys, drag a Lookup transform from the Toolbox onto the data flow surface, and connect the source data stream to it. Rename it to something descriptive such as Lookup Product. Double-click the transform to edit it. You need to specify a connection manager that points to the database containing the reference table, which is one of our dimension tables in the data warehouse. Next, select the reference table from the Use a table or view list.

On the Columns tab, specify which columns in the data flow are to be used to uniquely specify a row in the reference table. Just drag one or more columns from the available input columns onto the Available Lookup Columns in the reference table. This is shown in Figure 4-9.

Figure 4-9. Configuring a Lookup transform

Finally, check one or more columns from the Available Lookup Columns that you want to add to the data flow. This is usually the surrogate key column in the dimension table.

A word of caution applies to Lookup transforms: the matching of the join columns is case sensitive, unlike a join in SQL Server under the default collation. You might need to create a derived column that sets the case of the business keys to a known state, for example, all uppercase.
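If one of your business keys is character data, the same normalization has to happen on the reference side as well. One option, sketched below under the assumption that the Lookup is configured to use the results of an SQL query and that the dimension is a hypothetical DimProduct table, is to uppercase the key in the reference query while a Derived Column transform uppercases it in the data flow.

SELECT UPPER(ProductBusinessKey) AS ProductBusinessKey,   -- assumed business key column
       ProductKey                                         -- assumed surrogate key column
FROM dbo.DimProduct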

Add a Lookup transform for each of the other dimensions using the same pattern we used for the Product dimension.

To complete the data flow, drag an OLE DB Destination from the Toolbox onto the data flow surface. Rename this destination Sales Data. Reuse the DataWarehouse connection manager and set the table to your sales data table. (If you haven't created this table in the data warehouse yet, you can easily do so by clicking the New button beside the table field.)

You should end up with a data flow that looks like Figure 4-10.

Figure 4-10. Fact table data flow

Loading the Budget Information from Excel

The previous examples loaded data from relational tables. Now, let's look at loading data from an Excel spreadsheet. We want to be able to import the budget figures into the data warehouse so that we can compare them with the actual results.

The company creates sales forecasts for each of the four quarters in a year, for each region and product. These forecasts are at the product level because they are used for plant capacity planning as well as revenue forecasts.

Budgeting will need its own package. You add a new package to a project by right-clicking Packages in the Solution Explorer and choosing New Package. Rename the package Load Budgets. You will follow the same pattern of adding a Data Flow task to the control flow, along with any other tasks you need, such as one to empty the budget table in the data warehouse.

For the data flow, you start with an Excel data source. When you edit the data source and click New to create the connection manager, you will see that the dialog is different from the one for a database source. You browse to the Excel file containing your budget, and then specify which sheet in the workbook you want to use.

Excel Layouts for Importing Data

Users like to see some things on a spreadsheet that don't work well when you are trying to import data. For spreadsheets that we will be importing into the data warehouse, we allow only a single row of headings, because the Excel connection cannot handle more than one.

Empty rows left at the bottom of the spreadsheet are another thing that can cause problems: they will appear as rows with null values. To get around this, we can add a filter such as "where Region is not null" in the Excel data source. To implement this, we need to switch the data access mode from Table or view to SQL Command, and build a SQL query by clicking Build Query.
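The resulting SQL command is quite simple. As a sketch, assuming the budget figures are on a worksheet named Budget (the Excel provider exposes each worksheet as a table named [SheetName$]), it would look something like this:

SELECT *
FROM [Budget$]
WHERE Region IS NOT NULL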

Matching Source and Destination Data Types

One important thing you need to do is set the true data types of each column. Excel chooses a type for each column based on an examination of the first eight rows, and that choice doesn't always match the data warehouse requirements. For example, numeric columns come in as float, whereas general text columns come in as nvarchar(255). We recommend using a Data Conversion transform right after the Excel data source to set the data types to what you really need.

Reshaping the Data

Fact tables, such as the budget forecasts we are working with, need to have exactly one occurrence of each measure per row, along with one member of each dimension. This isn't always the way people like to see the information presented to them on a spreadsheet.

Looking back at the spreadsheet we are importing, shown in Figure 4-2, you see that the rows aren't normalized. Four forecasts for each region and product category repeat on each row, and the time dimension is not a value in the row; it is a column name. We really need the forecast on one row for each quarter. This will allow us to create a cube that compares the forecasts with the actual revenue. Fortunately, we can use an Unpivot transform to transpose these repeating fields onto rows of their own. In our case, one row from the spreadsheet will become four rows after the Unpivot transform.
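If it helps to see the reshaping in familiar SQL terms, the following is a rough T-SQL equivalent of what the Unpivot transform will do. It assumes the spreadsheet rows had been staged in a hypothetical table with columns Region, ProductCode, and Q1 through Q4; the package itself works directly on the Excel stream, and its Pivot Values map the quarters to 1 through 4, as described in the Quick Start below.

SELECT Region, ProductCode, Quarter, Units
FROM dbo.BudgetStaging
UNPIVOT (Units FOR Quarter IN (Q1, Q2, Q3, Q4)) AS unpvt;   -- Quarter takes the values 'Q1' through 'Q4'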

Quick Start: Working with the Unpivot Transformation

To use the Unpivot transform, you need to specify which columns should be repeated on each row (passed through) and which columns are the repeated columns of the row and should generate a row for each of the columns. You also need to provide a value for a new column that denotes which column is on the new row:

1.

Drag an Unpivot transform onto the data flow pane, below the Excel source.

2.

Connect the Excel Source to the Unpivot transform.

3.

Double-click the Unpivot transform to open the editor to set the configuration.

4.

Check Q1, Q2, Q3, and Q4 as the columns to unpivot. These names are also used as the default Pivot Values, an additional output column value that identifies which source column a new row represents. Think of the Pivot Value as an extra component in the compound key for the row. We change these values later in Step 7.

5.

Uncheck the Description column from the Pass Through check box. We don't need the description, because we have the product code; it was just there to help the users understand which product they were budgeting for. Columns Q1 to Q4 are automatically passed through to the output stream, on separate rows.

6.

Set the Destination Column for each row to Units. This is the name of the new column that the value in Q1 to Q4 will be placed in. You could use different names for each column, but that isn't appropriate for normalizing this spreadsheet.

7.

Set the Pivot Value for each column to 1, 2, 3, and 4, corresponding to Q1 through Q4.

8.

Set the Pivot Value Column Name to Quarter. This column will take on the Pivot Values corresponding to columns Q1 to Q4 we set in Step 7, based on the column that is on the current row.

9.

Click OK to save the configuration.

In the Unpivot transformation, we created a new column for Quarter. Whenever a transform creates a new column, check that the column has a data type compatible with the column it is mapped to in the next step in the data flow. You can check and set the data type of a new column created in a transform by using the Input and Output Properties tab of the Advanced Editor for that transform. Start the Advanced Editor by right-clicking the transform and choosing Show Advanced Editor. The Unpivot transform configuration should look like Figure 4-11.

Figure 4-11. Transposing column values to rows

We have a couple of small tasks remaining before we can save the data into the warehouse. We need a Lookup transform to find the territory surrogate key for the region in the budget, and another Lookup transform to find the product record, which gives us the product surrogate key, unit price, and unit cost. Finally, we want to compute the total price and cost for this product for this region and quarter.

Adding New Columns for Total Cost and Price

To create the two new columns for the total cost and total price, we'll use a Derived Column transform. This transform enables you to specify a name for the new column, define an expression for its value, and set the data type. Drag a Derived Column transform onto the data flow panel and give it a name such as Compute cost and price.

Double-click the Derived Column transform to open it for editing. Enter a new column name for the derived column, such as TotalCost.

Provide the formula for the column. In our example, drag Budget Units from the available columns into the Expression field, enter an asterisk (*), and then drag in Product Cost. Check that the data type is correct. It should be Currency. You can create multiple new columns in one Derived Column transform, so just repeat this procedure for BudgetRevenue. Figure 4-12 shows how you would configure a derived column transform to create the new columns.

Figure 4-12. Adding new columns with a Derived Column transform

Saving the Budget into the Data Warehouse

The output of the Compute cost and price transform is now ready to be saved to the data warehouse. This output contains all the columns we've defined. You did something similar when you saved the SalesTerritory dimension into the data warehouse: drop an OLE DB Destination onto the data flow, drag the green arrow from the Derived Column transform onto the destination, and map the columns from the stream onto the destination table. Your data flow for budget loading should now look like Figure 4-13.

Figure 4-13. Final budget data flow

Loading Multiple Sets of Data

In our example, we have individual budget spreadsheets submitted from each region, so there will be many spreadsheets, and we don't want to have to change our package to specify the name of each one. You can use the For Each Loop Container task to enumerate the filenames in a directory that match a specific pattern. If we establish a convention that all our budgets have filenames that start with Budget and have .XLS extensions, and we put them in a fixed directory, we can easily load every budget regardless of any other distinguishing parts of the name. For example, the name for the Canadian budget for fiscal year 2006 would be Budget-CAD-2006.XLS. This fits the pattern and distinguishes it from the Southwest budget.

Using Variables in a Package

The For Each loop gives us back a list of filenames; but how do we work with each filename, one at a time? We need a way to take one filename off the list and pass it to a data source so that it works on a file from the list, not the one we coded into the package at design time. This is what variables are for.

Variables are objects you can store values in for use in calculations, in expressions, and for passing information between tasks and other packages. Variables have scope, which defines how widely known the variable is. Package scope means that the variable is known to every task in the package, and also inside every data flow invoked by the package. If the Data Flow tab had been selected rather than the Control Flow tab when the variable was created, the variable would have the scope of only that Data Flow task and would not be visible to the For Each task. Scoping prevents accidental name collisions, but it can also cause a bit of confusion when you try to find a variable that you accidentally created with the wrong scope.

Quick Start: Processing a Set of Files

1.

Go to the Control Flow tab, select Variables from the Integration Services menu to show the Variables window, and click the Add Variable icon. Give the variable a name, such as BudgetFileName. Select String as the data type.

2.

You want to specify an initial Value for this variable that points to a valid Excel spreadsheet so that you can work with the data flow in design mode. Without a value, the Excel data source will complain; so, specify a full path such as C:\Budget-CAD-2006.xls.

3.

Drag and drop a new For Each Loop Container onto the Control Flow tab, and double-click it to configure the properties.

4.

On the Collection section, specify the folder name that contains your budget spreadsheets and the filenames to search for (such as Budget*.xls).

5.

On the Variable Mappings section, select the variable created in Step 1 (BudgetFileName) from the list and specify 0 as the Index to map. As the For Each loop iterates through the filenames, each will be assigned one at a time to the variable. Click OK.

6.

Drag your existing data flow task on top of the For Each Loop Container. If the Data Flow task has a preceding task, such as one to empty the table, you must disconnect the two tasks and make the connection to the For Each Loop Container. This will let Integration Services know that the data flow needs to be executed for every file that is retrieved in the loop.

7.

You need to configure the Excel connection to pick up the filename from the variable. Right-click the Excel Connection Manager in the Connection Managers area at the bottom of the package designer and open the Properties window. Click the ellipsis button (...) next to the Expressions property.

8.

Select the ExcelFilePath property and drag the variable BudgetFileName into the expression area. Click OK.

Your control flow for loading the budgets should now look like what is shown in Figure 4-14.

Figure 4-14. Control flow for iterating budget files
