Practical Business Intelligence with SQL Server 2005
We'll start by looking at what changes you need to make to your cube design for a real-time solution and work our way back to how we load data through Integration Services. Note that the basis of this solution, proactive caching, is only available in the Enterprise Edition of SQL Server 2005. Cube Design for Real Time
In all the other chapters, we have implicitly or explicitly urged you to use MOLAP storage for your cubes and dimensions because of MOLAP's vastly superior performance over ROLAP. Remember that ROLAP mode simply references the relational database to answer queries and has no storage optimizations for the kinds of queries that are common in OLAP queries. However, you have to process a cube to load the relational data into MOLAP storage, and that step adds latency to making new data available to your users. With ROLAP storage, there is really no need to process the cube to make the data available, which is exactly what we are looking for in a real-time solution: present whatever data is available right now. So why not use ROLAP all the time? Because we can't afford the query performance hit if we have a large amount of historical data and complex dimensions, and this is why we recommend MOLAP. Wait, this is where we started.... What we really want is the best of both worlds. MOLAP performance for all the existing data and low-latency ROLAP for newly arrived data. Fortunately, we can now design a system that approaches these characteristics. With Analysis Services 2005, you have the services of proactive caching available to you, which when combined with a good partitioning scheme integrates the performance benefits of MOLAP aggregations with the low latency of ROLAP. To enable proactive caching, open the designer for the cube. Select the Partitions tab. You probably have only one partition right now, but if you have more than one, choose the one that stores the "current" data. Click Storage Settings, be sure that the Standard Settings button is selected, and move the slider left, from MOLAP to Automatic MOLAP (see Figure 12-3). Save these settings. Congratulations, you're done! Go browse your cube and note the value of one of the measures, add a row to your fact table, wait 10 seconds, and then refresh the display. The new data really is there, and you didn't even have to process the cube. Figure 12-3. Enabling proactive caching
Now that was too easy, you say, and there must be a catch. Well, there are some considerations, but no real catches. First, it's important to understand how proactive caching really works, and then you will have a better idea of the reasons for these considerations, and why there are not any big catches. Overview of Proactive Caching
Let's take a closer look at what we just did with proactive caching. Proactive caching sits between the user and the data they are after. In a static situation where MOLAP storage is used, a request for data is satisfied by the cache, which obtains all the data from the MOLAP store. Simple enough. When new data arrives in the relational store, the MOLAP cache is out of sync with the relational store. What are our options for dealing with the next request for data? We could simply continue to serve up data from the cache and in the background begin rebuilding the MOLAP store. The data would be out of date until the reprocessing was finished, but we would still be responding to requests quickly because we are using the cache. How stale the data becomes is a function of how long it takes to reprocess the data. If we need to have the most up-to-date view of the data, we could have the cache immediately invalidated when new data is received. Now our only source of data is the relational database, which means we are in ROLAP mode. This is the effect you get if you choose real-time HOLAP for the partition storage settings. ROLAP can be extremely slow for some queries. We could do something in between these two extremes. We could continue to use the cache for a while, but if processing took too long, we could then invalidate the cache and use ROLAP mode. If we can reprocess the data before the data is too stale, we avoid having to go to the poorer-performing ROLAP mode. Tuning Proactive Caching
You can configure proactive caching so that you recognize and wait for the completion of a steady stream of input, and at the same time don't exceed your maximum latency requirements. Consider the following scenario. When new data arrives in the data warehouse, the sooner we start processing it, the sooner it will be available. So, what's holding us back from processing this data? Well, what happens if more data arrives soon after we started processing? We have to start again, and that means we have to throw away the cycles we just consumed. Remember that we are not processing just the new data, but all the data in the partition. This could be your entire set! If we wait just a little while to see whether any more data is going to arrive, we could process it all at the same time. But how long should you wait? Avoiding Too Much Reprocessing
Suppose that a batch upload to the data warehouse is occurring. There will be little time between the arrival of each row, and we certainly wouldn't want to be restarting the processing after each row. So why don't we wait, say, ten seconds after a row arrives, and if nothing else arrives, then we can start processing. If another row does arrive, we start waiting for ten seconds again. The time we wait to see whether the activity has quieted down is called the silence interval. If no data arrives for the duration set by the silence interval, and there is new data, processing begins. You can set the value of the silence interval to any value from zero upward. Ten seconds is common. (That's why you had to wait 10 seconds before refreshing your browser when we first started looking at proactive caching.) If you set the silence interval to zero, processing starts immediately when data is received but might be interrupted by the arrival of more data. If you set it to a very large value, processing does not start at all until that time is up, and you might get excessively stale data. However, we do have a solution to the problem of false starts of the reprocessing task, and you can tune this to your situation. But what happens if data continues to trickle in? Dealing with a Continuous Backlog of New Data
If you have set the silence interval to ten seconds, on a bad day the data will arrive one row every nine seconds, and nothing will ever get processed because the silence interval criteria is never met. After a while, we need to be able to just say "enough!" and start processing. This is where the silence interval override value comes in. The silence interval override timer starts as soon as the first new data arrives after processing. Even if data trickles in just under the silence interval value, when the override period is exceeded, processing starts regardless of whether the silence interval requirement has been satisfied. You set the silence interval override to a value that makes sense for your requirements (of course you would!). By that we mean if you have batch updates occurring every hour, and they take about 10 minutes, you might set the silence override interval to 20 minutes. This means that there will be sufficient time to complete a batch update before processing starts, and as soon as the batch is finished, the silence interval will likely kick in and start processing ten seconds after the batch completes. If you have an exceptionally long batch, or the trickle updates are still defeating the silence interval, processing starts anyway. Okay, we are doing well. We have solved the problem of being so swamped with new data that it never gets processed. But what if processing takes too long? Our data will be too stale to meet our business requirements. In a worst-case scenario, we have data that is 20 minutes old (the silence override interval) plus the time it is going to take to process the partition. Guaranteeing Data Currency
So far, we have learned about optimizing when we reprocess an updated partition. We didn't want to waste resources if we were just going to have to restart with the arrival of new data. However, the user is still being given data from the MOLAP cache, and the cache is getting staler the longer we wait for processing to finish. If you have a requirement to present users with data that meets an "age" requirement, your only choice is to abandon the cache at that age. When the cache is not being used for data, queries are satisfied directly from the relational database (that is, by using ROLAP mode). Performance might not be good, but at least the data will be fresh. You control when the cache is dropped by setting a time for the drop outdated cache property. The timer for this starts as soon as the first new data arrives, just as it does for the silence interval override. If you cannot afford to have data more than one hour old, set this value to 60 minutes. This setting guarantees that regardless of the arrival of any volume of new data, or how long it takes to process it, the data delivered to a user will meet this "freshness" criteria. You can set this value anywhere from zero up. If you set it to zero, the cache is invalidated immediately, and new data is available immediately. This might put quite a load on the server because most queries will now be resolved using ROLAP mode. Unless there is a long quiet period, data trickling in may keep the cache invalid most of the time, and you have traded performance for very low latency. Minimizing the Cost of Reprocessing
In all the tuning parameters we have discussed, we were always confronted with the time required to reprocess a partition. If you have a lot of data, it's going to take a while to process it. However, we've already processed most of it; why bother to process it again? It's because the unit of processing is a partition, and so far we've only set up one partition containing all our data. Performance will deteriorate over time, depending on how much data is added over time. What we want to do is to put as much of the "old" data as possible in a partition by itself and leave it alone. We don't enable proactive caching for this partition, so it will not be reprocessed unless we manually initiate it. In another partition, we can put the most recent, or "current" data. We would enable proactive caching only on the "current" partition, which we expect to be substantially smaller and so require much less time to process. We would merge the current partition into the historical partition when there's enough data in it that performance is affected, and then we'd create a new "current" partition. Let's take a look at how that would work in practice. For this example, we'll assume that our historical and current partitions are split at June 1, 2005.
Notification of New Data
At this point, we'd like to briefly expand on the options for notifications used by proactive caching to kick off the processing of a partition. Specifying Change Notification Method
Proactive caching requires a notification to make it aware of new data. There are three basic methods available to you: notification "pushed" by SQL Server when a table changes, sending client-initiated XMLA message from a service or programmatically to Analysis Services, or polling at intervals to check for new data. You specify which method you want to use on the Notifications tab on the Storage Options dialog, as shown in Figure 12-5. As you saw earlier, you access the Storage Options dialog by clicking the Options button on the Partitions Storage Settings panel. But how do you choose which method is appropriate for your business requirements? SQL Server Notification SQL Server notification uses SQL Profiler to capture events that indicate a relevant table has been updated. This approach means that a notification is pushed to Analysis Services, which is much faster than Analysis Services polling for changes. Profiler does not guarantee that absolutely all update events will be detected, so it is possible that Analysis Services will not be notified, and your cube will not be as current as you expect. However, missed events usually result from heavy activity on the database, so it is likely that another event will trigger notification within a short time, and your cube will still be very current. SQL Server notification only works with SQL Server 2005 databases, because it uses the SQL Server Profiler to detect table changes. If your data source is something other than SQL Server 2005, you cannot use this method to notify Analysis Services of a change. Events are tracked through SQL Profiler using table names. Analysis Services can determine which table you are using in the partition and automatically monitor it. If you are using a view to populate the cube, you should specify the underlying tables that are actually being updated. In the preceding example, we used SQL Server Notification. Client-Initiated Notification You can use the XMLA command NotifyTableChanged to alert Analysis Services of a table change. You use this notification method when your data source is not SQL Server and you need faster response time than you might get by polling. In the sample code shown next, an XMLA command is sent to the server MyAnalysisServer. Analysis Services will be notified that the PatientDiagnosis table has changed. This table is used by the OLAP database DiagnosisProfile, as specified by the DatabaseID. The table resides in a database referenced by the data source Health Tracking DW, specified by the DataSourceID. Dim command As String = "<NotifyTableChange xmlns=""http://schemas.microsoft.com/analysisservices/2003/ engine"">" + _ "<Object>" + _ " <DatabaseID>DiagnosisProfile</DatabaseID>" + _ " <DataSourceID>Health Tracking DW</DataSourceID>" + _ "</Object>" + _ "<TableNotifications>" + _ " <TableNotification>" + _ " <DbSchemaName>dbo</DbSchemaName>" + _ "<DbTableName>PatientDiagnosis</DbTableName>" + _ " </TableNotification>" + _ "</TableNotifications>" + _ "</NotifyTableChange>" Dim client As New Microsoft.AnalysisServices. Xmla.XmlaClient client.Connect("MyAnalysisServer") client.Send(command, Nothing) client.Disconnect()
Polling by Analysis Services Polling is a method where a query is periodically sent to a data source by Analysis Services. It is useful when the data source is not SQL Server, and there is no application that can send an XMLA command to Analysis Services. Polling is usually less efficient and often means greater latency than other notification methods. On the Notifications tab, you specify a query that is to be used to check for new data, and how often it is to be sent. This query must return a single value that can be compared against the last value returned. If the two differ, a change is assumed to have occurred, initiating the proactive cache timing sequence. Ensuring Consistency During Processing
Just when you thought we had sorted out all the potential problems that could occur, some new fact records arrive in the data warehouse right in the middle of the cube processinghow do we avoid having an inconsistent picture? SQL Server 2005 has a useful new transaction isolation level called snapshot isolation, which uses row versioning to make sure that a transaction always reads the data that existed at the beginning of the transaction even if some updates have been committed since then. To get Analysis Services processing to use snapshot isolation, you can change the Isolation setting on the Analysis Services data source from the default ReadCommitted to Snapshot. If you try this out, you will get a warning message that snapshot isolation won't be used unless "MARS Connection" is enabled. You can turn on the MARS Connection property by editing the connection string in the Data Source Designer and then clicking the All button on the left to see all the advanced properties. Real-Time ETLWorking with the Data Source
We've got Analysis Services reading from the data warehouse automatically. How do we get data from the source into the data warehouse in real time? Periodic polling of the source system is the easiest to implement. You can use SQL Agent to schedule an SSIS package to run periodically to import new data into the data warehouse. This works well when you can accept latency on the order of several minutes or more, which is acceptable in many applications. Avoid using polling when the latency requirement is measured in seconds, because of the cost of initiating the package every few seconds (even if there's no new data). In our example, we used polling because we can accept a latency of a few minutes. We created a job in SQL Server Agent with an SSIS package step that invoked our import package that loads the data warehouse. We set a schedule for this job that specifies a recurring execution every two minutes. When setting this time for your schedule, consider how long it takes to reprocess the current partition and either reduce the frequency of execution if the processing takes a good part of the two minutes or reduce the time window on the current partition so the processing time will go down. You can initiate the importing of the data in other ways without polling if you have some control over the application that creates the data. Using a Message Queue
A message queue is a guaranteed delivery mechanism for sending data and messages between applications. The applications do not need to be on the same machine but, of course, do need some connectivity and authentication. MSMQ (Microsoft Message Queue) is a service that provides these features and is part of Windows Server. You can use an MSMQ task in an SSIS control flow to listen for messages indicating that new data is ready, and then launch the data flow task to begin the data import. Using Event Notifications
Windows Management Instrumentation (WMI) monitors the entire system for certain events, such as file creation. You can use a WMI task in SSIS to wait for an event and then proceed to the data flow task when the event occurs. For example, your data source application might push a flat file to a directory that you monitor with WMI. When the file is dropped in the directory, your package detects this and begins the process of loading the data. |