Multidimensional Databases: Problems and Solutions

In this chapter we have introduced the main aspects of data source integration, the process of transferring data from a set of heterogeneous data sources into a data warehouse. Source integration comprises several complex aspects, which together make it a very difficult task. Schema integration consists of integrating the schemas of the sources to obtain a homogeneous representation of the whole set of data, called the global schema, and of specifying the mapping between the global schema and the sources. We have shown the various steps of the schema integration process, comparing the different approaches adopted in the literature. Data integration consists of making the data sources available through a set of materialized views, retrieving the data from the sources themselves. We have illustrated several techniques and mechanisms proposed in the literature to deal with this task, classified according to the approach used to specify the mapping: local-as-view (LAV) versus global-as-view (GAV). Data cleaning and reconciliation consists of removing inconsistencies in the data retrieved from the sources, which are due to errors in the data or to differences in how the same data are represented in different sources. After showing the main causes of such "dirt," we have presented the main approaches to the duplicate detection problem, a major task in data cleaning that consists of identifying distinct records that refer to the same real-world entity.
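As a reminder of what duplicate detection involves, the following is a minimal sketch in Python and not one of the specific methods surveyed in the chapter. It assumes records are dictionaries of string fields and compares every pair using the standard library's `difflib.SequenceMatcher`; the function names, the sample records, and the 0.85 threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(value):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(value.lower().split())


def similarity(a, b, fields):
    """Average string similarity (between 0 and 1) over the compared fields."""
    scores = [
        SequenceMatcher(None, normalize(a[f]), normalize(b[f])).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def find_duplicates(records, fields, threshold=0.85):
    """Return index pairs (i, j), i < j, whose similarity reaches the threshold.

    The threshold is an assumed value; in practice it has to be tuned
    for the data at hand.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b, fields) >= threshold
    ]


# Hypothetical records coming from two different sources.
records = [
    {"name": "John Smith", "city": "New York"},
    {"name": "Jon  Smith", "city": "new york"},
    {"name": "Maria Rossi", "city": "Rome"},
]
pairs = find_duplicates(records, ["name", "city"])
print(pairs)
```

Comparing all pairs is quadratic in the number of records, so a sketch like this is only practical for small inputs; larger data sets would need some way of restricting which pairs are compared.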

Much work remains to be done in the field of source integration. The problem is very hard, and all of its aspects need further investigation from both the theoretical and the practical point of view. The latest approaches aim to provide automatic reasoning services; yet some source integration tasks inherently depend on the designer's knowledge.
