Multidimensional Databases: Problems and Solutions

The quality of data is very important for company business (Stoker, 2000). The ability of a company to make critical decisions in a complex business environment largely depends on how it uses information. If the quality of company data is poor, the information produced by data mining tools will also be incorrect, following the well-known "garbage in, garbage out" principle.

Usually, a large part of a company's data is stored in legacy databases, which complicates the task of building a data warehouse. When information coming from heterogeneous sources is gathered and integrated, data quality problems are likely to be present, because different sources often contain redundant data under different representations, which leads to inconsistencies. Problems may also arise within a single source, such as missing information, misspellings, or other kinds of invalid data.

Data cleaning (Bouzeghoub & Comyn-Wattiau, 1990; Rahm & Do, 2000; Quass, 1999) consists of detecting and removing inconsistencies from data, in order to improve data quality. It constitutes the most important step of the so-called ETL (extraction, transformation, loading) process. The task of data cleaning is orthogonal to that of semantic data integration: even when query answering problems are solved in a data warehouse, the retrieved data may still be dirty and inconsistent.

The process of data cleaning is disregarded by many IT managers (Maydanchick, 2000). One of the reasons for this is the high and unpredictable cost of the operation. Moreover, it is very difficult to plan a data cleaning process, due to the overwhelming complexity of business rules in a modern data warehouse; without such a plan, not only is the data cleaning procedure unlikely to end on time, but measuring the success of the operation becomes impossible.

Henceforth, we will assume that data are represented in the relational model. There are many reasons for data to be dirty; Rahm & Do (2000) and Quass (1999) present a number of them. We group data problems into two main categories: invalid data and differences in the representation of the same data.

A major problem in data cleaning is that of overlapping data (Hernàndez & Stolfo, 1995; Monge & Elkan, 1996, 1997; Roychoudhury, Ramakrishnan, & Swift, 1997; Monge & Elkan, 2000), also referred to as the duplicate elimination or merge/purge problem. This problem arises when different records representing the same real-world entity are gathered in a data warehouse. In this case, duplicates have to be detected and merged. Most efforts in data cleaning have been devoted to the solution of the duplicate detection problem, which proves to be a central issue: in order to detect different records related to the same real-world entity, a system has to resolve all conflicts concerning that entity that are due to its different representations in different sources. In the following, we will focus our attention on the duplicate elimination problem.

Before starting the actual data cleaning process, a pre-processing step is usually performed. Pre-processing consists of standardization operations, for example simple format conversions or the expansion of abbreviations (domain-dependent standardization techniques are presented, e.g., in Borkar, Deshmukh, & Sarawagi (2000) and Roychoudhury, Ramakrishnan, & Swift (1997)). The aim of this preliminary step is to make candidate duplicate records more similar, so that they are easier to identify during the duplicate detection step.
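
As a rough illustration, the following Python sketch shows the kind of standardization such a pre-processing step might perform; the abbreviation dictionary and the format rules are purely illustrative assumptions, not taken from the cited works.

```python
import re

# Illustrative abbreviation dictionary; a real system would rely on a
# much larger, domain-specific one.
ABBREVIATIONS = {"st.": "street", "ave.": "avenue", "inc.": "incorporated"}

def standardize(value: str) -> str:
    """Apply simple standardization steps: lower-casing, whitespace
    normalization and abbreviation expansion."""
    value = value.lower().strip()
    value = re.sub(r"\s+", " ", value)  # collapse repeated whitespace
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in value.split(" ")]
    return " ".join(tokens)

# Two different renderings of the same address converge to the same
# standardized form, easing later duplicate detection.
print(standardize("123  Main St."))    # -> "123 main street"
print(standardize("123 Main Street"))  # -> "123 main street"
```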

Two metrics (Low, Lee, & Ling, 2001) are usually defined to measure the effectiveness of a data cleaning system: recall and false-positive error. Recall, also known as percentage hits, true positives, or true merges, is defined as the percentage of the duplicate records that are correctly identified, out of all duplicate records. The false-positive error is the percentage of records that are wrongly identified as duplicates, out of all identified records. This metric, also known as false merges, is related to the precision metric: precision = 100% − false-positive error.
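
Both metrics can be computed directly from the counts of correctly and wrongly identified duplicates; the minimal Python sketch below (with hypothetical function and variable names) simply mirrors the definitions above.

```python
def recall(true_merges: int, total_duplicates: int) -> float:
    """Percentage of actual duplicate records that are correctly identified."""
    return 100.0 * true_merges / total_duplicates

def false_positive_error(false_merges: int, total_identified: int) -> float:
    """Percentage of records wrongly identified as duplicates,
    out of all records identified as duplicates."""
    return 100.0 * false_merges / total_identified

# Example: 90 of 100 real duplicates found; 10 of the 100 identified
# records are false merges.
r = recall(90, 100)                 # 90.0
fp = false_positive_error(10, 100)  # 10.0
precision = 100.0 - fp              # 90.0, as in the relation above
print(r, fp, precision)
```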

Determining exact duplicates is an easy task (Bitton & DeWitt, 1983), and can be done by sorting all records and checking whether there are identical neighbors. Detecting inexact duplicates relating to the same real-world entity is, instead, quite a complex task. Once we have a technique to compare two records, with N records the naïve approach would require N² comparisons, which is obviously inefficient.
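
A minimal Python sketch of the sort-and-compare approach for exact duplicates; the record representation is an illustrative assumption.

```python
def exact_duplicates(records: list[tuple]) -> list[tuple]:
    """Detect exact duplicates by sorting all records and comparing each
    record with its immediate neighbour (O(N log N) overall), instead of
    the naive O(N^2) pairwise comparison."""
    ordered = sorted(records)
    duplicates = []
    for prev, curr in zip(ordered, ordered[1:]):
        if prev == curr:
            duplicates.append(curr)
    return duplicates

rows = [("ann", "smith"), ("bob", "jones"), ("ann", "smith")]
print(exact_duplicates(rows))  # [('ann', 'smith')]
```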

In Hernàndez & Stolfo (1995), a more efficient technique is presented, called the sorted neighborhood method. This method consists of three steps: (1) a key is computed for each record, typically by extracting relevant fields or portions of fields; (2) the records are sorted according to that key; (3) a window of fixed size is moved over the sorted records, and each record is compared only with the records that fall inside the window.
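
A Python sketch of the three steps is given below; the key-extraction and matching functions are hypothetical placeholders, since in practice they are strongly domain-dependent.

```python
def sorted_neighborhood(records, make_key, is_match, window=5):
    """Sketch of the sorted neighborhood method: (1) derive a key for
    every record, (2) sort the records by that key, (3) slide a fixed-size
    window over the sorted records, comparing each record only with the
    records currently inside the window."""
    keyed = sorted(records, key=make_key)  # steps 1 and 2
    matches = []
    for i, rec in enumerate(keyed):        # step 3
        for other in keyed[max(0, i - window + 1):i]:
            if is_match(rec, other):
                matches.append((other, rec))
    return matches

# Hypothetical key and matcher for (surname, forename) records.
records = [("Smith", "John"), ("Smyth", "John"), ("Jones", "Ann")]
make_key = lambda r: (r[0][:3].lower(), r[1][:1].lower())
is_match = lambda a, b: a[1] == b[1] and a[0][0] == b[0][0]
print(sorted_neighborhood(records, make_key, is_match))
# [(('Smith', 'John'), ('Smyth', 'John'))]
```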

The main drawback of the sorted neighborhood method lies in the fact that the merging heavily depends on the sorting, and therefore on the key. If two inexact duplicates fall far apart after sorting, they will never be recognized as duplicates. This problem can be partly solved by enlarging the window, thus improving recall; on the other hand, this increases the computational complexity of the merging phase.

As the choice of the key is crucial in the sorted neighborhood method, a good strategy for increasing recall is to perform multiple passes (Monge & Elkan, 1997; Hernàndez & Stolfo, 1995), each with a different key, and to compute the transitive closure of the matches found in each pass. That is, if record A matches record B in one scan, and record B matches record C in another, then we deduce that A matches C. This approach addresses the problem of the window size, as multiple passes with a small window have been shown to perform better, and to be more efficient, than a single scan with a large window. On the other hand, introducing transitivity increases the false-positive error, lowering the overall precision.
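
The transitive closure of the matches produced by the individual passes can be computed, for instance, with a union-find structure; the sketch below is one possible implementation and is not taken from the cited papers.

```python
class UnionFind:
    """Minimal union-find used to compute the transitive closure of the
    pairwise matches produced by the individual passes."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Matches from two hypothetical passes, each using a different key.
pass1 = [("A", "B")]
pass2 = [("B", "C")]

uf = UnionFind()
for a, b in pass1 + pass2:
    uf.union(a, b)

# A, B and C end up in the same duplicate group: A matches C by transitivity.
print(uf.find("A") == uf.find("C"))  # True
```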

An alternative to the window scan consists of partitioning the records into clusters, where each cluster contains records that match each other according to a certain criterion. Duplicate elimination can then be performed on each cluster separately, and in parallel. This approach suffers from the same drawbacks as the plain sorted neighborhood method; moreover, if the database is small, many very small (or even singleton) clusters are likely to occur.
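
A minimal sketch of such a partitioning step, assuming a hypothetical clustering criterion (here, the first letter of the surname):

```python
from collections import defaultdict

def partition_into_clusters(records, criterion):
    """Partition records into clusters according to a matching criterion;
    duplicate elimination can then run on each cluster independently (and
    in parallel), never comparing records from different clusters."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[criterion(rec)].append(rec)
    return clusters

records = [("Smith", "John"), ("Smyth", "John"), ("Jones", "Ann")]
for key, cluster in partition_into_clusters(records, lambda r: r[0][0].lower()).items():
    print(key, cluster)
# s [('Smith', 'John'), ('Smyth', 'John')]
# j [('Jones', 'Ann')]
```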

A clustering technique presented in Monge & Elkan (1997) makes use of a priority queue containing a set of clusters. Records that do not already appear in a cluster are compared with representative records of each cluster: if the match succeeds, the record is added to the corresponding cluster. If the record cannot be placed in any cluster, a new cluster is created, containing only that record. As the priority queue could grow significantly, a maximum number of clusters is fixed; if the limit is exceeded, the cluster with the lowest priority is removed. At any time, the most recently created cluster has the highest priority. In this technique, the choice of the representative records is crucial and strongly influences the results; heuristics have to be developed in order to optimize this choice.
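
The Python sketch below gives one possible reading of this priority-queue technique; using the first record of each cluster as its representative is a simplistic assumption, not the heuristic of the original paper.

```python
from collections import deque

def priority_queue_clustering(records, is_match, max_clusters=4):
    """Each record is compared with a representative of every cluster in
    the queue; on a match it joins that cluster, otherwise a new cluster
    is created. When the queue exceeds its maximum size, the lowest
    priority (oldest) cluster is removed. The leftmost cluster is the
    most recently created one and therefore has the highest priority."""
    queue = deque()
    finished = []
    for rec in records:
        for cluster in queue:
            if is_match(rec, cluster[0]):  # compare with the representative
                cluster.append(rec)
                break
        else:
            queue.appendleft([rec])        # new cluster, highest priority
            if len(queue) > max_clusters:
                finished.append(queue.pop())  # evict lowest-priority cluster
    return finished + list(queue)

# Hypothetical matcher on simple name strings.
records = ["john smith", "jon smith", "ann jones", "anne jones"]
is_match = lambda a, b: a.split()[-1] == b.split()[-1]
for cluster in priority_queue_clustering(records, is_match):
    print(cluster)
# ['ann jones', 'anne jones']
# ['john smith', 'jon smith']
```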

Finally, knowledge-based approaches have been applied to data cleaning. Galhardas et al. (2001) attempt to separate the logical level from the physical level: the logical level supports the design of the data transformation procedures, while the physical level supports their implementation and optimization. Within this framework, a declarative language for data cleaning transformations is presented, together with a declarative specification of user interaction and a declarative way to specify properties of the matching operations. These techniques have been implemented in an actual data cleaning system.

A framework for knowledge-based data cleaning, based on a set of rules, has been proposed by Low, Lee, & Ling (2001). In this approach, the processing stage is driven by a set of rules: duplicate identification rules, merge/purge rules, update rules, and alert rules (which specify some interaction with the user). A pre-processing step is also defined, in which the data are standardized. At the end of the data cleaning process, a framework for human validation and verification of the results is provided.
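
As a purely illustrative sketch of a rule-driven approach (not the actual system of Low, Lee, & Ling), the Python code below represents rules as condition/action pairs and applies a hypothetical duplicate identification plus merge/purge rule to pairs of records; all names and fields are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """An illustrative cleaning rule: a condition over a pair of records
    and an action performed when the condition holds."""
    name: str
    condition: Callable[[dict, dict], bool]
    action: Callable[[dict, dict], dict]

# Hypothetical rule: records with the same tax_id describe the same
# person (duplicate identification); merge them, keeping non-empty fields.
same_person = Rule(
    name="duplicate identification + merge/purge",
    condition=lambda a, b: a["tax_id"] == b["tax_id"],
    action=lambda a, b: {**b, **{k: v for k, v in a.items() if v}},
)

def apply_rules(records: list[dict], rules: list[Rule]) -> list[dict]:
    """Naive engine: apply each rule to every pair of records. A real
    system would also support update and alert rules and would defer
    uncertain cases to human validation."""
    cleaned = list(records)
    for rule in rules:
        kept = []
        while cleaned:
            rec = cleaned.pop()
            for i, other in enumerate(cleaned):
                if rule.condition(rec, other):
                    cleaned[i] = rule.action(rec, other)  # merge and continue
                    break
            else:
                kept.append(rec)  # no match: keep the record as it is
        cleaned = kept
    return cleaned

records = [
    {"tax_id": "123", "name": "J. Smith", "phone": ""},
    {"tax_id": "123", "name": "John Smith", "phone": "555-0100"},
    {"tax_id": "456", "name": "Ann Jones", "phone": "555-0101"},
]
print(apply_rules(records, [same_person]))
```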

In conclusion, the problem of cleaning data in the context of data warehousing is still far from being solved. The identification of inexact duplicates is inherently characterized by a certain degree of uncertainty, which means that human validation is needed in order to evaluate the performance of a data cleaning system. On the other hand, human inspection is infeasible when the data are very large, a case that is quite common in practice. Another serious problem is domain dependence: an algorithm that performs very well on one database may behave badly on another. Declarative approaches seem to introduce some flexibility, which may produce good results across several domains. Even so, the results still depend strongly on the knowledge provided to the system, which in turn is produced by human reasoning.
