Multidimensional Databases: Problems and Solutions
In this section we set up a formal framework for source integration in data warehousing. Our main goal is to define the notion of a source integration system, which represents the component of a data warehouse system responsible for integrating its sources of information. We characterize a source integration system by three elements: the global schema, the sources, and the mapping between the two. We then provide the semantics both of the system and of query answering.
The formal definition of a source integration system is given below.
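Definition 1: A source integration system $\mathcal{I}$ is a triple $\langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, where $\mathcal{G}$ is the global schema, $\mathcal{S}$ is the source schema, and $\mathcal{M}$ is the mapping between $\mathcal{G}$ and $\mathcal{S}$.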
The following comments on the above formal definition are in order.
- The global schema $\mathcal{G}$ is intended to specify the structure of the information needed in the data warehouse. From a methodological point of view, such a schema is a reconciled view of the information stored in the sources. In what follows, we denote by $\mathcal{A}_{\mathcal{G}}$ the finite alphabet for the elements of the global schema. According to Devlin (1997), a conceptual data model, e.g., the entity-relationship model, is generally used for expressing the global schema. However, our formalization is completely independent of the particular data model used.
- The source schema $\mathcal{S}$ specifies the structure of the various data sources. Such a schema contains the intensional description of all the sources of the data warehouse application. Although in principle the various source schemas may be expressed using different data models and notations, it is common to define suitable wrappers that present all the schemas of the sources in a predefined form, e.g., in terms of the relational model. Therefore, the source schema is usually expressed as a set of relation schemas. In what follows, we denote by $\mathcal{A}_{\mathcal{S}}$ the finite alphabet for the elements of the source schema.
- The mapping $\mathcal{M}$ establishes a relationship between elements of the global schema $\mathcal{G}$ and those of the source schema $\mathcal{S}$. As we already said in the introduction, two basic approaches, namely GAV (global-as-view) and LAV (local-as-view), have been proposed for specifying the mapping, and we will distinguish between these two types of mappings when specifying the semantics of a source integration system.
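The three components described above can be pictured as a plain record. The following Python sketch is only an illustration: the class name, the field names, the encoding of a schema as a map from element names to arities, and the opaque view strings are all assumptions made here, not part of the formalization, which is independent of any particular data model.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SourceIntegrationSystem:
    """Hypothetical record for the three components of the system."""
    global_schema: Dict[str, int]  # element name -> arity (assumed encoding)
    source_schema: Dict[str, int]  # source relation name -> arity (assumed encoding)
    mapping: Dict[str, str]        # mapped element -> view, kept as opaque text here
    kind: str                      # "GAV" or "LAV"

    def is_well_formed(self) -> bool:
        """GAV maps every global element; LAV maps every source."""
        if self.kind == "GAV":
            mapped_side = self.global_schema
        elif self.kind == "LAV":
            mapped_side = self.source_schema
        else:
            return False
        return set(self.mapping) == set(mapped_side)


# Illustrative data only: one global element and two sources.
gav_system = SourceIntegrationSystem(
    global_schema={"lives_in": 2},
    source_schema={"s1": 2, "s2": 2},
    mapping={"lives_in": "s1 UNION s2"},
    kind="GAV",
)
# The same mapping read as LAV is ill-formed: it does not map s1 and s2.
lav_system = SourceIntegrationSystem(
    global_schema={"lives_in": 2},
    source_schema={"s1": 2, "s2": 2},
    mapping={"lives_in": "s1 UNION s2"},
    kind="LAV",
)
```

The only structural difference between the two kinds in this sketch is which side of the system carries the views.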
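Let us turn our attention to the semantics of a source integration system $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$.
Definition 2: Let $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ be a source integration system, and let $\mathcal{D}$ be a source database for $\mathcal{I}$, i.e., a database for the source schema $\mathcal{S}$. A global database $\mathcal{B}$ for $\mathcal{I}$, i.e., a database for the global schema $\mathcal{G}$, is said to be legal for $\mathcal{I}$ with respect to $\mathcal{D}$ if:
- $\mathcal{B}$ is legal with respect to $\mathcal{G}$, i.e., $\mathcal{B}$ satisfies all the constraints of $\mathcal{G}$;
- $\mathcal{B}$ satisfies the mapping $\mathcal{M}$ with respect to $\mathcal{D}$.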
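The notion of a global database $\mathcal{B}$ satisfying the mapping $\mathcal{M}$ with respect to a source database $\mathcal{D}$ depends on the type of mapping:
- GAV mapping. In the GAV approach, the mapping $\mathcal{M}$ associates to each element $r$ in $\mathcal{G}$ a view, i.e., a query, over $\mathcal{S}$, denoted by $\rho(r)$. We say that $\mathcal{B}$ satisfies $\mathcal{M}$ with respect to $\mathcal{D}$ if, for each element $r$ of $\mathcal{G}$, the set of tuples $r^{\mathcal{B}}$ that $\mathcal{B}$ assigns to $r$ contains the set of tuples $\rho(r)^{\mathcal{D}}$ that satisfy the query $\rho(r)$ in $\mathcal{D}$, i.e., $\rho(r)^{\mathcal{D}} \subseteq r^{\mathcal{B}}$. Note that this means that the view associated to $r$ is sound: the data provided by the sources satisfy the element of the global schema, but are not necessarily complete.
- LAV mapping. In the LAV approach, instead, the mapping $\mathcal{M}$ associates to each source $s$ in $\mathcal{S}$ a view, i.e., a query, over $\mathcal{G}$, denoted by $\rho(s)$. In this case, we say that $\mathcal{B}$ satisfies $\mathcal{M}$ with respect to $\mathcal{D}$ if, for each source $s$ of $\mathcal{S}$, the set of tuples $s^{\mathcal{D}}$ that $\mathcal{D}$ assigns to $s$ is contained in the set of tuples $\rho(s)^{\mathcal{B}}$ that satisfy the query $\rho(s)$ in $\mathcal{B}$, i.e., $s^{\mathcal{D}} \subseteq \rho(s)^{\mathcal{B}}$. Note that, analogously to the previous case, this means that the view associated to $s$ is sound.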
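The two containment conditions can be made concrete with a small Python sketch. It rests on assumptions that are not in the text: a database is a dictionary from relation names to sets of tuples, a view is an ordinary Python function from a database to a set of tuples, and the relation names `s1`, `s2`, and `lives_in` as well as the data are invented for illustration.

```python
from typing import Callable, Dict, Set, Tuple

Database = Dict[str, Set[Tuple]]         # relation name -> set of tuples (assumed encoding)
View = Callable[[Database], Set[Tuple]]  # a query, modelled as a Python function


def satisfies_gav(global_db: Database, source_db: Database,
                  views: Dict[str, View]) -> bool:
    """GAV: for each global element r, rho(r) evaluated on the sources
    must be contained in the tuples the global database assigns to r."""
    return all(view(source_db) <= global_db.get(r, set())
               for r, view in views.items())


def satisfies_lav(global_db: Database, source_db: Database,
                  views: Dict[str, View]) -> bool:
    """LAV: for each source s, the tuples of s must be contained in
    rho(s) evaluated on the global database."""
    return all(source_db.get(s, set()) <= view(global_db)
               for s, view in views.items())


# Illustrative data only.
source_db = {"s1": {("ann", "rome")}, "s2": {("bob", "milan")}}
global_db = {"lives_in": {("ann", "rome"), ("bob", "milan"), ("carl", "turin")}}
incomplete_db = {"lives_in": {("ann", "rome")}}

gav_views = {"lives_in": lambda d: d["s1"] | d["s2"]}  # rho(lives_in) over the sources
lav_views = {"s1": lambda b: b["lives_in"],            # rho(s1) over the global schema
             "s2": lambda b: b["lives_in"]}            # rho(s2) over the global schema

print(satisfies_gav(global_db, source_db, gav_views))      # True
print(satisfies_lav(global_db, source_db, lav_views))      # True
print(satisfies_gav(incomplete_db, source_db, gav_views))  # False
```

The extra tuple for `carl` does not violate either condition, which is what soundness without completeness amounts to in this encoding.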
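Queries posed to a source integration system $\mathcal{I}$ are expressed in terms of a query language over the alphabet $\mathcal{A}_{\mathcal{G}}$ of the global schema.
Definition 3: Let $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ be a source integration system, let $\mathcal{D}$ be a source database for $\mathcal{I}$, and let $q$ be a query over $\mathcal{G}$. The certain answers to $q$ with respect to $\mathcal{I}$ and $\mathcal{D}$ are the tuples $t$ such that $t \in q^{\mathcal{B}}$ for every global database $\mathcal{B}$ that is legal for $\mathcal{I}$ with respect to $\mathcal{D}$.
Since, in general, several global databases exist that are legal for $\mathcal{I}$ with respect to $\mathcal{D}$, answering a query cannot be reduced to evaluating it over a single global database: all the legal ones must be taken into account.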
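Because a query has to be answered with respect to all legal global databases at once, one common reading is to keep only the tuples returned in every one of them, often called the certain answers. The Python sketch below illustrates this reading under a strong simplifying assumption: the legal global databases are given as a finite, explicitly enumerated list, whereas in general there may be infinitely many. The function name, the relation `lives_in`, and the data are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Database = Dict[str, Set[Tuple]]          # relation name -> set of tuples (assumed encoding)
Query = Callable[[Database], Set[Tuple]]  # a query, modelled as a Python function


def certain_answers(query: Query, legal_dbs: Iterable[Database]) -> Set[Tuple]:
    """Intersect the answers to `query` over all given legal global databases."""
    answer_sets = [query(db) for db in legal_dbs]
    if not answer_sets:
        # With no legal database every tuple would be vacuously certain,
        # so this sketch refuses to answer rather than return a set.
        raise ValueError("no legal global database was given")
    return set.intersection(*answer_sets)


def people_in_rome(db: Database) -> Set[Tuple]:
    """Hypothetical query over the global schema: names of people living in Rome."""
    return {(name,) for (name, city) in db["lives_in"] if city == "rome"}


# Three hypothetical global databases, all assumed legal for the same sources.
legal_dbs = [
    {"lives_in": {("ann", "rome"), ("bob", "milan")}},
    {"lives_in": {("ann", "rome"), ("bob", "milan"), ("carl", "turin")}},
    {"lives_in": {("ann", "rome"), ("bob", "milan"), ("dana", "rome")}},
]

result = certain_answers(people_in_rome, legal_dbs)
print(result)  # {('ann',)}
```

Here `dana` is an answer in one legal database only, so she is not among the tuples that hold in all of them.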
As we said in the introduction, the main activities that are carried out in the design of a source integration system are: schema integration, data integration, and data cleaning. To relate these activities to the formalization presented in this section, we observe that:
- Schema integration aims to specify the three main components of the system, namely the global schema, the source schema, and the mapping.
- Data integration aims at defining the correct method for acquiring data from the sources, so as to populate (either virtually or physically) the elements of the global schema. In other words, its purpose is to come up with a suitable method for answering queries over the global schema by accessing the data at the sources.
- Data cleaning aims to design the mapping $\mathcal{M}$ of the source integration system in such a way that, when data are acquired from the sources, suitable conversion, transformation, and reconciliation actions are performed on them.
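To illustrate how data integration and data cleaning meet in the mapping, the following Python sketch physically populates one global element from two sources, applying a reconciliation step inside the view. Everything specific in it is an assumption made for illustration: the GAV-style view, the relation names `s1`, `s2`, and `person`, the name-normalization rule, and the data.

```python
from typing import Callable, Dict, Set, Tuple

Database = Dict[str, Set[Tuple]]         # relation name -> set of tuples (assumed encoding)
View = Callable[[Database], Set[Tuple]]  # a view, modelled as a Python function


def normalize_name(name: str) -> str:
    """Hypothetical cleaning rule: collapse whitespace and unify capitalization."""
    return " ".join(name.split()).title()


def person_view(source_db: Database) -> Set[Tuple]:
    """Hypothetical view for the global element `person`: union of the
    names found in s1 and s2, reconciled by normalize_name."""
    raw_names = [t[0] for t in source_db["s1"]] + [t[0] for t in source_db["s2"]]
    return {(normalize_name(n),) for n in raw_names}


def materialize(views: Dict[str, View], source_db: Database) -> Database:
    """Populate each global element with the tuples computed by its view."""
    return {element: view(source_db) for element, view in views.items()}


# Illustrative data only: the same person is spelled differently in each source.
source_db = {
    "s1": {("ann  smith", "rome")},
    "s2": {("ANN SMITH", 30), ("bob jones", 41)},
}
global_db = materialize({"person": person_view}, source_db)
print(global_db)
```

Without the normalization step the two spellings of the same name would populate `person` as two distinct tuples, which is the kind of discrepancy reconciliation is meant to remove.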