Multidimensional Databases: Problems and Solutions
In the database community, cooperation between GDBs and MDDBs has been approached through the notion of multiple representations of space, or map generalization. Generalization has been the subject of research by the cartographic and GIS communities. These studies address the geometric and modeling aspects of the generalization process. Geometric generalization processes are used solely for graphic display; they are intended for modeling spatial data for visual interaction through map scaling. Modeling generalizations consider the impact of scale and resolution on spatial data modeling and querying (see Muller, Lagrange, & Weibel, 1995).
In particular, some attempts have been made to find a standard set of multidimensional (or statistical) operators based on aggregation/disaggregation. The case of spatial partitioning played a central role in finding such a set of operators. For instance, Gargano et al. (1991) presented an extension of the relational model by ADTs in order to deal with spatial data and complex aggregated data. They extended the relational algebra essentially by defining two algebraic operators, named G-Compose and G-Decompose, that are able to manipulate either the spatial extension of geographic data or summary (statistical) data. The first operator is denoted by G-Compose_X(Fy; Y), where X and Y are two non-intersecting subsets of attributes of a relation R. It "merges" all tuples of R which are already projected on Y into a single one whose Y-value is generated by the application of the fusion function. This function, which is represented by Fy, takes a subset of elements of a given type and returns a single element of the same type. In the case of summary data, Fy aggregates the numeric values of the Y attributes. The effect of G-Decompose is that all tuples of R projected on Y are "decomposed."
The effect of G-Compose is analogous to that of the "Aggregate Format" operator proposed by Özsoyoglu, Özsoyoglu, & Victor (1987) for statistical data. In contrast to the "Aggregate Format" operator, where the aggregation function is unique, Fy in G-Compose can be a collection of different fusion functions. For a detailed description of these operators, see Gargano, Nardelli, & Talamo (1991).
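As an illustration only, the following Python sketch mimics the effect of G-Compose on a small relation held as a list of dictionaries. It assumes that tuples agreeing on the X attributes are merged and that each Y attribute has its own fusion function Fy. The function name `g_compose`, the sample relation, the use of sets of grid cells as a stand-in for geometry, and the grouping-by-X reading are all assumptions made for this sketch; they are not the definitions given by Gargano et al. (1991).

```python
from collections import defaultdict


def g_compose(relation, x_attrs, fusion):
    """Merge tuples that agree on x_attrs, fusing each Y attribute.

    relation: list of dicts, one dict per tuple (hypothetical encoding)
    x_attrs:  attribute names used as the grouping key (X)
    fusion:   dict mapping each Y attribute to its fusion function (Fy)
    """
    groups = defaultdict(list)
    for row in relation:
        key = tuple(row[a] for a in x_attrs)
        groups[key].append(row)

    result = []
    for key, rows in groups.items():
        merged = dict(zip(x_attrs, key))
        for y_attr, fuse in fusion.items():
            # Each fusion function takes a collection of values of one type
            # and returns a single value of the same type.
            merged[y_attr] = fuse([r[y_attr] for r in rows])
        result.append(merged)
    return result


# Hypothetical relation: a set of grid cells stands in for the spatial extension.
municipalities = [
    {"region": "North", "cells": frozenset({1, 2}), "population": 100},
    {"region": "North", "cells": frozenset({3}), "population": 50},
    {"region": "South", "cells": frozenset({4, 5}), "population": 70},
]

regions = g_compose(
    municipalities,
    x_attrs=["region"],
    fusion={
        "cells": lambda values: frozenset().union(*values),  # spatial fusion: union
        "population": sum,                                   # summary fusion: SUM
    },
)
```

Passing a dictionary of fusion functions reflects the remark that Fy can be a collection of different functions, one per attribute, rather than a single aggregation function.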
In the case of summary data, G-Compose is equivalent to the slice operator in OLAP databases or summarization in statistical databases (see Rafanelli & Ricci, 1993). The proposal by Gargano, Nardelli, & Talamo (1991) omits the fundamental issues of hierarchies and data aggregation for either spatial or summary data. These issues were later addressed in an approach proposed by Rigaux & Scholl (1995).
In this work, the authors bridge the geographic and statistical disciplines by defining an aggregation technique over a hierarchy of space partitions. Their model is based on the concept of "partition," which is used for partitioning either geometric space or other sets (e.g., a set of people). Such a set (geometry or people) is called, for the sake of simplicity, a population. The set of partitions on a generic population E is represented by P(E). It can be the domain of a generic attribute Ag. They introduced the concept of cover, which is essentially a relation defined by the schema O = {A1,…, An, Ag} such that πAg (O) is a partition in P(E), and there is a biunivocal functional dependency between the attributes A1,…, An and Ag. They defined the geometric projection operator on a subset of attributes S = {A1,…, Aq} as follows:
where nest_S is the N1NF grouping operation (see Abiteboul & Bidoit, 1986) on S, and ∑Geo : {Ag} → Ag is the geometric aggregation function. The operation nest_S(πS,Ag (O)) yields a relation with the schema {A1,…, Aq, B}, and ∑Geo performs the union aggregation function on attribute B = set(Ag).
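The three steps described here (project on S and Ag, nest on S, then aggregate the nested geometries by union) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a cover is a list of dictionaries, a geometry is approximated by a set of grid cells so that geometric union becomes set union, and the names `geometric_projection`, `geom`, and the sample cover are hypothetical.

```python
from itertools import groupby


def geometric_projection(cover, s_attrs, geo_attr="geom"):
    """Project a cover on s_attrs and merge the geometries of equal S-values."""
    # Step 1: project each tuple on S plus the geometric attribute Ag.
    projected = [(tuple(row[a] for a in s_attrs), row[geo_attr]) for row in cover]

    # Step 2: nest on S, collecting the geometries that share an S-value
    # into one set-valued attribute (B = set(Ag) in the text).
    projected.sort(key=lambda pair: pair[0])
    nested = [
        (key, [geom for _, geom in group])
        for key, group in groupby(projected, key=lambda pair: pair[0])
    ]

    # Step 3: apply the geometric aggregation (here: union) to each nested set.
    result = []
    for key, geoms in nested:
        row = dict(zip(s_attrs, key))
        row[geo_attr] = frozenset().union(*geoms)
        result.append(row)
    return result


# Hypothetical cover: the geometries are pairwise disjoint, so they form a partition.
land_use = [
    {"parcel": "p1", "crop": "wheat", "geom": frozenset({1, 2})},
    {"parcel": "p2", "crop": "corn", "geom": frozenset({3})},
    {"parcel": "p3", "crop": "wheat", "geom": frozenset({4})},
]

by_crop = geometric_projection(land_use, ["crop"])
```

Because the input geometries are disjoint and each one ends up in exactly one group, the merged geometries are again pairwise disjoint, which is what allows the result to be treated as a coarser partition of the same space.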
For representing summary data, they use the notion of cover, but each descriptive attribute A can be defined on a hierarchical domain. The same operator is redefined as below:
where, before the nest operator is applied, the abstraction level of the hierarchy to which attribute A belongs is changed. The effect of this operator, which is denoted by genA:A′ (O), is the same as that of the roll-up operator defined by Cabibbo & Torlone (1998): in each tuple, the value of attribute A is replaced by its ancestor (A′) in the hierarchy. The result of such an operator is a relation that is no longer a cover, since there are several tuples with the same value for A. Note that in this case, ∑Geo performs the numeric aggregation function SUM.
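The generalization step followed by nesting and SUM can be illustrated with a short Python sketch. It assumes a one-level hierarchy stored as a child-to-parent dictionary and a relation stored as a list of dictionaries; the names `generalize`, `nest_and_sum`, and `parent_of`, as well as the figures, are made up for illustration and do not come from Rigaux & Scholl (1995).

```python
from collections import defaultdict


def generalize(relation, attr, parent_of):
    """Replace each value of attr by its ancestor in the hierarchy."""
    return [{**row, attr: parent_of[row[attr]]} for row in relation]


def nest_and_sum(relation, s_attrs, measure):
    """Group tuples on s_attrs and aggregate the measure with SUM."""
    totals = defaultdict(int)
    for row in relation:
        totals[tuple(row[a] for a in s_attrs)] += row[measure]

    result = []
    for key, total in totals.items():
        row = dict(zip(s_attrs, key))
        row[measure] = total
        result.append(row)
    return result


# Hypothetical summary data; the resident counts are invented.
population = [
    {"area": "Rome", "residents": 30},
    {"area": "Viterbo", "residents": 5},
    {"area": "Milan", "residents": 20},
]
parent_of = {"Rome": "Lazio", "Viterbo": "Lazio", "Milan": "Lombardy"}

# After generalization, two tuples share the value "Lazio",
# so the relation is no longer a cover.
rolled_up = generalize(population, "area", parent_of)

# Nesting on the generalized attribute and summing yields one tuple per region.
by_region = nest_and_sum(rolled_up, ["area"], "residents")
```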
The model proposed by Rigaux & Scholl (1995) is intended to generate maps in multiple representations of data, using the hierarchy of space partitions and the hierarchy induced by a partial order relationship in the domain of an attribute. In this proposal, only one location dimension hierarchy is considered for summary data.
The issue of aggregation of spatial data has also been considered by Shekhar et al. (1999), from a point of view different from that of the previous proposals. The authors extend the concept of the data cube introduced by Gray et al. (1997) to the spatial domain (called spatial data cube) by proposing the map cube operator. This operator takes as its arguments a base map, a base table, a geographic hierarchy, and a set of cartographic preferences. It adds cartographic visualization to the spatial data cube. Such an operator is aimed at generating a collection of maps corresponding to the power set of all possible spatial and non-spatial aggregations, which can be browsed using OLAP operators.
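The power-set aspect of this operator can be illustrated with a small Python sketch that computes one aggregation per subset of the dimensions, in the spirit of the data cube. It is not the map cube operator itself: the base map, geographic hierarchy, cartographic preferences, and map rendering are all left out, and the function name `enumerate_aggregations` and the sample table are assumptions made for this sketch.

```python
from collections import defaultdict
from itertools import combinations


def enumerate_aggregations(rows, dimensions, measure):
    """Return one SUM group-by per subset of dimensions (2^n in total)."""
    result = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row[measure]
            # Keyed by the tuple of dimensions that were kept.
            result[dims] = dict(totals)
    return result


# Hypothetical base table with one spatial and one non-spatial dimension.
base_table = [
    {"region": "North", "year": 2000, "count": 5},
    {"region": "North", "year": 2001, "count": 7},
    {"region": "South", "year": 2000, "count": 3},
]

aggregations = enumerate_aggregations(base_table, ["region", "year"], "count")
```

In a map cube setting, each entry whose key includes the spatial dimension would correspond to a map to be drawn, while the remaining entries would correspond to ordinary summary tables.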
While the above models give a formal definition for the cooperation between spatial and multidimensional environments, some other works consider the architectural aspects of an integration system. For instance, Kouba, Matousek & Miksovsky (2000) tried to identify some requirements for the correct and consistent functionality of system interconnection. They proposed an integration module which has two different roles: one is the transformation of data from external data sources, and the other is the integration of the GIS and the data warehouse through their common components. The integration module coordinates the actions carried out by the GIS and the data warehouse. The GIS under consideration is based on an object-oriented model whose basic elements are objects and classes. In the GIS, the structure of the geographical class hierarchy is stored in a metadata object so that it can be accessed directly from the integration module.
In this work, cooperation is carried out through the common elements, namely the aggregation levels of the data warehouse "location" dimension and the taxonomy of the GIS objects. The task of the integration module is to provide the following three types of mapping:
- Class correspondence maps particular GIS taxonomical levels on the corresponding location dimension and vice versa.
- Instance correspondence maps particular instances of aggregation levels on the instances of the geographic classes and vice versa.
- Action correspondence consists of the processing of queries in one environment which require information stored in another environment. It guarantees navigation consistency, and provides the information to be modified in the integration module and propagated to the data warehouse and GIS.
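The first two kinds of mapping can be pictured as two-way lookup tables, as in the following Python sketch. It does not describe the actual integration module of Kouba et al. (2000): the class `Correspondence`, the identifiers, and the one-to-one requirement are assumptions introduced here to make the "and vice versa" direction well defined.

```python
class Correspondence:
    """Two-way lookup between GIS-side and warehouse-side identifiers."""

    def __init__(self, pairs):
        pairs = list(pairs)
        self.gis_to_dw = {gis: dw for gis, dw in pairs}
        self.dw_to_gis = {dw: gis for gis, dw in pairs}
        # Assumption: the mapping is one-to-one, so it can be read both ways.
        if not (len(self.gis_to_dw) == len(self.dw_to_gis) == len(pairs)):
            raise ValueError("correspondence must be one-to-one")


# Hypothetical class correspondence: GIS taxonomy level <-> location dimension level.
class_map = Correspondence([
    ("GisCountry", "country_level"),
    ("GisRegion", "region_level"),
    ("GisDistrict", "district_level"),
])

# Hypothetical instance correspondence: GIS object id <-> dimension member key.
instance_map = Correspondence([
    ("gis-object-17", "region-member-3"),
    ("gis-object-18", "region-member-4"),
])


def members_for_selection(selected_gis_objects):
    """Translate a selection made on the map into warehouse member keys."""
    return [instance_map.gis_to_dw[obj] for obj in selected_gis_objects]
```

A function such as `members_for_selection` hints at what action correspondence has to do: a query issued in one environment is rewritten in terms of the identifiers of the other environment before it is processed there.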
Moreover, the system includes a front-end module that displays the results. The implementation aspects of integrating a data warehouse, Microsoft SQL Server 7, with the ArcView GIS system are also discussed.
Paolucci et al. (2000) considered the integration of several spatio-temporal data collections of the Italian National Statistics Institute. The integration system, called SIT-IN, is defined mainly by a historical database containing the temporal variation of territorial administrative partitions; a statistical data warehouse providing statistical data from a number of different surveys; and a GIS providing the cartography of the Italian territory up to census tract level. The implemented cooperative systems manage the maps of the temporal evolution of a certain number of administrative regions and link to these maps the content of the above-mentioned statistical database of a given year.
The approach that will be discussed in this chapter shares a number of characteristics and goals with the above-mentioned works. Like the proposals of Gargano et al. (1991) and Rigaux & Scholl (1995), the main goal is to provide a formal approach for cooperative query answering. The previous approaches aimed at defining a set of operators applicable to either spatial or summary data without dealing with the "logical organization" of databases at all. Consequently, it is not possible to handle summary queries in the context of GDBs. Conversely, our approach relies on a formal logical model that provides a solid basis for the study of summary data manipulation in GDBs.
The above-mentioned models do not explicitly consider some multidimensional issues that our approach addresses, such as the notion of multiple location dimension hierarchies. Their models are based on multidimensional data formed by only one location dimension, whereas in our approach we also consider data defined by more than one location dimension, and we analyze their effect on data modeling and query answering.
The works by Kouba et al. (2000) and Paolucci et al. (2000) are aimed mainly at technical realization and discuss some interesting issues related to the implementation of the integrating system. However, the focus of their papers is on the development of querying rather than on logical data modeling.