Multidimensional Databases: Problems and Solutions
One of the predominant operations on multidimensional aggregate data is the removal of a dimension from a multidimensional aggregate data structure (obtaining, for example, "Population by year and age-groups" from "Population by year, age-groups, and sex"). Such an operation is often called summarization. This operator takes a single operand and produces a recomputed measure (in the case of numerical values) or instances formed by sets of alphanumeric values (in the case of non-numerical values). The first formal proposal for summarizing an attribute, thereby reducing the number of dimensions of a MAD (in that case, a table), was made in Rafanelli & Ricci (1984, 1985), and subsequently in Fortunato et al. (1986), Rafanelli & Shoshani (1990), and Rafanelli & Ricci (1993). In Gyssens & Lakshamanan (1997), the authors study this operator on both relations and tables.
Several other terms have been used for this operation. Among them, we recall aggregation, informally discussed in Shoshani & Wong (1985), where one of the concepts discussed is the "collapsing" of multidimensional data structures in order to remove a certain dimension; attribute removal by aggregation in Ozsoyoglu, Ozsoyoglu, & Mata (1985); slice (a term used especially in OLAP applications) in Gyssens & Lakshamanan (1997) and in Shoshani (1997); and destroy dimension in Agraval, Gupta, & Sarawagi (1997). Since the term "aggregation" has been widely used in this chapter to denote a different concept, in the following we will use the terms summarization (which often refers to statistical databases) and removing with the same meaning. In the context of relational algebra, this operator is often called projection, as in Ozsoyoglu, Ozsoyoglu, & Matos (1987), Pedersen & Jensen (1999), and Pedersen, Jensen, & Dyreson (2001), with very few differences.
As already mentioned, this operator deletes one category attribute of a MAD, with consequent recomputation of the summary attribute values. This recomputation is not always possible: for example, it fails if the measure is not numeric or if, in the case of numeric values, the summary type of the MAD is "average." In the latter case we need the corresponding "count" and "sum" summary values, or the raw data, so that the aggregation process can be applied again. Since a multidimensional aggregate data structure represents a functional link between sets of raw data (rather than n-tuples of dimension instances) and measures, in our framework summarization is the operation that allows the user to (implicitly or explicitly) delete one attribute (which, in this case, represents one dimension of the MAD), or to transform it into an implicit one, and to recompute the measures accordingly.
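The point about the "average" summary type can be illustrated with a minimal Python sketch. The MAD "Average income by year and sex", its cell values, and the function names are all hypothetical and serve only to show why the auxiliary "sum" and "count" values are needed when a dimension is removed.

```python
from collections import defaultdict

# Hypothetical cells of a MAD "Average income by year and sex".
# Each cell carries the auxiliary "sum" and "count" summary values.
cells = {
    (2000, "F"): {"sum": 300.0, "count": 3},  # average 100.0
    (2000, "M"): {"sum": 200.0, "count": 1},  # average 200.0
}


def summarize_average(cells, drop_index):
    """Remove the dimension at drop_index and recompute the average
    from the accumulated sums and counts."""
    totals = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for key, cell in cells.items():
        reduced = key[:drop_index] + key[drop_index + 1:]
        totals[reduced]["sum"] += cell["sum"]
        totals[reduced]["count"] += cell["count"]
    return {k: v["sum"] / v["count"] for k, v in totals.items()}


def naive_average_of_averages(cells, drop_index):
    """Incorrect shortcut: average the cell averages, ignoring the counts."""
    groups = defaultdict(list)
    for key, cell in cells.items():
        reduced = key[:drop_index] + key[drop_index + 1:]
        groups[reduced].append(cell["sum"] / cell["count"])
    return {k: sum(v) / len(v) for k, v in groups.items()}


correct = summarize_average(cells, drop_index=1)
naive = naive_average_of_averages(cells, drop_index=1)
```

With these made-up values, the recomputation from sums and counts yields 125.0 for the year 2000, whereas averaging the two cell averages yields 150.0, which is why the average alone is not sufficient.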
In the following, in order to avoid ambiguity, we will distinguish total summarization, or T-SUMMARIZATION (which implies the removal of the dimension), from implicit summarization, or I-SUMMARIZATION (which transforms the dimension from explicit to implicit and collapses the set of definition domain instances into a single set-value that summarizes all the values of the original domain, but not all the values of the primitive attribute definition domain). Only in the first case is the descriptive space of the MAD reduced by one dimension without loss of information.
The above-mentioned distinction between total summarization and implicit summarization was made in Bezenchek, Rafanelli, & Tininini (1996a) and, subsequently, in the ADAMO model (see Bezenchek, Rafanelli, & Tininini, 1996b) discussed in Chapter 1. With the introduction of the "implicit attribute" concept, the summarization operator was thus refined. The (total or implicit) summarizability of a category attribute, or its non-summarizability, depends on three interdependent factors, namely:
- the partitioning characteristics of the category attribute;
- the fact described by the MAD;
- the aggregation function type applied to the raw data to obtain this MAD.
In particular, it has been shown that the partitioning characteristics of
For example, let us consider the MAD "Number_of_cars_produced_in_Japan" in Figure 10, described by "model" and "years" (and also by "country," a dimension that is "implicit" because it has only one value, "Japan," which appears in the title of the MAD). Suppose we wish to obtain the total number of cars produced per "years" only. In this case we have to apply the summarization operation to the category attribute "model." Because the instances of the category attribute "model" are <Corolla, Civic, Corona>, and because these are not the only car models produced in Japan in that period, the operator applied will be I-SUMMARIZATION. In this way the category attribute "model" is transformed into an implicit attribute and a note is added to the MAD, as shown in Figure 11.
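The I-SUMMARIZATION of this example can be sketched in Python as follows. The dictionary layout, the function name `i_summarize`, the wording of the note, and all the production figures are assumptions made for illustration; they are not the values of Figure 10 or Figure 11.

```python
from collections import defaultdict

# Illustrative figures only; these are NOT the values shown in Figure 10.
mad = {
    "title": "Number_of_cars_produced_in_Japan",
    "dimensions": ["model", "years"],
    "implicit": {"country": {"Japan"}},
    "cells": {
        ("Corolla", 1990): 10, ("Civic", 1990): 20, ("Corona", 1990): 30,
        ("Corolla", 1991): 15, ("Civic", 1991): 25, ("Corona", 1991): 35,
    },
    "notes": [],
}


def i_summarize(mad, attribute):
    """Turn an explicit category attribute into an implicit one.

    The counts are summed over the attribute, and the set of instances it
    ranged over is kept as a single set-value plus a note on the MAD."""
    idx = mad["dimensions"].index(attribute)
    totals = defaultdict(int)
    instances = set()
    for key, value in mad["cells"].items():
        instances.add(key[idx])
        totals[key[:idx] + key[idx + 1:]] += value
    note = f"{attribute} restricted to: {', '.join(sorted(instances))}"
    return {
        "title": mad["title"],
        "dimensions": [d for d in mad["dimensions"] if d != attribute],
        "implicit": {**mad["implicit"], attribute: instances},
        "cells": dict(totals),
        "notes": mad["notes"] + [note],
    }


result = i_summarize(mad, "model")
```

In this sketch the resulting MAD is described only by "years," while "model" survives as an implicit attribute whose single set-value records that the totals refer to Corolla, Civic, and Corona only.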
Recall that in Chapter 1 we gave the formal definition of a simple MAD s1. Therefore, given a phenomenon x and given the set of all the relations Rx (of the micro database) involved in the production of all the MADs which describe this phenomenon, we considered the subset of Rx formed only by the relations involved in the building of fact
Let us suppose that the summarizability conditions, discussed in Chapter 4, have been verified. Then s1 is a MAD defined on the base relation
The summarization of s1 with respect to A1x (with x ∊ {1, …, M}) produces a new MAD
where N,
A1x becomes an implicit category attribute of s'1. It can completely disappear from the new descriptive space {A1'j'}, with j' = 1, …, x−1, x+1, …, s, if its definition domain Δ (Ex) completely covers the definition domain of the unique top-level category attribute (denoted by ALL, see Gray et al., 1997) of the hierarchy to which it belongs. If, instead, A1x does not belong to any hierarchy, it disappears only if its definition domain coincides with its primitive definition domain.
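The rule above for deciding whether the summarized attribute disappears can be sketched as a small Python predicate. This is only one possible reading: the function name, the parameter names, and the modeling of "covering the definition domain of ALL" as set inclusion over the instances that roll up to the top level are assumptions, as are the example domains.

```python
def attribute_disappears(domain, primitive_domain, all_level_domain=None):
    """Return True if the summarized attribute can be dropped entirely
    (total summarization), False if it must remain as an implicit attribute.

    domain            -- instances of the attribute present in the MAD
    primitive_domain  -- primitive definition domain of the attribute
    all_level_domain  -- hypothetical: the instances spanned by the top-level
                         attribute ALL, or None if the attribute belongs to
                         no hierarchy
    """
    domain = set(domain)
    if all_level_domain is not None:
        # The attribute belongs to a hierarchy: its domain must cover
        # everything that rolls up to ALL.
        return set(all_level_domain) <= domain
    # No hierarchy: the domain must coincide with the primitive domain.
    return domain == set(primitive_domain)


# Hypothetical domains, for illustration only.
models_in_mad = {"Corolla", "Civic", "Corona"}
all_models = {"Corolla", "Civic", "Corona", "Accord"}
sexes = {"F", "M"}

model_case = attribute_disappears(models_in_mad, all_models)
sex_case = attribute_disappears(sexes, sexes)
```

Under these assumptions, "model" does not disappear (its domain is a proper subset of the primitive one, so it becomes implicit), whereas an attribute such as "sex" with both values present can be removed completely.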