Data Mining: Opportunities and Challenges
USE OF DATA MINING IN DESIGNING THE SYSTEM KNOWLEDGE BASE
The main function of the system to be developed is to route each user's information requirement to the domains most likely to satisfy it. The purpose is to avoid sending every query systematically to all domains, which would increase network traffic and interfere with the normal operation of the domains. The system therefore needs a strategy for analyzing an information requirement and determining the domains to which it will be directed. This can be formalized as follows. Let D be the set of all domains that are part of the system. Given an information requirement qi, the system must define the subset of domains DP ⊂ D that has the potential for answering it. Let dR be the domain that can provide the required information. The system is effective if dR ∊ DP, and the fewer domains DP contains, the more efficient the system is. Thus, for each information requirement qi there is a set of domains DP with the potential for providing the required information and a set of domains DN that cannot provide it. The process can therefore be seen as a classification of domains into two groups: domains with potential for providing the required information and domains without it. To carry out this classification, a discriminating analysis can be employed (Kachigan, 1991).
The data needed for this classification could be obtained by letting the system initially operate without any mechanism for routing queries to potential domains, so that it systematically sends each query to all domains. As specified in the previous section, the user formulates queries in natural language; the text is then cleaned by deleting connectors, pronouns, articles, etc., leaving only the words considered key to the query.
Thus, the set of keywords Kqi of query qi is obtained. In this way, for each query qi the data presented in Table 1 will be obtained.
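As an illustration of the cleaning step, the following minimal Python sketch derives a keyword set Kqi from a query text. The stop-word list, the function name extract_keywords, and the sample query are assumptions made for this example and are not part of the system described here; a real implementation would need a complete stop-word list for the users' language.

```python
import re

# Hypothetical, deliberately small stop-word list (articles, pronouns,
# connectors, auxiliaries). It is an assumption made for this sketch only.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "for", "to",
    "with", "by", "from", "i", "you", "he", "she", "it", "we", "they",
    "me", "my", "our", "is", "are", "what", "which", "do", "does",
}


def extract_keywords(query_text):
    """Return the keyword set K_qi: lower-cased words minus stop words."""
    words = re.findall(r"[a-z0-9]+", query_text.lower())
    return {word for word in words if word not in STOP_WORDS}


# Hypothetical query used only to exercise the function.
sample_query = "What is the delivery date of the steel pipes in the order?"
print(sorted(extract_keywords(sample_query)))
```

Returning a set rather than a list reflects that only the presence or absence of a keyword in the query matters for the analysis.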
The first field of Table 1, Query, identifies the query qi sent by the user. The second field, Domain, identifies each consulted domain dj. The third field, ValuationAnswer (VA), contains the valuation of the answer to the query. It holds a qualitative variable va(dj, qi) that can take the values [Positive, Negative, Null]. Positive indicates that the consulted domain dj provided the information required by qi. Negative indicates that dj answered the query qi in a negative way, and Null indicates that dj did not provide any answer to qi. Each of the remaining fields of Table 1 represents a keyword from the set Kq of all keywords of the queries. Each field is named after a keyword and represents a binary variable pk that takes the value 1 if the keyword appears in the query and 0 if it does not. Once a considerable amount of data has been stored, it can be used to define a discriminating function for classifying the domains for each query. The domains are the objects subjected to analysis. We assume that the keywords included in the queries that a domain has answered positively (va = [Positive]) allow us to classify that domain, and so we define the variables pk as the predicting variables. The criterion variable is qualitative, defined by the classification labels [Potential domain] and [Non-potential domain]. We then define the discriminating function as follows:
RSV(dj, qi) = w1 p1 + ... + wk pk + ... + wn pn
where wk are the weights associated with the respective predicting variables, and RSV is the resulting discriminating score of the domain. Each domain (object) dj ∊ D will have a value on this discriminating function depending on its values on the predicting variables p1, ..., pk, ..., pn. Because the criterion variable is qualitative, it is necessary to define a cutoff score, which will be equal to or greater than zero.
Domains with an RSV equal to or greater than the cutoff score are assigned to the DP criterion set (Potential domain set), and domains with an RSV lower than the cutoff score are assigned to the DN criterion set (Non-potential domain set). In addition, given an information requirement qi, the RSV value of each domain classified into the DP criterion set can be used to rank these domains. That is, the domain dj ∊ DP with the greatest RSV is ranked first, the domain dj ∊ DP with the second greatest RSV is ranked second, and so forth.
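The scoring, classification, and ranking steps just described can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the score is taken to be a weighted sum of the binary keyword variables, each domain is assumed to have its own weight table, and the domain names, weights, and cutoff value are hypothetical values chosen only for illustration.

```python
def rsv(weights, keywords):
    """Score of one domain: sum of w_k over the keywords present (p_k = 1)."""
    return sum(w for keyword, w in weights.items() if keyword in keywords)


def classify_and_rank(domain_weights, keywords, cutoff=0.0):
    """Split domains into DP (RSV >= cutoff) and DN (RSV < cutoff).

    DP is returned sorted by descending RSV, so its first element is the
    domain ranked first.
    """
    scores = {d: rsv(w, keywords) for d, w in domain_weights.items()}
    potential = sorted(
        (d for d, s in scores.items() if s >= cutoff),
        key=lambda d: scores[d],
        reverse=True,
    )
    non_potential = [d for d, s in scores.items() if s < cutoff]
    return potential, non_potential, scores


# Hypothetical domains and weights, assumed for this example only.
domain_weights = {
    "sales": {"price": 0.5, "invoice": 0.3, "customer": 0.2},
    "production": {"steel": 0.4, "pipes": 0.4, "schedule": 0.2},
    "personnel": {"salary": 0.6, "vacation": 0.4},
}
query_keywords = {"price", "steel", "pipes"}

dp, dn, scores = classify_and_rank(domain_weights, query_keywords, cutoff=0.3)
print("DP (ranked):", dp)
print("DN:", dn)
```

With a cutoff of zero and non-negative weights every domain would fall into DP, so in this sketch the cutoff is what actually limits the number of domains consulted.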
Initial Definition of the Discriminating Function Weights
One alternative for calculating the weights of the discriminating function described above would be to implement the system without its main functionality (the capacity for routing a user's query to the domains most likely to answer it) and to operate it long enough to store the data needed to calculate the weights. Another alternative would be to make an initial estimate of the discriminating function weights, so that the main functionality of the system can be developed from the start. For this purpose, our proposal consists of working with experts of each domain, who provide an initial set of keywords (predicting variables) that characterize the information the domain can provide. This set of keywords, which we represent as Kdj, defines a taxonomy of domain dj. Then, using the Analytical Hierarchy Process or AHP method (Saaty, 1970), the experts of each domain are supported in defining, according to their expertise, the weights of the discriminating function associated with their domain (Stegmayer, Taverna, Chiotti, & Galli, 2001). One initial discriminating function is defined for each domain; together, these functions can be used to define a knowledge base about the information that the different system domains can provide. Such a knowledge base could give the system the capacity for routing the user's information requirements to the domains most likely to answer them. Since the discriminating functions have been defined by domain experts, classification errors are not expected to affect the system efficacy. That is to say, the domain dR that can answer a query obtains an RSV(dR, qi) > cutoff score and is classified into the DP criterion set (Potential domain set). Nevertheless, classification errors may affect the system efficiency, since other domains dj with RSV(dj, qi) > cutoff score are also assigned to the DP criterion set.
In fact, this classification error only affects the system efficiency when RSV(dj, qi) > RSV(dR, qi). In that case, domain dR ∊ DP, but it is not ranked first. Classification errors of the initial discriminating function may be due to two main causes:
It should be noted that these errors can be introduced by the experts themselves while defining the discriminating function, but they can also originate from the evolution of the domains over time. In either case, the system knowledge base must be updated to avoid these errors. This updating can be carried out by analyzing the results of all the cases in which there were classification errors, that is, the cases in which the domain dR that answered the respective query qi did not have the greatest RSV(dR, qi) and therefore was not ranked first. This process can be defined as a learning process that must be structured so that the system can carry it out automatically. In the following section, we describe the use of data mining for designing the learning process.
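To make the expert-driven initial weighting of this section more concrete, the following Python sketch derives keyword weights from a pairwise comparison matrix of the kind used in AHP. It is only a sketch under stated assumptions: it uses the row geometric mean, a common approximation to the principal-eigenvector computation of AHP, and not necessarily the procedure followed by the authors. The keywords, the comparison values, and the function name ahp_weights are hypothetical.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priorities with the row geometric mean method.

    pairwise[i][j] states how much more important item i is than item j,
    with pairwise[j][i] == 1 / pairwise[i][j] and ones on the diagonal.
    The returned weights are normalized so that they sum to 1.
    """
    n = len(pairwise)
    if n == 0 or any(len(row) != n for row in pairwise):
        raise ValueError("pairwise comparison matrix must be square and non-empty")
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical expert judgments for three keywords of one domain:
# "steel" is judged twice as important as "pipes" and four times as
# important as "schedule"; "pipes" twice as important as "schedule".
keywords = ["steel", "pipes", "schedule"]
comparisons = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

weights = dict(zip(keywords, ahp_weights(comparisons)))
print(weights)
```

For a perfectly consistent comparison matrix such as this one, the geometric mean weights coincide with the eigenvector weights; for inconsistent judgments the two methods can give slightly different results.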