We applied our method to the Chinese news articles posted daily on the Web by the CNA (Central News Agency). Two corpora were constructed for our experiments. The first corpus (CORPUS-1) contains 100 news articles posted on August 1, 2, and 3, 1996. The second corpus (CORPUS-2) contains 3,268 documents posted between October 1 and October 9, 1996. A word extraction process was applied to the corpora, yielding 1,475 Chinese words from CORPUS-1 and 10,937 from CORPUS-2. To reduce the dimensionality of the feature vectors, we discarded words that occurred only once in a document, as well as words that appeared in a manually constructed stoplist. This reduced the vocabularies to 563 words for CORPUS-1 and 1,976 for CORPUS-2, reduction rates of 62% and 82%, respectively.

To cluster CORPUS-1, we trained a self-organizing map of 64 neurons arranged in an 8×8 grid. The number of neurons was determined experimentally to achieve a good clustering. Each neuron in the map contains 563 synapses, one per vocabulary word. The initial training gain was set to 0.4 and the maximal training time to 100. These settings were also determined experimentally: we tried gain values ranging from 0.1 to 1.0 and training times ranging from 50 to 200, and adopted the setting that achieved the most satisfactory result. After training, we labeled the map by documents and by words, obtaining the DCM and the WCM for CORPUS-1. The same process, with a 20×20 map, produced the DCM and the WCM for CORPUS-2.

After the clustering process, we applied the category generation process to the DCM to obtain the category hierarchies. In our experiments, we limited the number of dominating neurons to 10, and limited the depth of the hierarchy to 2 for CORPUS-1 and 3 for CORPUS-2. Figures 4 and 5 show the overall category hierarchies developed from CORPUS-1. Each tree depicts a category hierarchy whose root node is numbered with the super-cluster it represents. The number of hierarchies equals the number of super-clusters found in the first iteration of the hierarchy generation process (STAGE-1). Each leaf node in a tree represents a cluster in the DCM, and the parent of the nodes at level n of a tree represents a super-cluster found in STAGE-(n−1). For example, the root node of the largest tree in Figure 4 is numbered 35, indicating that neuron 35 is one of the 10 dominating neurons found in STAGE-1. This node has 10 children, the 10 dominating neurons obtained in STAGE-2, which comprise the second level of the hierarchy; the third-level nodes are obtained after STAGE-3. The number enclosed in a leaf node is the neuron index of its associated cluster in the DCM. The identified category themes are used to label every node in the hierarchies. Due to space limitations, Figure 6 shows only the largest hierarchy developed from CORPUS-2. A sketch of this clustering pipeline appears below.
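The following Python sketch illustrates the kind of pipeline described above. It is a minimal illustration under assumptions not stated in the text: binary term vectors, Euclidean best-matching-unit search, and linearly decaying gain and neighborhood radius; all function and variable names are hypothetical.

```python
# Minimal sketch of the SOM clustering pipeline (assumptions noted above).
import numpy as np

def train_som(vectors, grid_w=8, grid_h=8, gain0=0.4, epochs=100, seed=0):
    """Train a grid_w x grid_h self-organizing map on document vectors."""
    rng = np.random.default_rng(seed)
    n_neurons, dim = grid_w * grid_h, vectors.shape[1]
    weights = rng.random((n_neurons, dim))            # one synapse per term
    coords = np.array([(i % grid_w, i // grid_w) for i in range(n_neurons)])
    for t in range(epochs):
        gain = gain0 * (1.0 - t / epochs)             # decaying training gain
        radius = max(1.0, (grid_w / 2) * (1.0 - t / epochs))
        for x in vectors:
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighborhood centered on the best-matching unit
            d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * radius ** 2))
            weights += gain * h[:, None] * (x - weights)
    return weights

def label_by_documents(weights, vectors):
    """Assign each document to its best-matching neuron (basis of a DCM)."""
    return [int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            for x in vectors]
```

Labeling the same trained map by words instead of documents would analogously yield a WCM.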
We examined the feasibility of our hierarchy generation process by measuring the intra-hierarchy and extra-hierarchy distances. Since text categorization performs a kind of clustering process, these two distances reveal the effectiveness of the hierarchies. A hierarchy can be considered a cluster of neurons that represent similar document clusters; because these neurons belong to the same hierarchy, they share a common theme. We therefore expect them to produce a small intra-hierarchy distance, defined by

$$\delta_{\mathrm{intra}}(h) = \frac{1}{|L_h|} \sum_{l \in L_h} \left\lVert \mathbf{w}_h - \mathbf{w}_l \right\rVert,$$

where h is the neuron index of the root node of the hierarchy, $L_h$ is the set of neuron indices of its leaf nodes, and $\mathbf{w}_i$ denotes the synaptic weight vector of neuron i. On the other hand, neurons in different hierarchies should be less similar, so we expect a large extra-hierarchy distance. The extra-hierarchy distance of hierarchy h is defined as follows:

$$\delta_{\mathrm{extra}}(h) = \frac{1}{|\overline{L}_h|} \sum_{l \in \overline{L}_h} \left\lVert \mathbf{w}_h - \mathbf{w}_l \right\rVert,$$

where $\overline{L}_h$ is the set of neuron indices of the leaf nodes of all other hierarchies.

Table 1 lists the intra- and extra-hierarchy distances for each hierarchy. Only one of the twenty hierarchies has an intra-hierarchy distance greater than its extra-hierarchy distance. We may therefore conclude that the generated hierarchies successfully divide the document clusters into appropriate hierarchies.
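A short sketch of this separation check, using the distance definitions as reconstructed above; the `hierarchies` mapping (root neuron index to the set of its leaf-node neuron indices) is a hypothetical data structure.

```python
# Intra-/extra-hierarchy distance check (hypothetical data layout).
import numpy as np

def intra_distance(weights, h, leaves):
    """Mean distance from root neuron h to its own leaf neurons."""
    return np.mean([np.linalg.norm(weights[h] - weights[l]) for l in leaves])

def extra_distance(weights, h, hierarchies):
    """Mean distance from root h to the leaf neurons of all other hierarchies."""
    others = [l for r, ls in hierarchies.items() if r != h for l in ls]
    return np.mean([np.linalg.norm(weights[h] - weights[l]) for l in others])

def check_separation(weights, hierarchies):
    # A hierarchy is well separated when intra < extra.
    return {h: intra_distance(weights, h, ls) < extra_distance(weights, h, hierarchies)
            for h, ls in hierarchies.items()}
```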
We also examined the feasibility of the theme identification process by comparing the overall importance of an identified theme with that of the other terms associated with the same category. Let $t_n$ be the term corresponding to the nth component of a neuron's synaptic weight vector. For any category k, we calculated the average synaptic weight of each term over $C_k$, the set of neurons belonging to the category. The average synaptic weight of $t_n$ over category k is

$$\bar{w}_{n,k} = \frac{1}{|C_k|} \sum_{j \in C_k} w_{jn},$$

where $w_{jn}$ is the nth synaptic weight of neuron j. Table 2 lists the rank of each identified theme among all terms for every hierarchy. The identified themes are generally the most important terms of their categories, and are therefore appropriate themes for these hierarchies.
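The rank computation can be sketched as follows, under the averaging scheme reconstructed above; `category_neurons` (the indices in $C_k$) and `theme_idx` (the term's component index) are hypothetical names.

```python
# Rank of an identified theme by average synaptic weight over a category.
import numpy as np

def theme_rank(weights, category_neurons, theme_idx):
    """1-based rank of the theme among all terms for one category."""
    avg = weights[category_neurons].mean(axis=0)        # mean over C_k, per term
    order = np.argsort(avg)[::-1]                       # descending importance
    return int(np.where(order == theme_idx)[0][0]) + 1
```

A rank of 1 would mean the identified theme has the highest average synaptic weight in its category.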