Data Mining: Opportunities and Challenges

Chapter VIII - Mining Text Documents for Thematic Hierarchies Using Self-Organizing Maps
Data Mining: Opportunities and Challenges
by John Wang (ed) 
Idea Group Publishing 2003

Brought to you by Team-Fly

To obtain the category hierarchy, we first perform a clustering process on the corpus. We then apply a category hierarchy generation process to the clustering result and obtain the category hierarchy. This section describes the clustering process. We will start with the preprocessing steps, follow by the clustering process by the SOM learning algorithm. Two labeling processes are then applied to the trained result. After the labeling processes, we obtain two feature maps that characterize the relationship between documents and words, respectively. Two maps are obtained after the labeling processes. The category hierarchy generation process, which will be described in the next section, is then applied to these two maps to develop the category hierarchy.

Preprocessing and Encoding Documents

Our approach begins with a standard practice in information retrieval (Salton & McGill, 1983), i.e., the vector space model, to encode documents with vectors, in which each element of a document vector corresponds to a different indexed term. In this work the corpus contains a set of Chinese news documents from the Central News Agency (CNA). First, we extract index terms from the documents. Traditionally, there are two schemes for extracting terms from Chinese texts. One is a character-based scheme and the other is a word-based scheme (Huang & Robertson, 1997b). We adopted the second scheme because individual Chinese characters generally carry no context-specific meaning. Two or more Chinese characters compose a word in Chinese. After extracting words, they are used as index terms to encode the documents. We use a binary vector to represent a document. A value of 1 for an element in a vector indicates the presence of the corresponding word in the document; otherwise, a value of 0 indicates the absence of the word.

A problem with this encoding method is that if the vocabulary is very large, the dimensionality of the vector is also high. In practice, the resulting dimensionality of the space is often tremendously huge, since the number of dimensions is determined by the number of distinct index terms in the corpus. In general, feature spaces on the order of 1,000 to 100,000 are very common for even reasonably small collections of documents. As a result, techniques for controlling the dimensionality of the vector space are required. Such a problem could be solved, for example, by eliminating some of the most common and some of the most rare indexed terms in the preprocessing stage. Several other techniques may also be used to tackle the problem; e.g., multidimensional scaling (Cox & Cox, 1994), principal component analysis (Jolliffe, 1986), and latent semantic indexing (Deerwester, Dumais, Furnas, & Landauer, 1990).

In information retrieval, several techniques are widely used to reduce the number of index terms. Unfortunately, these techniques are not fully applicable to Chinese documents. For example, stemming is generally not necessary for Chinese texts. On the other hand, we can use stop words and a thesaurus to reduce the number of index terms. In this work, we manually constructed a stop list to filter out the meaningless words in the texts. We did not use a thesaurus simply because it was not available. Actually, the self-organizing process will cluster similar words together (Lee & Yang, 1999) and automatically generate a thesaurus. We have initiated a project for this part of research. We believe that the dimensionality of the document vectors can be dramatically reduced if we successfully applied such a thesaurus.

A document in the corpus contains about 100-200 characters. We discarded over-lengthy (more than 250 words) and duplicate documents for a better result. This step is not necessary because it affects less than 2% of documents in the corpus. Keeping these documents may only slightly increase the number of index terms and the processing time. However, if the corpus contains many duplicate or over-lengthy documents, the clustering result may be deteriorated.

We used the binary vector scheme to encode the documents and ignore any kind of term weighting schemes. We decided to use the binary vector scheme due to the following reasons. First, we clustered documents according to the co-occurrence of the words, which is irrelevant to the weights of the individual words. Second, our experiments showed no advantage in the clustering result to using term weighting schemes (classical tf and tf idf schemes were used), but an additional amount of computation time was wasted on computing the weights. As a result, we believe the binary scheme is adequate to our needs.

Generating the Word Cluster Map and Document Cluster Map

The documents in the corpus were first encoded into a set of vectors by the terms that appeared in the documents, as in a vector space model. We intended to organize these documents into a set of clusters such that similar documents would fall into the same cluster. Moreover, similar clusters should be "close" in some regard. That is, we should be able to organize the clusters such that clusters that contain similar documents should be close in some measurement space. The unsupervised learning algorithm of SOM networks (Kohonen, 1997) meets our needs. The SOM algorithm organizes a set of high-dimensional vectors into a two-dimensional map of neurons according to the similarities among the vectors. Similar vectors, i.e., vectors with small distance, will map to the same or nearby neurons after the training (or learning) process. That is, the similarity between vectors in the original space is preserved in the mapped space. Applying the SOM algorithm to the document vectors, we actually perform a clustering process about the corpus. A neuron in the map can be considered as a cluster. Similar documents will fall into the same or neighboring neurons (clusters). In addition, the similarity of two clusters can be measured by the geometrical distance between their corresponding neurons. To decide the cluster to which a document or a word belongs, we apply a labeling process to the documents and the words respectively. After the labeling process, each document associates with a neuron in the map. We record such associations and form the document cluster map (DCM). In the same manner, we label each word to the map and form the word cluster map (WCM). We then use these two maps to generate the category themes and hierarchy.

We define some denotations and describe the training process here. Let xi ={xin|1≤n≤N}, 1≤i≤M, be the encoded vector of the ith document in the corpus, where N is the number of indexed terms and M is the number of the documents. We used these vectors as the training inputs to the SOM network. The network consists of a regular grid of neurons. Each neuron in the network has N synapses. Let wj ={wjn|1≤n≤N}, 1≤j≤J, be the synaptic weight vector of the jth neuron in the network, where J is the number of neurons in the network. Figure 1 depicts the formation of the map. We trained the network by the SOM algorithm:

  • Step 1. Randomly select a training vector xi from the corpus.

  • Step 2. Find the neuron j with synaptic weights wj that is closest to xi, i.e.,

  • Step 3. For every neuron l in the neighborhood of neuron j, update its synaptic weights by

    where α(t) is the training gain at time stamp t.

  • Step 4. Increase time stamp t. If t reaches the preset maximum training time T, halt the training process; otherwise decrease α(t) and the neighborhood size, go to Step 1.

Figure 1: The formation of neurons in the map.

The training process stops after time T is sufficiently large so that every vector may be selected as training input for certain times. The training gain and neighborhood size both decrease when t increases.

After training, we perform a labeling process to establish the association between each document and one of the neurons. The labeling process is described as follows. Each document feature vector xi,1≤ i ≤ M is compared to every neuron in the map. We label the ith document to the jth neuron if they satisfy Eq. 1. After the labeling process, each document is labeled to some neuron or, from a different viewpoint, each neuron is labeled with a set of documents. We record the labeling result and obtain the DCM. In the DCM, each neuron is labeled with a list of documents that are considered similar and are in the same cluster.

We explain why the SOM algorithm performs a clustering process here. In the labeling process, those documents that contain similar words will map to the same or neighboring neurons. Since the number of the neurons is usually much smaller than the number of the documents in the corpus, multiple documents may label to the same neuron. Thus, a neuron forms a document cluster. In addition, neighboring neurons represent document clusters of similar meaning, i.e., high word co-occurrence frequency in our context. On the other hand, it is possible that some neurons may not be labeled by any document. We call these neurons the unlabeled neurons. Unlabeled neurons exist when one of two situations occur: 1) when the number of documents is considerably small compared to the number of neurons; or 2) when the corpus contains too many conceptually similar documents such that a great part of documents will fall into a small set of neurons. However, unlabeled neurons will not diminish the result of the clustering since they do not affect the similarity measurement between any pair of clusters.

The map forms the WCM by labeling each neuron with certain words. This is achieved by examining the neurons' synaptic weight vectors. The labeling process is based on the following observations. Since we use binary representation for the document feature vectors, ideally the trained map should consist of synaptic weight vectors with component values near either 0 or 1. Since a value of 1 in a document vector represents the presence of a corresponding word in that document, a component with value near 1 in a synaptic weight vector also shows that such neuron has recognized the importance of the word and tried to "learn" the word. According to such interpretation, we design the following word labeling process. For the weight vector of the jth neuron wj, if its nth component exceeds a predetermined threshold, the corresponding word of that component is labeled to this neuron. To achieve a better result, the threshold is a real value near 1. By virtue of the SOM algorithm, a neuron may be labeled by several words that often co-occurred in a set of documents. Thus, a neuron forms a word cluster. The labeling method may not completely label every word in the corpus. We call these words the unlabeled words. Unlabeled words happen when several neurons compete for a word during the training process. The competition often results in imperfect convergence of weights, As a result, some words may not be learned well; i.e., their corresponding components may not have values near 1 in any neuron's weight vectors. We solve this problem by examining all the neurons in the map and labeling each unlabeled word to the neuron with the largest value of the corresponding component for that word. That is, the nth word is labeled to the jth neuron if

Note that we ignore the unlabeled neurons in Eq. 3.

The WCM autonomously clusters words according to their similarity of co-occurrence. Words that tend to occur simultaneously in the same document will be mapped to neighboring neurons in the map. For example, the translated Chinese words for "neural" and "network" often occur simultaneously in a document. They will map to the same neuron, or neighboring neurons, in the map because their corresponding components in the encoded document vector are both set to 1. Thus, a neuron will try to learn these two words simultaneously. Conversely, words that do not co-occur in the same document will map to distant neurons in the map. Thus, we can reveal the relationship between two words according to their corresponding neurons in the WCM.

Brought to you by Team-Fly

Категории