In order to improve the training process of the neural network, input data must often be preprocessed. Here, we use preprocessing to mean standardizing the data according to some method, although the term also covers the "cleaning" of data, i.e., the removal of errors, missing values, and inconsistencies. If the data is not preprocessed, the neural network may expend its learning time on variables with large variances or values, and therefore ignore variables with smaller variances or values. Preprocessing of data is a much-discussed topic in the literature on neural networks. Sarle (2001) suggests that the most suitable form of normalization centers the input values around zero, instead of, for example, within the interval [0,1]; this would imply the use of, for example, normalization by standard deviation. Kohonen (1997) suggests two methods for standardizing the data: normalization by standard deviation and heuristically justifiable rescaling. Another method, suggested by Klimasauskas (1991) and used, for example, by Back et al. (1998, 2000), is histogram equalization. Histogram equalization maps rare events to a small part of the data range and spreads out frequent events, so that the network is better able to discriminate between rare and frequent events.

Normally, the values should be scaled according to their relative importance. However, according to Kaski and Kohonen (1996), this is not necessary when no differences in importance can be assumed. As this is also the case in this experiment, we have not used any form of scaling according to importance. Instead, the relative importance of the different categories of ratios has been set through the balance of ratios (three in profitability, two in solvency, one in liquidity, and one in efficiency). In this experiment, the data has been standardized according to the variance of the entire dataset (Equations 5 and 6), the method also used by Kaski and Kohonen (1996):

\sigma^2 = \frac{1}{MN}\sum_{j=1}^{M}\sum_{i=1}^{N}\left(x_{ij}-\bar{x}\right)^2 \qquad (5)

x'_{ij} = \frac{x_{ij}}{\sigma} \qquad (6)

where M = number of ratios, N = number of observations, x_{ij} = the value of ratio j for observation i, and \bar{x} = the mean of all values in the dataset.

Normalizing the data according to the standard deviation was tried, but the maps obtained using this method were unsatisfactory for clustering purposes. Also, in order to achieve feasible results, a preliminary rescaling had to be done. This meant replacing extreme values with 50 (positive or negative). The reason for this is that a number of the ratios reached extremely high values, which caused the SOM to devote one area of the map to the extreme values, with the rest of the map remaining flat and uninterpretable. Therefore, such extreme values were replaced in this experiment.

The SOM_PAK 3.1 program package was used to train the maps in this experiment. SOM_PAK is a program package created by a team of researchers at the Neural Networks Research Centre (NNRC) at the Helsinki University of Technology (HUT). SOM_PAK can be downloaded from http://www.cis.hut.fi/research/som_pak/, and may be freely used for scientific purposes. Readers are referred to the program package for sample data. The maps were visualized using Nenet 1.1, another downloadable program. A limited demo version of Nenet is available at http://koti.mbnet.fi/~phodju/nenet/Nenet/Download.html. Nenet is actually a complete SOM training program, but the demo version is severely limited, and it has been used here only for visualization, since it illustrates the maps in shades of green instead of black and white. In our opinion, this makes the maps easier to interpret.
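As an illustration, the following is a minimal Python/NumPy sketch of this preprocessing step. The function name, the example data, and the assumption that the replacement of extreme values precedes the standardization are ours, not taken from the original study.

```python
import numpy as np

def preprocess_ratios(X, clip=50.0):
    """Standardize an (N observations x M ratios) matrix by the standard
    deviation of the entire dataset (Equations 5 and 6), after replacing
    extreme values with +/-50 as described in the text."""
    X = np.asarray(X, dtype=float)

    # Preliminary rescaling: replace extreme ratio values with +/-50.
    X = np.clip(X, -clip, clip)

    # Variance over all M*N values (not per column), as in Equation 5.
    sigma = np.sqrt(np.mean((X - X.mean()) ** 2))

    # Each value divided by the standard deviation of the whole dataset (Equation 6).
    return X / sigma

# Example: three observations of seven ratios, one with an extreme outlier.
ratios = np.array([[0.12, 0.08, 0.30, 1.5, 2.1, 0.9, 0.4],
                   [0.05, 0.02, 0.10, 1.2, 1.8, 1.1, 0.3],
                   [900.0, 0.11, 0.25, 1.7, 2.4, 0.8, 0.5]])
print(preprocess_ratios(ratios).round(2))
```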
Several hundred maps were trained during the course of the experiment. The first maps were trained using parameters selected according to the guidelines presented in the previous section. The best maps, rated according to quantization error and ease of readability, were then selected and used as a basis when training further maps. The final selected network size was 7 × 5. We felt that this map size offered the best balance between cluster identification and movement illustration. Clusters were easier to identify than on a 9 × 6 map, and movements within the clusters were easier to identify than on a 5 × 4 map. A smaller map could have been used if a separate map had been trained for each year, but our intention was to use the same map for the entire dataset. A 7 × 5 map seemed large enough to incorporate the data for each year included in the test. The 7 × 5 lattice also conforms to the recommendation that the x-axis be approximately 1.3 times the y-axis.

The simpler neighborhood-set function, or bubble function, which refers to the set of nodes N_c around the winning node (Kohonen et al., 1996), is preferred over the Gaussian function for h_{ci}(t) when training smaller maps (Kohonen, 1997), and was therefore used in this experiment as well. The number of steps used in the final phase was derived directly from the recommendations provided in the previous section; accordingly, the initial phase comprised 1,750 steps and the final phase 17,500 steps. The learning rate factor was set to 0.5 in the first phase and 0.05 in the second, also as recommended. The neighborhood radius was set to 12 for the first phase and 1.2 for the second. The initial radius was very large compared to the recommendations, but seemed to provide the best maps overall. As the first phase is intended for rough training, the initial radius was allowed to cover the entire map. In the fine-tuning phase, the radius was reduced considerably. Decreasing the radius in the first phase only resulted in poorer maps.

Kohonen (1997) noted that the selection of parameters appears to make little difference in the outcome when training small maps. This also appears to be the case in this experiment. As long as the initial parameters remained near the guidelines presented above, the changes in quantization error were very small, usually as little as 0.001. Some examples of the parameters and outcomes are illustrated in Table 2. These are only a fraction of the entire training set, but they illustrate well the small differences in results.
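For readers who want to experiment without SOM_PAK, the following is a minimal Python/NumPy sketch of the two-phase training schedule with a bubble neighborhood, together with the quantization error used to rate the maps. It assumes a rectangular 7 × 5 lattice, random initialization, linearly decreasing learning rate and radius, and synthetic stand-in data; SOM_PAK's own implementation (e.g., hexagonal lattices and its exact decay schedules) differs in detail, and all variable names here are ours.

```python
import numpy as np

def som_phase(data, codebook, grid, steps, alpha0, radius0, rng):
    """One SOM training phase with a bubble neighborhood: every node within
    the current radius of the winning node is moved equally toward the sample."""
    n = len(data)
    for t in range(steps):
        frac = t / steps
        alpha = alpha0 * (1.0 - frac)               # learning rate decreases linearly
        radius = max(radius0 * (1.0 - frac), 1.0)   # radius shrinks, never below 1
        x = data[rng.integers(n)]
        bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))          # best-matching unit
        bubble = np.linalg.norm(grid - grid[bmu], axis=1) <= radius
        codebook[bubble] += alpha * (x - codebook[bubble])
    return codebook

def quantization_error(data, codebook):
    """Average Euclidean distance from each data vector to its best-matching unit."""
    d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

# 7 x 5 lattice, random codebook, and the two-phase schedule described above:
# 1,750 / 17,500 steps, learning rates 0.5 / 0.05, radii 12 / 1.2.
rng = np.random.default_rng(0)
data = rng.standard_normal((500, 7))        # stand-in for the preprocessed ratios
xdim, ydim = 7, 5
grid = np.array([(x, y) for x in range(xdim) for y in range(ydim)], dtype=float)
codebook = rng.uniform(data.min(), data.max(), size=(xdim * ydim, data.shape[1]))

codebook = som_phase(data, codebook, grid, 1750, 0.5, 12.0, rng)    # rough ordering
codebook = som_phase(data, codebook, grid, 17500, 0.05, 1.2, rng)   # fine tuning
print(round(quantization_error(data, codebook), 3))
```

The bubble update moves every node inside the radius by the same amount, which is why a large initial radius effectively orders the entire map during the rough phase before the fine-tuning phase sharpens local detail.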
Table 2 shows that the changes in quantization error are very small, irrespective of the parameters used. The map that was finally chosen was Map 1. It is notable that this map was trained using parameters generated directly from the recommendations above, with the exception of the neighborhood radius. Map 7 has a marginally better quantization error, but the difference is negligible, so the map closer to the original recommendations, Map 1, was selected. The appearance of the maps was monitored throughout the experiment, but only very small differences in the resulting maps surfaced. Although the maps might look slightly different, the same clusters, containing approximately the same companies, were found in the same positions relative to each other. While the "good" end of one map might be found on the opposite side of another map, the same clusters could still be seen to emerge. This reflects the random initialization process of the self-organizing map, but also shows that the results are consistent from one map to another.

A single map including all five years of data was trained. By studying the final U-matrix (Figure 4a) and the underlying feature planes of the map (Appendix), a number of clusters of companies, and the characteristics of these clusters, can be identified (Figure 4b). The groups and their characteristics are presented as follows:
The characteristics of the groups are summarized in Table 3.