Data Mining: Opportunities and Challenges
We have shown how a structured parallel approach can reduce the complexity of parallel application design, and that the approach can be usefully applied to commonly used DM algorithms. The ease of sequential-to-parallel conversion and the good code reuse it allows are valuable in the DM field, where applications and implementation solutions need to be prototyped quickly. Performance is achieved through careful design of the parallel structure of the application, with low-level details left to the compiler and the parallel language support.
Within the structured parallelism framework, the proposal of external objects aims at unifying the interfaces to different data management services: in-core memory, shared memory, local and parallel file systems, DBMS, and data transport layers. By decoupling the algorithm structure from the details of data access, we increase architecture independence, and we allow the language support to implement the accesses in the best way, according to the size of the data and the underlying software and hardware layers. These features are very important for merging high-performance algorithms into DM environments for large-scale databases. Such a vision is strongly called for in the literature; nevertheless, only sequential DM tools currently address integration issues. On the grounds of the experiments described here with the SkIE environment, we are designing full support for external objects in the new structured programming environment, ASSIST.
Several of the points we have mentioned are still open research problems. One of the questions is which levels of the implementation will exploit parallelism. The development of massively parallel DBMS systems and the progressive adoption of parallel file system servers will both have a profound impact on high-performance DM, with results that are not easy to foresee.
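The decoupling of algorithm structure from data access that external objects aim at can be illustrated with a minimal sequential sketch. It is not the SkIE or ASSIST interface: the names `ExternalObject`, `InCoreObject`, `FileObject`, `read_block`, and `item_counts` are all hypothetical, chosen only to show one counting routine running unchanged over an in-core store and a file-backed store.

```python
from abc import ABC, abstractmethod
from collections import Counter


class ExternalObject(ABC):
    """Hypothetical uniform interface to a collection of text records."""

    @abstractmethod
    def __len__(self):
        """Return the number of records."""

    @abstractmethod
    def read_block(self, start, count):
        """Return up to `count` records starting at index `start`."""


class InCoreObject(ExternalObject):
    """Records held in an ordinary in-memory list."""

    def __init__(self, records):
        self._records = list(records)

    def __len__(self):
        return len(self._records)

    def read_block(self, start, count):
        return self._records[start:start + count]


class FileObject(ExternalObject):
    """Records stored one per line in a text file (assumed UTF-8)."""

    def __init__(self, path):
        self._path = path
        self._offsets = []  # byte offset of each line, built in one scan
        with open(path, "rb") as f:
            pos = 0
            for line in f:
                self._offsets.append(pos)
                pos += len(line)

    def __len__(self):
        return len(self._offsets)

    def read_block(self, start, count):
        if start >= len(self._offsets):
            return []
        block = []
        with open(self._path, "rb") as f:
            f.seek(self._offsets[start])
            for _ in range(count):
                line = f.readline()
                if not line:
                    break
                block.append(line.decode("utf-8").rstrip("\r\n"))
        return block


def item_counts(data, block_size=2):
    """Count item occurrences block by block, knowing only the interface."""
    counts = Counter()
    for start in range(0, len(data), block_size):
        for record in data.read_block(start, block_size):
            counts.update(record.split())
    return counts
```

Because `item_counts` only calls `len` and `read_block`, the choice of storage (and of block size) can be changed without touching the algorithm, which is the kind of freedom the text attributes to the language support.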
We believe that high-level parallel languages can also play an important role in the organization and coordination of Grid computational resources into complex applications. Executing collective DM tasks over distributed systems requires finding the right balance between result accuracy, reduction of data movement, and balancing of the computational workload. To avoid having to deal with ever more complex management details at the same time, ASSIST will actively support Grid protocols and communication libraries.