Efficient Hybrid Distributed Document Clustering Method for Large Datasets

Abstract

The growing volume of data to be analyzed poses new challenges for data mining methodologies. Conventional data mining techniques such as clustering assume centralized operation on data and are computationally expensive in terms of execution time. Clustering of large datasets has received considerable attention in the past few decades in several application areas such as document categorization and retrieval.

This thesis deals with improving the performance of clustering for large, high-dimensional, distributed document datasets. The challenges addressed are the initial centroids problem and the dimensionality problem. Both are addressed using the Hadoop-MapReduce model for distributed storage and analysis, which supports the processing of large document datasets and serves as the basis for the distributed clustering algorithms developed here. The thesis proposes three methods for distributed clustering: MapReduce K-Means (MR-KMeans) based distributed document clustering, a distributed document clustering method based on MapReduce PSO-KMeans (MR-PKMeans), and a hybrid distributed document clustering method (MR-Hybrid). Extensive evaluations are performed, resulting in optimized and semantically related document clusters with high quality and speedup.

In the MapReduce K-Means (MR-KMeans) based distributed document clustering method, the algorithm is modeled with an efficient similarity measure on the Hadoop framework, with the main objective of improving the clustering quality and speedup of a localized clustering solution. The method starts from random initial centroids and converges to locally optimized clusters. The stages of the clustering process (similarity calculation, assignment of documents to clusters, and recalculation of the cluster centroids) are all based on the MapReduce methodology.
Results on large document datasets show that such a framework with an efficient method of determ
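
The three MR-KMeans stages named above (similarity calculation, assignment of documents to clusters, and centroid recalculation) can be illustrated with a minimal single-process sketch in Python. This is not the thesis implementation: it does not use Hadoop, the function names are hypothetical, and cosine similarity is an assumption, since the abstract does not say which similarity measure is used. Documents are assumed to be already represented as equal-length numeric vectors.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_assign(doc, centroids):
    """Map step: emit (index of the most similar centroid, document vector)."""
    best = max(range(len(centroids)), key=lambda i: cosine(doc, centroids[i]))
    return best, doc


def reduce_centroid(vectors):
    """Reduce step: the new centroid is the component-wise mean of its documents."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans_iteration(docs, centroids):
    """One K-Means iteration expressed as map, group-by-key, and reduce."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for doc in docs:
        key, value = map_assign(doc, centroids)
        groups[key].append(value)
    # A centroid that received no documents is kept unchanged.
    return [
        reduce_centroid(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]
```

In a distributed setting the map calls would run in parallel over document partitions and the iteration would repeat until the centroids stop changing; here a plain loop and a dictionary stand in for both.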
