A study on de-duplication

Abstract

We present two algorithms for calculating the string dissimilarity percentage in a de-duplication system. The algorithms are multi-level clustering, which incorporates constraints to reduce the volume of data, and Information Gain (IG), which is used to calculate dissimilarity. The proposed system first separates the records into block-sized subsets using a clustering algorithm and then applies the subset values to IG. Most existing systems depend on generic or manually tuned distance metrics to estimate similarity. We ran extensive experiments on large data sets and compared the results with various versions of existing algorithms; the new system reduces the time spent on string comparison and achieves higher average accuracy than the existing systems.

None of the existing systems produces the dissimilarity percentage between pairs of strings in a given data set. We present an efficient solution for calculating the string dissimilarity percentage in a de-duplication system using Multi-Level Clustering (MLC) and Information Gain. The algorithms work in two phases: Multi-Level Clustering construction and text dissimilarity calculation. Our methods reduce the time needed to find a duplicate record and use less memory than the existing method.
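The two-phase flow described above (block the records first, then score pairs only within a block) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the clustering constraints or how Information Gain is turned into a dissimilarity score, so the sketch blocks on hypothetical character-prefix keys and swaps in Python's `difflib.SequenceMatcher` ratio as a stand-in metric. The names `block_key`, `multi_level_blocks`, `dissimilarity_percent`, `find_duplicates`, and the 20% threshold are likewise hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record: str, level: int) -> str:
    """Hypothetical blocking key: first `level` characters of the
    lowercased record with whitespace removed."""
    return "".join(record.lower().split())[:level]


def multi_level_blocks(records, levels=(1, 2)):
    """Phase 1 (assumed form): split the records level by level, so each
    level subdivides the blocks produced by the previous one."""
    blocks = [list(records)]
    for level in levels:
        next_blocks = []
        for block in blocks:
            groups = defaultdict(list)
            for rec in block:
                groups[block_key(rec, level)].append(rec)
            next_blocks.extend(groups.values())
        blocks = next_blocks
    return blocks


def dissimilarity_percent(a: str, b: str) -> float:
    """Phase 2 stand-in: 100 * (1 - SequenceMatcher ratio).
    This is NOT the Information Gain measure from the thesis."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 100.0 * (1.0 - ratio)


def find_duplicates(records, threshold=20.0):
    """Compare only pairs that share a block and keep those whose
    dissimilarity is at or below the (assumed) threshold."""
    pairs = []
    for block in multi_level_blocks(records):
        for a, b in combinations(block, 2):
            score = dissimilarity_percent(a, b)
            if score <= threshold:
                pairs.append((a, b, score))
    return pairs


if __name__ == "__main__":
    sample = ["John Smith", "Jon Smith", "Mary Jones", "john smith"]
    for a, b, score in find_duplicates(sample):
        print(f"{a!r} vs {b!r}: {score:.1f}% dissimilar")
```

Because pairs are generated only inside a block, the number of string comparisons grows with block size and not with the full data set, which is the source of the time saving the abstract claims. Records that land in different blocks are never compared, so the choice of blocking key decides which duplicates can be found at all.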

