A study on de duplication
Abstract
We present two algorithms for calculating the string dissimilarity percentage in a de-duplication system. The first uses multiple levels of clustering, incorporating constraints to reduce the volume of data; the second uses Information Gain (IG) to calculate dissimilarity. The proposed system first separates the records into block-sized subsets using a clustering algorithm and then passes each subset's values to IG. Most existing systems depend on generic or manually tuned distance metrics for estimating similarity. We ran extensive experiments on large data sets and compared the results with various versions of existing algorithms; the new system reduces the time spent on string comparison and achieves higher average accuracy than the existing systems.

None of the existing systems produces the dissimilarity percentage between a pair of strings in a given data set. Here we present an efficient solution for calculating the string dissimilarity percentage in a de-duplication system using Multi-Level Clustering (MLC) and Information Gain. Our algorithms work in two phases: Multi-Level Clustering construction and text dissimilarity calculation. Our methods reduce the time needed to find a duplicate record and use less memory than the existing method.
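
The two-phase idea (group records into blocks first, then score string pairs only inside each block) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the two-level blocking key in `block_key` is hypothetical, and normalized Levenshtein distance is swapped in as the dissimilarity measure because the abstract does not define how Information Gain is applied. The function names and the `threshold` parameter are likewise invented.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def dissimilarity_percent(a: str, b: str) -> float:
    """Stand-in measure (NOT the thesis's IG-based one): edit distance
    as a percentage of the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * edit_distance(a, b) / longest


def block_key(record: str) -> tuple:
    """Hypothetical two-level key: initial of the first token,
    then initial of the last token."""
    tokens = record.strip().lower().split()
    if not tokens:
        return ("", "")
    return (tokens[0][:1], tokens[-1][:1])


def find_duplicates(records, threshold=20.0):
    """Phase 1: group records into blocks. Phase 2: score pairs within
    each block and keep those at or below the dissimilarity threshold."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)

    results = []
    for members in blocks.values():
        for a, b in combinations(members, 2):
            score = dissimilarity_percent(a.lower(), b.lower())
            if score <= threshold:
                results.append((a, b, score))
    return results
```

Because pairs are compared only within a block, the number of string comparisons grows with block size rather than with the size of the whole data set, which is the kind of saving the abstract attributes to clustering before comparison.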