Efficient Data Deduplication for Big Data Storage System
Abstract
In this thesis, the prime focus is optimizing the deduplication system by adjusting two key ingredients: declaring chunk cut-points by tuning the pertinent parameters of content-defined chunking (CDC), and performing efficient fingerprint lookup through on-disk secure bucket-based index partitioning. Firstly, a Differential Evolution (DE) based chunking approach, named TTTD-P, is proposed to optimize the Two Thresholds Two Divisors (TTTD) CDC algorithm; it significantly reduces the number of computing operations by replacing TTTD's multiple divisors with a single dynamically optimized divisor D and an optimal threshold value T, exploiting the multi-operation nature of TTTD. To reduce chunk-size variance, the TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut-points. However, the additional divisor D′ decreases chunking throughput, meaning that TTTD aggravates the performance bottleneck of Rabin's CDC.
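The two-divisor cut-point rule described above can be sketched in Python; this is a hypothetical illustration with example threshold and divisor values (T_MIN, T_MAX, D, D′), and a toy rolling hash standing in for TTTD's Rabin fingerprints:

```python
# Illustrative TTTD parameters (not the thesis's tuned values).
T_MIN, T_MAX = 460, 2800     # minimum / maximum chunk size in bytes
D, D_BACKUP = 540, 270       # main divisor D and backup divisor D'

def tttd_chunks(data: bytes):
    """Split data into chunks using the TTTD two-divisor rule."""
    chunks, start = [], 0
    while start < len(data):
        backup = -1
        end = min(start + T_MAX, len(data))
        cut = end                                # fall back to T_MAX (or EOF)
        h = 0
        for i in range(start, end):
            h = (h * 2 + data[i]) & 0xFFFFFFFF   # toy rolling hash
            if i - start < T_MIN:
                continue                         # enforce minimum chunk size
            if h % D_BACKUP == D_BACKUP - 1:
                backup = i + 1                   # remember backup cut-point
            if h % D == D - 1:
                cut = i + 1                      # main divisor fired: cut here
                break
        else:
            if cut == start + T_MAX and backup != -1:
                cut = backup                     # use backup cut near T_MAX
        chunks.append(data[start:cut])
        start = cut
    return chunks
```

Because D = 2·D′ here, every main cut-point also satisfies the backup condition, which mirrors why the backup divisor finds cut-points with higher probability.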
To this end, Asymmetric Extremum (AE) CDC significantly improves chunking throughput while providing comparable deduplication efficiency: it declares cut-points at local extreme values found in a variable-sized asymmetric window, overcoming the chunking problems of Rabin CDC and TTTD.
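A minimal sketch of the AE idea, assuming byte values stand in for the window values of the published algorithm and an illustrative fixed right-window size w: a cut-point is declared once a local maximum has survived w following bytes without being exceeded.

```python
W = 256  # illustrative fixed right-window size

def ae_chunks(data: bytes, w: int = W):
    """Asymmetric Extremum sketch: variable left window, fixed right window."""
    chunks, start = [], 0
    while start < len(data):
        max_val, max_pos = data[start], start
        cut = len(data)                          # default: cut at EOF
        for i in range(start + 1, len(data)):
            if data[i] > max_val:
                max_val, max_pos = data[i], i    # new local extremum
            elif i - max_pos == w:
                cut = i + 1                      # extremum survived w bytes: cut
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks
```

Each byte costs only a comparison, with no per-byte division, which is where AE's throughput advantage over divisor-based schemes comes from.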
After AE, the efficient FastCDC approach was developed using fast gear-based hashing. AE and FastCDC, however, increase chunking throughput only and suffer in deduplication ratio (DR), whereas enhancing storage space by eliminating redundant data maximally is the prime objective of today's big data storage systems, which use Hadoop technology in cloud computing to accommodate massive volumes of data.
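The gear-based hash judgement that FastCDC builds on can be sketched as below; the randomly seeded gear table and mask width are illustrative, and FastCDC's normalized chunking and min/max size bounds are omitted for brevity:

```python
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # 256-entry gear table
MASK = (1 << 13) - 1                                 # ~8 KiB expected chunks

def gear_chunks(data: bytes):
    """One shift + one table lookup per byte; cut when masked bits are zero."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK) == 0:                          # hash judgement: cut here
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks
```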
Secondly, the fingerprint generation stage of data deduplication uses the cryptographic secure hash function SHA-1 to secure big data storage through a key-value store: the key is a chunk's fingerprint and the value points to the data chunk.
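A minimal sketch of this fingerprint-indexed key-value idea, with an in-memory dict standing in for the real chunk store; a repeated chunk is detected by a single fingerprint lookup and stored only once:

```python
import hashlib

class DedupStore:
    """Toy chunk store keyed by SHA-1 fingerprints."""

    def __init__(self):
        self.index = {}                      # fingerprint -> chunk data
        self.logical = self.physical = 0     # bytes before / after dedup

    def put(self, chunk: bytes) -> str:
        fp = hashlib.sha1(chunk).hexdigest()
        self.logical += len(chunk)
        if fp not in self.index:             # new chunk: store it once
            self.index[fp] = chunk
            self.physical += len(chunk)
        return fp                            # file recipe keeps fingerprints only

    def dedup_ratio(self) -> float:
        return self.logical / max(self.physical, 1)
```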
Moreover, deduplication technology also faces the duplicate-lookup disk bottleneck: the complete index of data chunks with their fingerprints is too large to keep in memory and must be stored on disk.
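One way to sketch such bucket-based index partitioning, with in-memory dicts standing in for on-disk bucket files and an assumed fingerprint-prefix mapping: a duplicate lookup then loads one small bucket instead of scanning the whole index.

```python
import hashlib

NUM_BUCKETS = 16  # illustrative bucket count

def bucket_of(fp: str) -> int:
    return int(fp[:2], 16) % NUM_BUCKETS     # fingerprint prefix selects bucket

class BucketedIndex:
    """Toy partitioned fingerprint index; dicts model per-bucket files."""

    def __init__(self):
        self.buckets = [dict() for _ in range(NUM_BUCKETS)]

    def insert(self, chunk: bytes) -> bool:
        """Return True if the chunk was new, False if it is a duplicate."""
        fp = hashlib.sha1(chunk).hexdigest()
        bucket = self.buckets[bucket_of(fp)]
        if fp in bucket:
            return False                     # duplicate found in one bucket
        bucket[fp] = len(chunk)              # stand-in for on-disk chunk location
        return True
```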