Efficient Data Deduplication for Big Data Storage System
Abstract
In this thesis, the prime focus is optimizing the deduplication system by adjusting two key ingredients: declaring chunk cut-points by tuning the pertinent parameters of content-defined chunking (CDC), and performing efficient fingerprint lookup through on-disk secure bucket-based index partitioning. Firstly, a Differential Evolution (DE) based chunking approach, named TTTD-P, is proposed to optimize the Two Thresholds Two Divisors (TTTD) CDC algorithm; it significantly reduces the number of computing operations by replacing TTTD's multiple divisors with a single dynamically optimized divisor D and an optimal threshold value T, exploiting the multi-operation nature of TTTD. To reduce chunk-size variance, the TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut-points. However, the additional divisor D′ decreases chunking throughput, meaning that TTTD aggravates the performance bottleneck of Rabin's CDC.
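The two-divisor cut-point rule described above can be sketched in Python; this is a hypothetical illustration with example threshold and divisor values (T_MIN, T_MAX, D, D′), and a toy rolling hash standing in for TTTD's Rabin fingerprints:

```python
# Illustrative TTTD parameters (not the thesis's tuned values).
T_MIN, T_MAX = 460, 2800     # minimum / maximum chunk size in bytes
D, D_BACKUP = 540, 270       # main divisor D and backup divisor D'

def tttd_chunks(data: bytes):
    """Split data into chunks using the TTTD two-divisor rule."""
    chunks, start = [], 0
    while start < len(data):
        backup = -1
        end = min(start + T_MAX, len(data))
        cut = end                                # fall back to T_MAX (or EOF)
        h = 0
        for i in range(start, end):
            h = (h * 2 + data[i]) & 0xFFFFFFFF   # toy rolling hash
            if i - start < T_MIN:
                continue                         # enforce minimum chunk size
            if h % D_BACKUP == D_BACKUP - 1:
                backup = i + 1                   # remember backup cut-point
            if h % D == D - 1:
                cut = i + 1                      # main divisor fired: cut here
                break
        else:
            if cut == start + T_MAX and backup != -1:
                cut = backup                     # use backup cut near T_MAX
        chunks.append(data[start:cut])
        start = cut
    return chunks
```

Because D = 2·D′ here, every main cut-point also satisfies the backup condition, which mirrors why the backup divisor finds cut-points with higher probability.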
To this end, Asymmetric Extremum (AE) CDC significantly improves chunking throughput while providing comparable deduplication efficiency: it declares cut-points at local extreme values found in a variable-sized asymmetric window, overcoming the chunking problems of Rabin CDC and TTTD.
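A minimal sketch of the AE idea, assuming byte values stand in for the window values of the published algorithm and an illustrative fixed right-window size w: a cut-point is declared once a local maximum has survived w following bytes without being exceeded.

```python
W = 256  # illustrative fixed right-window size

def ae_chunks(data: bytes, w: int = W):
    """Asymmetric Extremum sketch: variable left window, fixed right window."""
    chunks, start = [], 0
    while start < len(data):
        max_val, max_pos = data[start], start
        cut = len(data)                          # default: cut at EOF
        for i in range(start + 1, len(data)):
            if data[i] > max_val:
                max_val, max_pos = data[i], i    # new local extremum
            elif i - max_pos == w:
                cut = i + 1                      # extremum survived w bytes: cut
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks
```

Each byte costs only a comparison, with no per-byte division, which is where AE's throughput advantage over divisor-based schemes comes from.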
After AE, the efficient FastCDC approach was developed using fast gear-based hashing. AE and FastCDC, however, increase chunking throughput only and suffer in deduplication ratio (DR), whereas enhancing storage space by eliminating redundant data maximally is the prime objective of today's big data storage systems, which use Hadoop technology in cloud computing to accommodate massive volumes of data.
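The gear-based hash judgement that FastCDC builds on can be sketched as below; the randomly seeded gear table and mask width are illustrative, and FastCDC's normalized chunking and min/max size bounds are omitted for brevity:

```python
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # 256-entry gear table
MASK = (1 << 13) - 1                                 # ~8 KiB expected chunks

def gear_chunks(data: bytes):
    """One shift + one table lookup per byte; cut when masked bits are zero."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK) == 0:                          # hash judgement: cut here
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks
```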
Secondly, the fingerprint generation stage of data deduplication uses the cryptographic secure hash function SHA-1 to secure big data storage through a key-value store: the key is a chunk's fingerprint and the value points to the data chunk.
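A minimal sketch of this fingerprint-indexed key-value idea, with an in-memory dict standing in for the real chunk store; a repeated chunk is detected by a single fingerprint lookup and stored only once:

```python
import hashlib

class DedupStore:
    """Toy chunk store keyed by SHA-1 fingerprints."""

    def __init__(self):
        self.index = {}                      # fingerprint -> chunk data
        self.logical = self.physical = 0     # bytes before / after dedup

    def put(self, chunk: bytes) -> str:
        fp = hashlib.sha1(chunk).hexdigest()
        self.logical += len(chunk)
        if fp not in self.index:             # new chunk: store it once
            self.index[fp] = chunk
            self.physical += len(chunk)
        return fp                            # file recipe keeps fingerprints only

    def dedup_ratio(self) -> float:
        return self.logical / max(self.physical, 1)
```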
Moreover, deduplication technology also faces the duplicate-lookup disk bottleneck: the complete index of data chunks with their fingerprints is too large to keep in memory and must be stored on disk.
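One way to sketch such bucket-based index partitioning, with in-memory dicts standing in for on-disk bucket files and an assumed fingerprint-prefix mapping: a duplicate lookup then loads one small bucket instead of scanning the whole index.

```python
import hashlib

NUM_BUCKETS = 16  # illustrative bucket count

def bucket_of(fp: str) -> int:
    return int(fp[:2], 16) % NUM_BUCKETS     # fingerprint prefix selects bucket

class BucketedIndex:
    """Toy partitioned fingerprint index; dicts model per-bucket files."""

    def __init__(self):
        self.buckets = [dict() for _ in range(NUM_BUCKETS)]

    def insert(self, chunk: bytes) -> bool:
        """Return True if the chunk was new, False if it is a duplicate."""
        fp = hashlib.sha1(chunk).hexdigest()
        bucket = self.buckets[bucket_of(fp)]
        if fp in bucket:
            return False                     # duplicate found in one bucket
        bucket[fp] = len(chunk)              # stand-in for on-disk chunk location
        return True
```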