Performance enhancement of hash based parallel deduplication model
Abstract
In recent years, a man-made digital universe has been created by millions of devices such as mobile phones, digital cameras, surveillance cameras, and embedded systems, with organizations providing solutions for handling this enormous amount of data. This digital universe is doubling every two years and is expected to reach 44 trillion gigabytes by the year 2020. In order to protect and preserve this voluminous data, backup solutions are provided. However, a large proportion of this data, as much as 75%, consists of duplicates. This leads to the need for data reduction techniques that can optimize storage requirements. Deduplication is an effective data reduction technique that not only removes inter-file and intra-file redundancy but also removes duplicates among the files and file constituents present across various users and even across organizations. A hash-based deduplication system splits the incoming data stream into fragments called chunks. An identity signature, also called a fingerprint, is created for each chunk using a cryptographic hash algorithm. A hash indexing structure is used to store this metadata, the fingerprints. The fingerprint insertion and lookup operations are CPU intensive in nature. Moreover, as the size of the incoming data stream increases, the indexing structure also grows, leading to frequent disk lookups to access the metadata. Hence, maintaining the indexing structure, improving the fingerprint insertion and lookup operations on it, and addressing the disk lookup bottleneck remain open issues in hash-based deduplication.
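The chunk-and-fingerprint pipeline described above can be sketched as follows. This is a minimal illustration, not the thesis's actual model: it assumes fixed-size chunking, SHA-256 as the cryptographic hash, and an in-memory dictionary as the index, whereas the work itself concerns more elaborate indexing structures.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real systems often use content-defined chunking


def deduplicate(stream: bytes, index: dict) -> tuple:
    """Split `stream` into fixed-size chunks, fingerprint each chunk with
    SHA-256, and store only chunks whose fingerprint is not yet in `index`.

    Returns the ordered list of fingerprints (needed to reconstruct the
    stream) and the number of duplicate chunks that were skipped.
    """
    recipe, duplicates = [], 0
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()  # the identity signature (fingerprint)
        if fp in index:        # lookup hit: chunk already stored, skip it
            duplicates += 1
        else:                  # lookup miss: insert the new unique chunk
            index[fp] = chunk
        recipe.append(fp)
    return recipe, duplicates


index = {}
# 4 chunks total; the 2nd and 4th duplicate the 1st
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
recipe, dups = deduplicate(data, index)
print(len(recipe), dups, len(index))  # prints: 4 2 2
```

Even this toy version exhibits the bottleneck the abstract identifies: every chunk triggers a fingerprint computation plus an index lookup or insertion, and the index grows with the number of unique chunks, which is what forces large-scale systems to spill it to disk.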