Department of Computer Science, Jinan University (暨南大学)
Exploiting Fingerprint Prefetching to Improve the Performance of Data Deduplication
Liangshan Song, Yuhui Deng, Junjie Xie

Agenda
• Motivation
• Challenges
• Related work
• Our idea
• System architecture
• Evaluation
• Conclusion

Motivation
• The explosive growth of data
  • Sources: industrial manufacturing, e-commerce, social networks, ...
  • IDC: 1,800 EB of data in 2011, with a 40-60% annual increase
  • YouTube: 72 hours of video are uploaded per minute
  • Facebook: 1 billion active users upload 250 million photos per day
• This growth puts greater pressure on traditional data centers
• Up to 60% of the data stored in backup systems is redundant
Image from http://www.buzzfeed.com

• WAN bandwidth: assume that we want to send 10 TB from U.C. Berkeley to Amazon in Seattle, Washington.
• Garfinkel measured bandwidth to S3 from three sites and found an average write bandwidth of 5 to 18 Mbit/s.
• Suppose we get 20 Mbit/s over a WAN link. The transfer would take:
$$\frac{10 \times 10^{12}\ \text{bytes} \times 8\ \text{bits/byte}}{20 \times 10^{6}\ \text{bits/s}} = 4{,}000{,}000\ \text{s} \approx 46\ \text{days}$$
• Amazon would also charge $1000 in network transfer fees when it receives the data.
• Source: S. Garfinkel. An evaluation of Amazon's grid computing services: EC2, S3 and SQS. Tech. Rep. TR-08-07, Harvard University, August 2007.
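The transfer-time estimate on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes 1 TB = 10^12 bytes and ignores protocol overhead; the function name is illustrative and not from the slides. The byte count is multiplied by 8 to convert it to bits before dividing by the link speed.

```python
def transfer_time_seconds(size_bytes, link_bits_per_second):
    """Time to push `size_bytes` through a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second

seconds = transfer_time_seconds(10e12, 20e6)   # 10 TB over a 20 Mbit/s link
days = seconds / 86_400                        # 86,400 seconds per day
print(f"{seconds:,.0f} s = {days:.1f} days")   # 4,000,000 s = 46.3 days
```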

• Big data storage ⇒ data deduplication
  • To speed up the identification of redundant data chunks, a fingerprint is calculated to represent each data chunk
  • A fingerprint table is used to determine whether a chunk is redundant
  • The fingerprint table grows as the amount of data increases, so some fingerprints have to be stored on disk
  • However, because fingerprints lack locality, they cannot be cached effectively, and fingerprint lookups generate random disk accesses
  • On-disk fingerprint lookup becomes a major overhead in a deduplication system

• Chunking algorithms
  • Fixed-Size Partition (FSP): fast and efficient, but vulnerable to changes in a file
  • Variable-Size Partition (VSP): the CDC (content-defined chunking) algorithm, the SB (sliding block) algorithm, etc.; not vulnerable to changes in a file
• CDC uses the data content within files to locate chunk boundaries, thus avoiding the impact of data shifting.
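The difference between fixed-size and content-defined chunking can be illustrated with a small experiment: insert one byte at the front of a file and count how many chunks survive unchanged. This is a sketch under stated assumptions, not the authors' implementation. The window size, mask, and maximum chunk size are arbitrary, and an MD5 hash of the trailing window stands in for a true rolling Rabin hash.

```python
import hashlib
import random

def fixed_size_chunks(data, size=64):
    """FSP: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, window=16, mask=0x3F, max_size=256):
    """CDC: cut wherever a hash of the last `window` bytes matches a pattern."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < window:
            continue
        # Stand-in for a rolling Rabin hash: hash the trailing window directly.
        digest = hashlib.md5(data[i - window + 1:i + 1]).digest()
        at_boundary = (digest[0] & mask) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared_chunks(chunker, old, new):
    """Number of distinct chunks that both versions have in common."""
    return len(set(chunker(old)) & set(chunker(new)))

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
modified = b"X" + original   # one byte inserted at the front shifts all data

print("FSP shared chunks:", shared_chunks(fixed_size_chunks, original, modified))
print("CDC shared chunks:", shared_chunks(content_defined_chunks, original, modified))
```

With fixed-size chunks every boundary moves by one byte, so the shifted file shares essentially nothing with the original. The content-defined boundaries depend only on the bytes in the window, so they realign shortly after the insertion point.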

• Advantages
  • Saves disk space and bandwidth
  • Higher throughput than traditional data compression methods
  • Reduces other related costs

Challenges
• Throughput: the data must be stored within a limited time window.
• Disk bottlenecks
  • The number of fingerprints grows as the amount of data increases
  • Traditional cache algorithms do not handle fingerprints effectively
  • A low cache hit ratio degrades the performance of data deduplication

Related work
• Bloom filter
  • A summary vector in memory excludes unnecessary lookups in advance and avoids extra disk I/Os
Figure: Bloom filter (summary vector) for searching the fingerprint table
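A Bloom filter used as an in-memory summary vector might look like the following minimal sketch. The class name, bit-array size, and number of hash functions are illustrative assumptions, and the hash positions are derived from salted SHA-1 digests purely for simplicity. A Bloom filter never reports a stored fingerprint as absent, but it can occasionally report an absent one as present.

```python
import hashlib

class BloomFilter:
    """In-memory summary of which fingerprints may already be stored."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity

    def _positions(self, fingerprint):
        # Derive `num_hashes` positions by salting the input with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos] = 1

    def might_contain(self, fingerprint):
        # False means "definitely not stored"; True means "possibly stored".
        return all(self.bits[pos] for pos in self._positions(fingerprint))

summary = BloomFilter()
summary.add(b"fingerprint-of-stored-chunk")
print(summary.might_contain(b"fingerprint-of-stored-chunk"))   # True
```

Only when `might_contain` returns `True` does the system need to consult the on-disk fingerprint table; a `False` answer lets it skip the disk lookup and store the chunk directly.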

• Extreme Binning
  • A hierarchical index policy
Figure: a two-tier chunk index with the primary index in RAM and bins on disk

• LRU-based index partitioning
  • Enforces access locality of fingerprint lookups when storing fingerprints
Figure: LRU-based index partitioning

Our idea: FPP
• A fingerprint prefetching algorithm that leverages file similarity and data locality
  • Requests fingerprints from disk drives in advance
  • Significantly improves the cache hit ratio, enhancing the performance of data deduplication

System Architecture
• Traditional deduplication system architecture

• Chunking module
• Chunking algorithms
  • Fixed-Size Partition (FSP): fast and efficient
  • Variable-Size Partition (VSP): the CDC (content-defined chunking) algorithm, the SB (sliding block) algorithm, etc.; not vulnerable to changes in a file

• Fingerprint generator
  • Calculates a fingerprint for each chunk
  • Fingerprint: short (128 bits), represents a unique chunk, and expedites chunk comparison
  • Hash algorithms: MD5, SHA-1
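Computing a fingerprint with either hash algorithm is a one-line call to Python's standard `hashlib` module. The sketch below uses a hypothetical helper name, `fingerprint`, and shows the digest lengths: MD5 produces 128 bits, while SHA-1 produces 160 bits.

```python
import hashlib

def fingerprint(chunk, algorithm="md5"):
    """Return the raw digest of a chunk for a given hashlib algorithm name."""
    return hashlib.new(algorithm, chunk).digest()

chunk = b"example chunk contents"
print(len(fingerprint(chunk, "md5")) * 8)    # 128
print(len(fingerprint(chunk, "sha1")) * 8)   # 160
```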

• Fingerprint lookup
  • Determines whether the chunk represented by the current fingerprint is a duplicate
  • Two chunks are considered identical if their fingerprints are the same
  • Tends to be time-consuming when the fingerprint table becomes large
• Fingerprint exists → do not store the chunk
• Fingerprint does not exist → store the chunk
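The lookup rule (store a chunk only when its fingerprint is not yet known) can be sketched as follows. An in-memory dictionary stands in for the fingerprint table, which in a real system would be partly on disk; the function and variable names are illustrative, and SHA-1 is used as one of the two hash algorithms the slides mention.

```python
import hashlib

def deduplicate(chunks):
    """Store each unique chunk once; return (fingerprint table, recipe)."""
    table = {}     # fingerprint -> stored chunk (stands in for on-disk storage)
    recipe = []    # fingerprints in stream order, enough to rebuild the data
    for chunk in chunks:
        fp = hashlib.sha1(chunk).digest()
        if fp not in table:       # fingerprint does not exist: store the chunk
            table[fp] = chunk
        recipe.append(fp)         # in both cases, record a reference
    return table, recipe

table, recipe = deduplicate([b"alpha", b"beta", b"alpha"])
print(len(table), len(recipe))    # 2 3
```

The original stream can be rebuilt by following the recipe: `b"".join(table[fp] for fp in recipe)`.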

FPP deduplication system architecture
• Similar File Identification module
  • Identifies similar files, which share most of their chunks and fingerprints
• Fingerprint Prefetching module
  • Accelerates the process of fingerprint lookup
• Sequential Arrangement module
  • Preserves data locality

• Similar file identification
• Target: identify similar files, which share most of their chunks and fingerprints
• Suppose file A is similar to file B, and file B has been stored before: place the fingerprints of file B in RAM before the fingerprint lookup for file A
• Most of the lookups will then succeed in RAM

• Steps:
  • Step 1: extract a group of sample chunks from the target file
  • Step 2: calculate fingerprints for these chunks
  • Step 3: compare the fingerprints; two files are considered similar if the degree of similarity between their fingerprints reaches a certain threshold
Figure: sample chunks
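Steps 2 and 3 can be sketched as below. The slides do not define the "degree of similarity" or the threshold, so both are assumptions here: similarity is taken as the fraction of sample fingerprints shared, relative to the smaller sample set, and the threshold defaults to 0.5. All function names are hypothetical.

```python
import hashlib

def fingerprints(sample_chunks):
    """Step 2: one fingerprint per sample chunk."""
    return {hashlib.sha1(chunk).digest() for chunk in sample_chunks}

def similarity(fps_a, fps_b):
    """Assumed measure: shared fingerprints relative to the smaller set."""
    if not fps_a or not fps_b:
        return 0.0
    return len(fps_a & fps_b) / min(len(fps_a), len(fps_b))

def is_similar(samples_a, samples_b, threshold=0.5):
    """Step 3: compare fingerprints against a threshold."""
    return similarity(fingerprints(samples_a), fingerprints(samples_b)) >= threshold

file_a = [b"x", b"y", b"z", b"w"]
file_b = [b"x", b"y", b"z", b"q"]   # three of four sample chunks in common
print(similarity(fingerprints(file_a), fingerprints(file_b)))   # 0.75
print(is_similar(file_a, file_b))                               # True
```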

• How to sample chunks
  • Step 1: calculate the Rabin fingerprint of the sliding window
  • Step 2: if it meets the predefined condition, stop; otherwise move the sliding window
  • Step 3: if the movement exceeds the upper threshold, stop; otherwise go to Step 1
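The three-step sampling loop can be sketched with a rolling hash. This is an illustration under stated assumptions rather than the authors' code: a simple polynomial rolling hash stands in for a true Rabin fingerprint, the "predefined condition" is assumed to be "low bits of the hash are zero", and the window size, mask, and movement limit are arbitrary. The function name is hypothetical.

```python
def find_sample_boundary(data, start=0, window=8, mask=0x0F, max_move=64,
                         base=257, mod=(1 << 31) - 1):
    """Return the offset just past the window at which the search stops."""
    end = start + window
    if end > len(data):
        return len(data)                  # not enough data for one window
    h = 0
    for byte in data[start:end]:          # Step 1: hash the first window
        h = (h * base + byte) % mod
    top = pow(base, window - 1, mod)      # weight of the byte leaving the window
    moved = 0
    while True:
        if (h & mask) == 0:               # Step 2: condition met, stop
            return end
        if moved >= max_move or end >= len(data):   # Step 3: limit reached
            return end
        # Slide one byte: drop data[end - window], take in data[end].
        h = ((h - data[end - window] * top) * base + data[end]) % mod
        end += 1
        moved += 1

print(find_sample_boundary(bytes(100)))   # 8: an all-zero window hashes to 0
```

Updating the hash incrementally costs constant time per byte, which is why a rolling hash is the usual choice for sliding-window boundary detection.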

• Sequential arrangement
  • Traditional cache algorithms are not effective, because fingerprints generated by cryptographic hash functions are random
  • Fingerprints are stored in the order in which files occur in the data stream

• Fingerprint prefetching
  • Target: accelerate fingerprint lookup by combining file similarity and locality
• Two prefetching schemes:
  • Scheme 1: read all the unique fingerprints of the similar file from disk into the cache
  • Scheme 2: read a portion of the fingerprints from the recently visited location of the fingerprint database into the cache
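The two prefetching schemes can be sketched against a small LRU cache. Everything here is an assumption made for illustration: the class and function names, the use of an LRU policy for the RAM cache, the per-file fingerprint index, and the reading of "a portion of fingerprints from the recently visited location" as "the next `count` entries after the last position read".

```python
from collections import OrderedDict

class FingerprintCache:
    """A small LRU cache of fingerprints, standing in for the RAM cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def add(self, fp):
        self.entries[fp] = True
        self.entries.move_to_end(fp)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used

    def __contains__(self, fp):
        if fp in self.entries:
            self.entries.move_to_end(fp)       # a hit refreshes the entry
            return True
        return False

def prefetch_similar_file(cache, file_index, similar_file):
    """Scheme 1: load every fingerprint recorded for the similar file."""
    for fp in file_index[similar_file]:
        cache.add(fp)

def prefetch_neighbourhood(cache, fingerprint_log, last_position, count):
    """Scheme 2: load `count` fingerprints starting at the last visited position."""
    for fp in fingerprint_log[last_position:last_position + count]:
        cache.add(fp)

cache = FingerprintCache(capacity=4)
prefetch_similar_file(cache, {"fileB": ["f1", "f2", "f3"]}, "fileB")
print("f2" in cache)   # True: the lookup is served from RAM
```

Scheme 2 relies on fingerprints being laid out on disk in stream order, so that neighbouring entries are likely to be needed together.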

Evaluation
• Experiment setup
• Datasets:
  • Dataset1: 78 files (Word documents, PDF documents, PowerPoint presentations, etc.), 1.4 GB
  • Dataset2: 4 virtual machine disk images, 1.8 GB
• Hardware: Intel(R) Core(TM) (dual-core, 3.1 GHz) with 2 GB of memory; hard disk drive (Seagate, 7200 RPM, 2000 GB)

Experiment results
• Overall performance of fingerprint prefetching
  ① Data compression ratio
  ② Cache hit ratio of fingerprint access
  ③ Fingerprint lookup time
  ④ Deduplication time
• Impact of RAM size on fingerprint prefetching

① Data compression ratio
• Dataset1 is compressed from 1.4 GB to 724 MB
• Dataset2 is compressed from 1.8 GB to 1.5 GB
• Result analysis: Dataset1 consists of documents that were revised and stored as multiple versions and copies, whereas the virtual machine disk images contain less redundant data
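The reported sizes translate into space savings as follows. This is a quick check, assuming 1 GB = 1000 MB (the slides do not say which unit convention is used); the function name is illustrative.

```python
def space_saved(before_mb, after_mb):
    """Fraction of the original size removed by deduplication."""
    return 1 - after_mb / before_mb

# Assumption: 1 GB = 1000 MB.
dataset1 = space_saved(1400, 724)
dataset2 = space_saved(1800, 1500)
print(f"Dataset1: {dataset1:.1%} saved")   # Dataset1: 48.3% saved
print(f"Dataset2: {dataset2:.1%} saved")   # Dataset2: 16.7% saved
```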

② Cache hit ratio of fingerprint access
• 50% → 95% for Dataset1
• 15% → 90% for Dataset2
• Fingerprint prefetching improves the cache hit ratio significantly

③ Fingerprint lookup time
• TL: total fingerprint lookup time
• TS: similar file identification time
• TP: fingerprint prefetching time
• TR: fingerprint retrieval time, excluding the time spent on similar file identification and fingerprint prefetching
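The slides do not state the relation between these four quantities explicitly. Since TR is defined as excluding the identification and prefetching time, the natural reading (an assumption here) is that the total lookup time is the sum of the three components:

$$T_L = T_S + T_P + T_R$$

Under this reading, prefetching pays off whenever the reduction in $T_R$ from extra cache hits outweighs the added $T_S + T_P$.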
• The fingerprint prefetching algorithm is more effective for big chunk sizes
• The fingerprint prefetching algorithm is more effective for big files
• The fingerprint prefetching algorithm is more effective for big files with small chunk sizes

④ Deduplication time
• The fingerprint prefetching algorithm is effective for big files rather than small files

• Impact of RAM size on fingerprint prefetching
• Experiment setup:
  • Datasets:
    • N-Dataset1: Dataset1 reduced from 1.4 GB to 270 MB
    • N-Dataset2: Dataset2 reduced from 1.8 GB to 302 MB
  • Hardware: Intel(R) Core(TM) i5-2520M (quad-core, 2.50 GHz) running CentOS 6.3

• The RAM size is varied from 256 MB to 1024 MB
• The fingerprint prefetching algorithm is more effective in the 256 MB case

• Analysis:
  • With limited RAM, prefetching fingerprints can save a large amount of time
  • With limited RAM, the fingerprint prefetching algorithm can effectively alleviate the disk bottleneck of data deduplication
  • For “Big Data”, the fingerprint prefetching algorithm can significantly improve the performance of a deduplication system

Conclusion
• Fingerprint prefetching improves the throughput of data deduplication
  • Helps improve the cache hit ratio
  • Reduces the fingerprint lookup time
  • Achieves a significant performance improvement in deduplication

Future work
• Sample chunks
  • How many chunks to sample
  • How to sample chunks better
• Identify similar files
  • How to identify similar files more accurately

Thanks!
HPCC 2013: The 15th IEEE International Conference on High Performance Computing and Communications, Zhangjiajie, China, November 13-15, 2013