Multi-level Selective Deduplication for VM Snapshots in Cloud Storage


Wei Zhang*, Hong Tang†, Hao Jiang†, Tao Yang*, Xiaogang Li†, Yue Zeng†

* University of California at Santa Barbara   † Aliyun.com Inc.

Motivations

• Virtual machines on the cloud use frequent backup to improve service reliability
– Used in Alibaba's Aliyun, the largest public cloud service in China
• High storage demand
– Daily backup workload: hundreds of TB @ Aliyun
– Number of VMs per cluster: 10,000+
• Large content duplicates
• Limited resources for deduplication
– No special hardware or dedicated machines
– Small CPU & memory footprint

Focus and Related Work

• Previous work
– Version-based incremental snapshot backup: inter-block/VM duplicates are not detected.
– Chunk-based file deduplication: high cost for chunk lookup.
• Focus on
– Parallel backup of a large number of virtual disks.
– Large files for VM disk images.
• Contributions
– Cost-constrained solution with very limited computing resources.
– Multi-level selective duplicate detection and parallel backup.

Requirements

• Negligible impact on existing cloud services and VM performance
– Must minimize CPU and I/O bandwidth consumption for the backup and deduplication workload (e.g., <1% of total resources).
• Fast backup speed
– Compute backups for 10,000+ users within a few hours each day, during light cloud workload.
• Fault tolerance constraint
– The addition of data deduplication should not decrease the degree of fault tolerance.

Design Considerations

• Design alternatives
– An external and dedicated backup storage system.
– A decentralized and co-hosted backup system with full deduplication.

[Diagram: a backup component co-hosted with each cloud service node]

Design Considerations

• Decentralized architecture running on a general-purpose cluster, co-hosting both elastic computing and backup service
• Multi-level deduplication
– Localize backup traffic and exploit data parallelism
– Increase fault tolerance
• Selective deduplication
– Use minimal resources while still removing most redundant content and accomplishing good efficiency

Key Observations

• Inner-VM data characteristics
– Exploit unchanged data to localize deduplication
• Cross-VM data characteristics
– Small common data dominates the duplicates
– Zipf-like distribution of VM OS/user data
– Separate consideration of OS and user data

VM Snapshot Representation

• Data blocks are variable-sized
• Segments are fixed-sized
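
The slides do not show the on-disk layout, but a minimal sketch of this representation (Python; the class names, fields, and the SHA-1 signature choice are our assumptions, not the paper's) could look like:

import hashlib
from dataclasses import dataclass, field

SEGMENT_SIZE = 2 * 2**20  # fixed-size 2 MB segments, as in the evaluation setup

@dataclass
class BlockRecord:
    # One variable-sized block inside a segment (avg. ~4 KB in the evaluation).
    signature: bytes  # content fingerprint used for duplicate comparison
    size: int
    location: str     # reference to where the block's data is stored

@dataclass
class SegmentRecipe:
    # Metadata ("recipe") for one fixed-size segment of a virtual disk.
    segment_index: int
    blocks: list = field(default_factory=list)

@dataclass
class SnapshotRecipe:
    # A snapshot is the list of recipes for all segments of the disk.
    vm_id: str
    snapshot_id: int
    segments: list = field(default_factory=list)

def block_signature(data: bytes) -> bytes:
    # The slides do not name the hash; SHA-1 here is an assumption.
    return hashlib.sha1(data).digest()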

Processing Flow of Multi-level Deduplication

Data Processing Steps

• Segment-level checkup: use the dirty bitmap to see which segments are modified.
• Block-level checkup: divide a segment into variable-sized blocks, and compare their signatures with the parent snapshot.
• Checkup against the common data set (CDS): identify duplicate chunks from the CDS.
• Write new snapshot blocks: write new content chunks to storage.
• Save recipes: save segment metadata information.
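
A condensed sketch of this per-segment flow (Python; helper names such as dirty_bitmap, chunk, cds_lookup, and write_block are hypothetical stand-ins for interfaces the slides do not specify):

import hashlib

def backup_segment(seg_idx, dirty_bitmap, read_segment, chunk,
                   parent_sigs, cds_lookup, write_block):
    # Level 1: segment-level checkup. Skip segments the dirty bitmap
    # marks as unmodified since the parent snapshot.
    if not dirty_bitmap[seg_idx]:
        return "reuse the parent snapshot's recipe for this segment"

    recipe = []
    # Divide the modified segment into variable-sized blocks.
    for block in chunk(read_segment(seg_idx)):
        sig = hashlib.sha1(block).digest()  # assumed fingerprint choice
        if sig in parent_sigs:
            # Level 2: block-level checkup against the parent snapshot.
            recipe.append(("parent", sig))
        elif cds_lookup(sig):
            # Level 3: checkup against the common data set (CDS).
            recipe.append(("cds", sig))
        else:
            # New content: write the block and record its location.
            recipe.append(("stored", sig, write_block(block)))
    return recipe  # saved as this segment's metadata ("recipe")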

Architecture of Multi-level VM snapshot backup

[Diagram: components within a cluster node]

Status & Evaluation

• Prototype system running on Alibaba's Aliyun cloud, based on Xen.
– 100 nodes; each has 16 cores, 48 GB memory, and 25 VMs.
– Uses <150 MB per machine for backup & deduplication.
• Evaluation data from Aliyun's production cluster: 41 TB.
– 10 snapshots per VM. Segment size: 2 MB. Avg. block size: 4 KB.

Data Characteristics of the Benchmark

• Each VM uses 40 GB of storage space on average.
• OS and user data disks: each takes ~50% of the space.
• OS data
– 7 mainstream OS releases: Debian, Ubuntu, Red Hat, CentOS, Win2003 32-bit, Win2003 64-bit, and Win2008 64-bit.
• User data
– From 1,323 VM users.

Impacts of 3-Level Deduplication

• Level 1: Segment-level detection within a VM
• Level 2: Block-level detection within a VM
• Level 3: Common data block detection across VMs

Impact for Different OS Releases

Separate consideration of OS and user data

• Both have a Zipf-like data distribution
• But popularity growth differs as the cluster size / number of VM users increases

Commonality among OS releases

1 GB of common OS metadata covers 70+%

Cumulative coverage of popular user data

Coverage is the summation of covered data block size × frequency
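
Spelled out as a formula (the notation is ours, not from the slides), with user-data blocks ranked by popularity, s_i the size of the i-th most popular block, and f_i its occurrence count across snapshots, the cumulative coverage of the top k blocks is:

\mathrm{Coverage}(k) = \sum_{i=1}^{k} s_i \, f_i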

Space saving compared to perfect deduplication as CDS size increases

A 100 GB CDS (with a 1 GB index) achieves 75% of perfect deduplication
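
A back-of-envelope consistency check of these numbers (our arithmetic; the 4 KB figure is the average block size from the evaluation setup, and the per-entry layout is an assumption):

# Size the CDS index: 100 GB of unique blocks at a 4 KB average block size.
cds_bytes = 100 * 2**30              # 100 GB common data set
avg_block = 4 * 2**10                # 4 KB average block size
entries = cds_bytes // avg_block     # ~26.2 million index entries

index_bytes = 1 * 2**30              # 1 GB index, as stated above
print(entries / 1e6, index_bytes / entries)
# ~26.2M entries at ~41 bytes each: room for a 20-byte SHA-1
# fingerprint plus a block reference (an assumed layout).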

Impact of dataset-size increase

Conclusions

• Contributions: a multi-level selective deduplication scheme among VM snapshots
– Inner-VM deduplication localizes backup and exposes more parallelism.
– Global deduplication with a small common data set that appears in OS and data disks.
– Uses less than 0.5% of memory per node to meet a stringent cloud resource requirement, while accomplishing 75% of what perfect deduplication does.

• Experiments
– Achieve 500 TB/hour on a 1,000-node cloud cluster.
– Reduce backup bandwidth by 92%, to 40 TB/hour.