
Designing a Storage Infrastructure for Scalable Cloud Services

Byung Chul Tak⋆, Chunqiang Tang⋆⋆, and Rong N. Chang⋆⋆

⋆ Pennsylvania State University - University Park, PA
⋆⋆ IBM T.J. Watson Research Center, Hawthorne, NY

[email protected],{ctang,rong}@us.ibm.com

Abstract. In IaaS (Infrastructure-as-a-Service) clouds, the storage needs of VM (Virtual Machine) instances are met through virtual disks (i.e., virtual block devices). However, it is nontrivial to provide virtual disks to VMs in an efficient and scalable way, for a couple of reasons. First, a VM host may be required to provide virtual disks for a large number of VMs. It is difficult to ascertain the largest possible storage demand and physically provision for it in the host machine. On the other hand, if the storage space for virtual disks is provided through remote storage servers, the aggregate network traffic due to storage accesses from VMs can easily deplete the network bandwidth and cause congestion. We propose a system, vStore, which overcomes these issues by using the host's limited local disk space as a block-level cache for the remote storage in order to absorb the network traffic generated by storage accesses. This allows the VMM (Virtual Machine Monitor) to serve VMs' disk I/O requests from the host's local disks most of the time, while providing the illusion of much larger storage space for creating new virtual disks. Caching is a well-studied topic in many different contexts, but caching virtual disks at the block level poses special challenges in achieving high performance while maintaining virtual disk semantics. First, after a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. Second, as disk I/O performance is dominated by disk seek times, it is important to keep a virtual disk as sequential as possible in the limited cache space. Third, the destaging operation that sends dirty blocks back to the remote storage server should be self-adaptive and minimize the impact on foreground traffic. We propose techniques to address these challenges and implemented them in Xen. Our evaluation shows that vStore provides the illusion of unlimited storage space, significantly reduces network traffic, and incurs a low disk I/O performance overhead.

Key words: Virtual machine, Virtual disk, Cloud storage

1 Introduction

Cloud Computing has recently drawn a lot of attention from both industry and the research community. The success of Amazon's Elastic Compute Cloud (EC2) service [2] has demonstrated the practicality of the concept and its potential as the next paradigm for enterprise-level services computing.


There are also other competitors providing different types of Cloud services, such as Microsoft's Azure [18] and Google's AppEngine [8]. The research community has also recognized the importance of Cloud computing and started to build several prototype Cloud computing systems [6, 22] for research purposes.

In this paper, we design a scalable architecture that provides reliable virtual disks (i.e., block devices as opposed to object stores) for virtual machines (VMs) in a Cloud environment. Much of the challenge arises from the scale of the problem. On the one hand, a host potentially needs to provide virtual disks as virtual block devices to a large number of virtual machines running on that host, which would incur a high cost if every host's local storage space were over-provisioned for the largest possible demand. On the other hand, if the storage space for virtual disks is provided through network attached storage (NAS), it may cause congestion at the network and/or the storage servers in a large-scale Cloud environment.

We propose a system, called vStore, which addresses these issues by using the host's limited local disks as a block-level cache for the network attached storage. This cache allows the hypervisor to serve VMs' disk I/O requests using the host's local disks most of the time, while providing the illusion of unlimited storage space for creating new virtual disks. Caching is a well-studied topic in many different contexts, but it poses special challenges in achieving high performance while maintaining virtual disk semantics.

First, the block-level cache must preserve data integrity in the event of host crashes. After a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. This requires that cache handling operations always keep the on-disk metadata and data consistent, to avoid committing incorrect data to the network attached storage during recovery from a crash. A naive solution could easily introduce high overheads in updating on-disk metadata. Second, unlike memory-based caching schemes, the performance of an on-disk cache is highly sensitive to data layout. It requires a cache placement policy that maintains a high degree of data sequentiality in the cache, as in the original (i.e., remote) virtual disk. Third, the destaging operation that sends dirty blocks back to network attached storage should be self-adaptive and minimize the impact on foreground I/O operations.

To our knowledge, this is the first study that systematically addresses the issues in caching virtual disks at the block level. The log-structured file system [23] and DCD [13] use a local disk as a write log buffer rather than as a generic cache that can handle normal read/write operations. Dm-cache [26] is the closest to vStore. It uses one block device to cache another block device, but may corrupt data in the event of host crashes, because the metadata and data are not always kept consistent on disk. FS-Cache [12] in the Linux kernel implements caching at the file system level instead of at the block device level, and it also may corrupt data in the event of host crashes, for the same reason. Network file systems such as NFS [25], AFS [15], Sprite [19], and Coda [14] employ disk caching at the file level. If the hypervisor uses these file systems as remote storage, virtual disk images have to exist in the remote file system as large files. Since caching is done at file granularity (or partial-file granularity), file-system-level caching may end up storing all or a large part of an entire virtual disk image, which local disks may not accommodate due to limited space.


Fig. 1. Our model of a scalable cloud storage architecture with vStore: VM-hosting machines (each running a VMM and multiple VMs) connected through the networking infrastructure to storage server clusters and a directory server.

We have implemented vStore in the Xen environment. Our evaluation shows that vStore can handle various workloads effectively. It provides the illusion of unlimited storage, significantly reduces network traffic, and incurs a low disk I/O performance overhead.

The rest of this paper is organized as follows. In Section 2, we give general background and motivate the work. Section 3 presents the design of vStore. Section 4 describes vStore's implementation details. In Section 5, we empirically evaluate various aspects of vStore. We discuss related work in Section 6, and state concluding remarks in Section 7.

2 Cloud Storage Architecture

Fig. 1 shows the architectural model of a scalable Cloud storage system. It consists of the following components:

• VM-hosting machines: A physical machine hosts a large number of VMs and has limited local storage space. vStore uses the local storage as a block-level cache and provides to VMs the illusion of unlimited storage space. The hypervisor provides a virtual block device to each VM, which means that VMs see raw block devices and are free to install any file system on top of them. Thus, the hypervisor receives block-level requests and has to redirect them to the remote storage.

• Storage Server Cluster: Storage server clusters provide network attached storage to the physical machines. They can be either dedicated high-performance storage servers or a cluster of servers using commodity storage devices. The interface to the hypervisors can be either block-level or file-level. If it is block-level, an iSCSI-type protocol is used between storage servers and clients (i.e., hypervisors). If it is file-level, the hypervisor mounts a remote directory structure and keeps the virtual disks as individual files. Note that regardless of the protocol between hypervisors and storage servers, the interface between VMs and the hypervisor remains at the block level.

• Directory Server: The directory server holds the location information about the storage server clusters. When a hypervisor wants to connect to a specific storage server, it consults the directory server to determine the storage server's address.


• Networking Infrastructure: Network bandwidth within a rack is usually well-provisioned, but the cross-rack network is typically under-provisioned by a factor of 5 to 10 compared with the within-rack network [4].

2.1 Motivations

There are multiple options when designing a storage system for a Cloud. One solution is to use only local storage. In a Cloud, VMs may use different amounts of storage space, depending on how much the user pays. If every host's local storage space were over-provisioned for the largest possible demand, the cost would be prohibitive. Another solution is to use only network attached storage. That is, a VM's root file system, swap area, and additional data disks are all stored on network attached storage. This solution, however, would incur an excessive amount of network traffic and disk I/O load on the storage servers.

Sequential disk access can achieve a data rate of 100 MB/s. Even with pure random access, a disk can reach 10 MB/s. Since a 1 Gbps network can sustain roughly 13 MB/s, four uplinks to the rack-level switch are not enough to handle even a single sequential access. Note that uplinks to the rack-level network switches are limited in number and cannot be easily increased in commodity systems [4]. Even for random disk access, the uplinks can only support about five VMs' disk I/O traffic. Even with 10 Gbps networks, they can hardly support the thousands of VMs running in one rack (e.g., typical numbers are 42 hosts per rack and 32 VMs per host, i.e., 1,344 VMs per rack). The seriousness of the bandwidth limitation caused by the hierarchical networking structure of data centers has been recognized, and there are continuing efforts to resolve this issue through network-related enhancements [1, 10]. However, we believe it is also necessary to address the issue from a systems perspective using techniques such as vStore.
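The arithmetic behind these claims can be made explicit. The following sketch only restates the figures quoted above (per-uplink throughput, per-disk rates, hosts and VMs per rack); it is an illustration of the back-of-envelope reasoning, not a measurement:

```python
# Back-of-envelope check using the figures quoted in the text above.
SEQ_MB_S = 100      # one sequential disk stream
RAND_MB_S = 10      # one purely random disk stream
UPLINK_MB_S = 13    # usable throughput per uplink, as estimated above
NUM_UPLINKS = 4

uplink_capacity = NUM_UPLINKS * UPLINK_MB_S   # 52 MB/s in total
print(uplink_capacity < SEQ_MB_S)             # True: cannot carry one sequential stream
print(uplink_capacity // RAND_MB_S)           # 5: roughly five VMs of random I/O

# Scale of contention in one rack: 42 hosts x 32 VMs share those uplinks.
print(42 * 32)                                # 1344
```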

2.2 Our Solution

vStore takes a hybrid approach that leverages both local storage and network attached storage. It still relies on network attached storage to provide sufficient storage space for VMs, but utilizes the local storage of a host to cache data and avoid accessing network attached storage as much as possible. Consider the case of Amazon EC2, where a VM is given one 10GB virtual disk to store its root file system and another 160GB virtual disk to store data. The root disk can be stored on local storage due to its small size. The large data disk can be stored on network attached storage and accessed through the vStore cache.

Data integrity and performance are the two main challenges in the design of vStore. After a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. In vStore, system failures can compromise data integrity in several ways. If the host crashes while vStore is in the middle of updating either the metadata or the data, and there is no mechanism for detecting the inconsistency between the metadata and the data, then after the host restarts, incorrect data may remain in the cache and be written back to the network attached storage. Another way data integrity may be compromised is through violating the semantics of writes: if data is buffered in memory and not flushed to disk after reporting write completion to the VM, a system crash will cause data loss.


This observation about write semantics constrains the design of vStore and also affects its overall performance.

The second challenge is to achieve high performance, which conflicts with ensuring data integrity and hence requires a careful design to minimize performance penalties. The performance of vStore is affected by several factors: (i) data placement within the cache, (ii) vStore metadata placement on disk, and (iii) the complications introduced by the vStore logic. For (i), if sequential blocks in a virtual disk are placed far apart in the cache, a sequential read of these blocks incurs a high overhead due to long disk seek times. Therefore, it is important to keep a virtual disk as sequential as possible in the limited cache space. For (ii), ideally, on-disk metadata should be small and should not require an additional disk seek to access data and metadata separately. For (iii), one potential overhead is the dependency among outstanding requests. For example, if one request is about to evict a cache entry, then all the requests on that entry must wait. All of these factors shaped the design of vStore.

3 System Design

3.1 System Architecture

Fig. 2. The architecture of vStore: block requests from the VM enter the vStore module, which consists of a wait queue, cache handling logic, and in-memory metadata, and which issues I/O to the cache space on local storage or to the remote storage.

The architecture of vStore is shown in Fig. 2. Our discussion is based on para-virtualized Xen [3]. VMs generate block requests in the form of (sector address, sector count). Requests arrive at the front-end device driver within the VM after passing through the guest kernel. They are then forwarded to the back-end driver in Domain-0. The back-end driver issues the actual I/O requests to the device and sends responses to the guest VM along the reverse path.

The vStore module runs in Domain-0 and extends the function of the back-end device driver. vStore intercepts requests and filters them through its cache handling logic. As shown in Fig. 2, vStore internally consists of a wait queue for incoming requests, the cache handling logic, and in-memory metadata. Incoming requests are first put into vStore's wait queue. The wait queue is necessary because the cache entry that a request needs to use might be under eviction or update triggered by previous requests. After such conflicts are cleared, the request is handled by the cache handling logic. The in-memory metadata are consulted to obtain information such as the block address, dirty bit, and modification time. Finally, depending on the current cache state, actual I/O requests are made either to the cache on local storage or to the network attached storage.

I/O Unit: Guest VMs usually operate on 4KB blocks, but vStore can perform I/O to and from the network attached storage at a configurable larger unit. A large I/O unit reduces the size of the in-memory metadata, as it reduces the number of cache entries to manage. Moreover, a large I/O unit works well with high-end storage servers, which are optimized for large I/O sizes (e.g., 256 KB or even 1 MB). Thus, reading a large unit is as efficient as reading 4KB.


Field            Size        Description
Virtual Disk ID  2 bytes     ID assigned by vStore to uniquely identify a virtual disk. An ID is unique only within an individual hypervisor.
Sector Address   4 bytes     Cache entry's remote address, in units of sectors.
Dirty Bit        1 bit       Set if the cache content is modified.
Valid Bit        1 bit       Set if the cache entry is in use.
Lock Bit         1 bit       Set if the entry is under modification by a request.
Read Count       2 bytes     Number of read accesses within a time unit.
Write Count      2 bytes     Number of write accesses within a time unit.
Bit Vector       variable    Each bit represents one 4KB block within the block group; set if the corresponding 4KB is valid. The size is (block group size)/4KB bits.
Access Time      8 bytes     Most recent access time.
Total Size       < 23 bytes

Table 1. vStore metadata.

This may increase the incoming network traffic, but our evaluation shows that the subsequent savings outweigh the initial cost (as shown in Fig. 8(b) of Section 5). We use the term block group to refer to the I/O unit used by vStore, as opposed to the (typically 4KB) block used by the guest VMs. That is, one block group contains one or more 4KB blocks.

Metadata: The metadata holds information about the cache entries on disk. Metadata is stored on disk for data integrity and cached in memory for performance. Metadata updates are done in a write-through manner. After a host crashes and recovers, vStore visits each metadata entry on disk and recovers any dirty data that has not been flushed to the network attached storage. Table 1 summarizes the metadata fields.

Virtual Disk ID identifies a virtual disk stored on network attached storage. When a virtual disk is detached and reconnected later, cached contents that belong to this disk are identified and reused. Bit Vector identifies the valid 4KB blocks within a block group. Without the Bit Vector, on a 4KB write, vStore has to read the corresponding block group from network attached storage, merge it with the new 4KB data, and write the entire block group to the cache. With the Bit Vector, it can write the 4KB data directly without fetching the entire block group. Our experiments show that the Bit Vector helps significantly reduce network traffic when a large cache unit size is used.
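A minimal sketch of this write path follows, assuming illustrative names for the cache and remote-storage interfaces (handle_4k_write, read_group, write_block, and the entry fields are not taken from the paper's code):

```python
BLOCK = 4096  # guest I/O unit in bytes

def handle_4k_write(entry, offset_in_group, data, cache, remote):
    """Write one 4KB block into a cached block group.

    'entry' mirrors the metadata of Table 1; entry.bit_vector has one bit per
    4KB block of the group. With the bit vector, only the written block needs
    to be stored and marked valid; without it, the whole block group must be
    fetched from remote storage first so that the unwritten parts of the group
    are not left undefined in the cache.
    """
    idx = offset_in_group // BLOCK
    if entry.bit_vector is None:                          # no bit vector kept
        group = remote.read_group(entry.sector_address)   # extra network fetch
        group[offset_in_group:offset_in_group + BLOCK] = data
        cache.write_group(entry, group)
    else:
        cache.write_block(entry, idx, data)               # write just this 4KB block
        entry.bit_vector[idx] = 1                         # mark the 4KB as valid
    entry.dirty = True
```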

Maintaining the metadata on disk may cause poor performance. A naive implementation may require two disk accesses to handle one write request issued by a VM: one for the metadata update and one for writing the actual data. vStore solves this problem by putting the metadata and data together and updating them in a single write. The details are described below under Cache Structure.

In-memory Metadata: To avoid disk I/O for reading the on-disk metadata, vStore maintains a complete copy of the metadata in memory and updates it in a write-through manner. We use a large block group size (e.g., 256KB) to reduce the size of the in-memory metadata.
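To see why a large block group keeps the in-memory copy small, the following sketch estimates the footprint from the roughly 23-byte entry of Table 1 plus the per-group bit vector (one bit per 4KB block); the 40 GB cache size used in the example is an illustration, matching the local disk capacity used later in the evaluation:

```python
def metadata_footprint(cache_bytes, block_group_bytes, entry_bytes=23):
    """Approximate in-memory metadata size for a given cache and block group size.

    Each cache entry costs roughly entry_bytes (Table 1) plus one bit per 4KB
    block of the group for the bit vector.
    """
    entries = cache_bytes // block_group_bytes
    bitvec_bytes = (block_group_bytes // 4096) / 8   # bits -> bytes per entry
    return entries * (entry_bytes + bitvec_bytes)

# Example: a 40 GB local cache with 256 KB block groups needs only a few MB.
print(metadata_footprint(40 * 2**30, 256 * 2**10) / 2**20)   # about 4.8 (MB)
```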


Cache Structure: vStore organizes the local storage as a set-associative cache with a write-back policy by default. We describe the cache as a table-like structure, where a cache set is a column in the table and a cache row is a row in the table. A cache row consists of multiple block groups, each of which may hold contents coming from different virtual disks. Block groups in the same cache row are laid out in logically contiguous disk blocks.

Fig. 3. Structure of one cache entry. The block group consists of n 4KB blocks, and each 4KB block is followed by a 512-byte trailer (metadata and a hash of the 4KB data block); the total size is n·(4096+512) bytes.

Each 4KB block in a block group has a 512-byte trailer, as shown in Fig. 3. This trailer contains metadata and the hash value of the 4KB data block. On a write operation, vStore computes the hash of the 4KB block, and writes the 4KB block and its 512-byte trailer in a single write operation. If the host crashes during the write operation, then after recovery the hash value helps detect that the 4KB block and the trailer are inconsistent. The 4KB block can be safely discarded, because the completion of the write operation has not yet been acknowledged to the VM.

When handling a read request, vStore also reads the 512-byte trailer together with the 4KB block. As a result, a sequential read of two adjacent blocks issued by the VM is also sequential in the cache. If only the 4KB data block were read without the trailer, the sequential request would be broken into two sub-requests, spaced 512 bytes apart, which adversely affects performance. Since the size of a 512-byte trailer is 12.5% of the size of a 4KB block, the theoretical overhead of vStore is around 12.5%. Many of our design and implementation efforts have been directed at achieving this theoretical bound.
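A minimal sketch of the trailer scheme described above. The field order inside the trailer and the choice of hash function are assumptions (the paper only states that the trailer holds metadata and a hash of the 4KB block):

```python
import hashlib

BLOCK, TRAILER = 4096, 512

def pack_unit(data: bytes, meta: bytes) -> bytes:
    """Build one on-disk cache unit: 4KB data followed by its 512-byte trailer.

    The trailer stores a hash of the data plus per-block metadata, so data and
    metadata can be persisted in a single sequential write.
    """
    assert len(data) == BLOCK and len(meta) <= TRAILER - 20
    digest = hashlib.sha1(data).digest()                 # 20 bytes (illustrative choice)
    return data + digest + meta.ljust(TRAILER - 20, b"\0")

def is_consistent(unit: bytes) -> bool:
    """Crash-recovery check: does the trailer's hash match the 4KB data?

    A mismatch means the write was torn by the crash; the block is discarded,
    which is safe because its completion was never acknowledged to the VM.
    """
    data, digest = unit[:BLOCK], unit[BLOCK:BLOCK + 20]
    return hashlib.sha1(data).digest() == digest
```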

Cache Replacement: Simple policies like LRU and LFU are not suitable for vStore, because they are designed primarily for memory-based caches without consideration of block sequentiality on disk. If two consecutive blocks of a virtual disk are placed at two random locations in vStore's cache, sequential I/O requests issued by the VM become random accesses on the physical disk. vStore's cache replacement algorithm strives to preserve the sequentiality of a virtual disk's blocks.

Below, we describe vStore's cache replacement algorithm in detail. We introduce the concept of the base cache row of a virtual disk. The base cache row is the default cache row on which the first row of blocks of a virtual disk is placed. Subsequent blocks of the virtual disk map to the subsequent cache rows. For example, if two virtual disks Disk1 and Disk2 are currently attached to vStore and the cache associativity is 5 (i.e., there are 5 cache rows), then Disk1 might be assigned 1 as its base cache row and Disk2 assigned 3, to keep them reasonably far apart. If we assume one cache row is made of ten 128KB block groups, Disk2's block at address 1280K will be mapped to row 4, which is the next row after Disk2's base cache row.
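The mapping from a block address to a cache row can be written down directly; the following sketch simply encodes the example above (function and parameter names are illustrative, not from the paper):

```python
def cache_row(base_row, byte_addr, num_rows, groups_per_row=10,
              group_size=128 * 1024):
    """Map a virtual-disk byte address to a cache row.

    Successive rows of a virtual disk map to successive cache rows, starting
    at the disk's base cache row and wrapping around the cache.
    """
    row_span = groups_per_row * group_size          # bytes covered by one cache row
    return (base_row + byte_addr // row_span) % num_rows

# The example from the text: Disk2 has base cache row 3, the cache has 5 rows,
# and one row holds ten 128KB block groups; the block at 1280K lands on row 4.
print(cache_row(base_row=3, byte_addr=1280 * 1024, num_rows=5))   # 4
```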

Upon the arrival of a new data block, vStore determines the cache location in two steps. First, it looks at the state of the cache entry whose location is calculated from the base cache row and the block's address. If that entry is invalid or not dirty, the new block is immediately assigned to it.


Fig. 4. Flow diagrams of cache handling steps: (a) READ request handling and (b) WRITE request handling. Y is the address of the cache entry that is selected to be evicted. Local read/write means reading or writing the vStore cache; remote read/write means reading or writing the network attached storage.

If the entry is dirty, a victim entry is selected based on scores. Six criteria are used to calculate the score.

• Recentness - The more recently the entry was accessed, the higher the score.

• Prior Sequentiality - How sequential the cache entry currently is with respect to adjacent cache entries. If the cache entry is already sequential, we prefer to keep it.

• Prior Distance - How far away the cache entry is from its default base cache row. If the entry is located in cache row 2 and the default base cache row of its virtual disk is 1, then the value is 2 − 1 = 1.

• Posterior Sequentiality - How sequential the entry would become if we cached the new block. If it would become sequential, we prefer this cache entry as a victim.

• Posterior Distance - How far from the default base cache row the new block would be if we cached it here. If this distance is large, the entry is less preferable as a victim.

• Dirtiness - If the cache entry is modified, we would like to avoid evicting it as much as possible.

Let x_i be each of the six criteria described above. The score is computed as

S = a_0 · x_0 + a_1 · x_1 + ... + a_5 · x_5   (1)

Here the coefficient a_i represents the weight of each criterion. If all a_i are 0 except for the one weighting recentness, the eviction policy becomes equivalent to LRU. The weight coefficients are adjustable according to preference. The score is computed for every cache entry within the cache set, and the entry with the lowest score is chosen for eviction.
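A small sketch of this scoring, assuming one possible normalization and sign convention for the six criteria (the paper defines them only qualitatively; the signs could equally be folded into the weights a_i):

```python
def victim_score(entry, weights):
    """Weighted score of Eq. (1); the entry with the LOWEST score in the
    cache set is evicted.

    The x_i below follow the six criteria listed above; the signs encode
    which direction makes an entry more worth keeping (higher score).
    """
    x = [
        entry.recentness,                # recently accessed -> keep
        entry.prior_sequentiality,       # already sequential with neighbors -> keep
        -entry.prior_distance,           # currently far from its base row -> more evictable
        -entry.posterior_sequentiality,  # new block would become sequential -> prefer as victim
        entry.posterior_distance,        # new block would land far away -> keep current entry
        entry.dirtiness,                 # dirty entries are expensive to evict
    ]
    return sum(a * xi for a, xi in zip(weights, x))

def pick_victim(cache_set, weights):
    """Choose the eviction victim within one cache set."""
    return min(cache_set, key=lambda e: victim_score(e, weights))
```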

Cache Handling Operations: There are three cases in cache handling: cache hit, miss without flush, and miss with flush. The first issue in designing the cache handling operations is performance. Since vStore uses a disk as its cache space, cache handling requires more disk accesses than when no cache is used.


Excessive disk accesses would degrade overall performance and reduce the merit of using vStore. Although we cannot achieve the same I/O performance as when vStore is not used, we want to make the performance loss tolerable by minimizing disk accesses. The second issue is data integrity. As described, our design choice is to add a 512-byte trailer to each 4KB block to record its hash, and, in order to minimize disk I/O, we always read and write the trailer together with the block. This only increases the data size; it does not increase the number of I/Os. However, for cache miss handling, it is inevitable to introduce additional disk I/O for data integrity. In general, this consistency requirement complicates the overall cache handling, and there is a trade-off between maintaining consistency and the performance penalty of additional disk I/O.

READ Handling: The biggest difference of read handling (Fig. 4(a)) from write handling is that vStore can return the data as soon as it is available and continue the rest of the cache operations in the background. This is reflected in the miss handling operations: the remote read is initiated first, and as soon as vStore finishes reading the requested block, it returns with the data. The on-disk metadata update and the cache data write are performed afterwards.

WRITE Handling: For write requests, vStore writes the data directly to the cache without accessing the network attached storage. This simplifies the cache hit and cache miss without flush cases. However, write handling for a cache miss with flush is expensive because it has to make several I/O requests. A constraint from the semantics of writes prevents vStore from returning early as reads do: returning before the metadata and data are persisted risks metadata inconsistency on a system crash. Fig. 4(b) shows that write handling can return only at the end of the entire operation sequence. In the worst case, write handling incurs at most four disk I/Os, which occurs for a cache miss with flush.
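A condensed sketch of the two flows in Fig. 4, under the same illustrative interfaces as the earlier sketches (cache, remote, and background are hypothetical helpers, not the paper's code):

```python
def handle_read(req, cache, remote, background):
    """READ (Fig. 4a): on a miss, reply as soon as the remote read completes;
    installing the data into the cache (possibly after flushing a dirty
    victim) continues in the background."""
    entry = cache.lookup(req.disk_id, req.addr)
    if entry is not None and entry.covers(req.addr):          # cache hit
        return cache.read(entry, req.addr)
    data = remote.read(req.disk_id, req.addr)                 # cache miss
    background(lambda: cache.install(req.disk_id, req.addr, data))
    return data

def handle_write(req, cache, remote):
    """WRITE (Fig. 4b): may be acknowledged only after data and metadata are
    persisted in the cache; a miss that evicts a dirty entry (miss with
    flush) is the worst case."""
    entry = cache.lookup(req.disk_id, req.addr)
    if entry is None:                                         # cache miss
        victim = cache.entry_for(req.disk_id, req.addr)       # chosen by Eq. (1) scoring
        if victim.valid and victim.dirty:                     # miss with flush
            remote.write(victim.disk_id, victim.addr, cache.read_group(victim))
        entry = cache.reassign(victim, req.disk_id, req.addr)
    cache.write(entry, req.addr, req.data)    # 4KB data + 512B trailer in one write
    return "completed"                        # acknowledged only now
```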

3.2 Destaging

Destaging refers to the process of flushing dirty data in the cache to the network attached storage. We introduce the destaging functionality for the following reasons.

• We want to keep the proportion of dirty blocks under a specified level. A large number of dirty blocks is potentially harmful to performance, because evicting a dirty cache entry delays the cache handling operations significantly due to the flushing operations.

• Detachment of a virtual disk is faster when there are fewer dirty blocks. If a VM wants to terminate or migrate, it has to detach its virtual disk. As part of the detachment process, all the dirty blocks belonging to the detaching disk have to be flushed. Without destaging, the amount of data that has to be transferred can be as large as several gigabytes; transferring that amount of data takes time and also generates bursty traffic.

Mechanism Design: Destaging is triggered when the number of dirty blocks in the cache exceeds a user-specified level, which we call the pollution level.


If the pollution level is set to 65%, it means that the user wants to keep the ratio of dirty blocks to total blocks below 65%.

Upon destaging, vStore has to determine how many blocks to destage at a given time t. The basic idea is to maintain a window size w_t, which indicates the total allowed data transmission size in bytes per millisecond. This window size is the combined data transmission size for both normal remote storage accesses and destaging. It is specified as a rate (bytes per millisecond) since destaging actions can fire at irregular intervals. Intuitively, if w_t increases, it is more likely that normal network attached storage accesses will leave more bandwidth available for destaging.

The control technique for w_t in vStore adopts the technique used for flow control in FAST TCP [27] and for queue length adjustment in PARDA [9]. Let us first focus on adjusting w_t using the network attached storage latency. Let R be the desired network attached storage latency, and let R_t be the exponentially weighted moving average of the observed network attached storage latency, expressed as R_t = (1 − α)R + αR_{t−1}, where α is a smoothing factor. We calculate w_t using

w_t = (1 − γ) w_{t−1} + γ (R / R_t) w_{t−1}   (2)

where γ is another smoothing factor for w_t. If the observed remote latency is smaller than R, then w_t will increase, and vice versa. In vStore, we also consider the local latency, captured by v_t. If we let L_t be the latency of the local disk, we calculate v_t as v_t = (1 − γ) v_{t−1} + γ (L / L_t) v_{t−1}, where L is the desired local disk latency (analogous to R). We take the minimum of w_t and v_t as the window size. Next, we calculate how many block groups to destage using the determined window size. Let d_t denote the number of destage I/Os to perform at time t. Then

d_t = (min(v_t, w_t) × τ_t − C_t) / B   (3)

where τ_t is the time elapsed between t − 1 and t in milliseconds, B is the block group size, and C_t is the amount of pending I/O requests at time t in bytes. C_t represents the remote accesses from normal file system operations. Destaging happens only if d_t > 0. The destaging behavior of our implementation following this design is examined in Section 5.6.
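A compact sketch of this computation, assuming per-call inputs for the smoothed latencies and pending bytes (parameter names are illustrative, not from the paper's code):

```python
def destage_budget(w_prev, v_prev, R, R_t, L, L_t, tau_ms, pending_bytes,
                   block_group_bytes, gamma=0.5):
    """Apply Eqs. (2) and (3): update the remote and local windows, then
    convert the leftover budget for this interval into a number of block
    groups to destage."""
    w = (1 - gamma) * w_prev + gamma * (R / R_t) * w_prev          # Eq. (2)
    v = (1 - gamma) * v_prev + gamma * (L / L_t) * v_prev          # local analogue
    d = (min(w, v) * tau_ms - pending_bytes) / block_group_bytes   # Eq. (3)
    return w, v, max(0, int(d))    # destage only if d > 0
```

If the observed remote latency R_t rises above the target R, the window w shrinks and destaging backs off, which is the behavior observed in Fig. 10 when extra network load is applied.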

4 Implementation

Fig. 5. Implementing vStore using the Xen blktap mechanism. Block requests from an application process in the guest domain pass through the blkfront driver and the blktap kernel module to the tapdisk process in Domain-0 user space (created by the blktap control process), where the vStore module reads or writes the local disk and the remote disk by submitting and polling asynchronous I/O through the Linux AIO library.

We implemented vStore using Xen's blktap interface [28], as shown in Fig. 5. The blktap mechanism redirects a VM's disk I/O requests to a tapdisk process running in the user space of Domain-0. In a para-virtualized VM, user applications read from or write to the blkfront device. Normally blkfront connects to blkback, and all block traffic is delivered to it. If blktap is enabled, blktap replaces blkback, and all block traffic is instead redirected to the tapdisk process.


Overall, the blktap mechanism provides a convenient way to intercept block traffic and implement new functionality in user space.

Xen ships with several types of tapdisks, and the tapdisk process opens a block device using the specified disk type. A disk type is simply a set of callback functions such as open, close, read, write, docallback, and submit. Among the existing disk types, the synchronous I/O type uses normal read and write system calls to handle each incoming block I/O, while the AIO-based disk type uses the Linux AIO library to issue multiple block requests in a batch. vStore also implements this predefined set of callback functions and registers itself with tapdisk as another disk type. vStore is based on the asynchronous I/O mechanism: it submits requests to the Linux AIO library and periodically polls for completed I/Os. Thus, the internal structure of vStore is an event-driven architecture. We have also implemented a synchronous-I/O version of vStore; however, since it serializes all block requests from the VM, it shows low performance. We mainly use the AIO version for the evaluation.
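The actual module is C code registered with tapdisk and driven by the Linux AIO library; the following is only a structural sketch, in Python, of the submit-and-poll event loop described above, with hypothetical aio and cache_logic helpers:

```python
def vstore_event_loop(wait_queue, aio, cache_logic):
    """Event-driven skeleton: requests wait until their cache entry is free of
    conflicts, cache handling turns them into asynchronous I/Os, and polled
    completions drive responses back to the VM or follow-up I/Os."""
    while True:
        for req in wait_queue.ready_requests():    # no pending eviction/update conflict
            for io in cache_logic.plan_ios(req):   # local cache and/or remote I/Os
                aio.submit(io)
        for done in aio.poll():                    # periodically poll completed I/Os
            cache_logic.on_complete(done)          # may respond to the VM or issue more I/O
```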

5 Experimental Evaluation

5.1 Experimental Settings

                       VM Host Local Disk     Storage Server Disk
Disk Model             Western Digital        Seagate
                       WD400BD-75MRA1         ST3500320AS
Interface              SATA 1.5 Gb/s          SATA 3.0 Gb/s
Storage Capacity       40 GB                  500 GB
RPM                    7,200                  7,200
Average Latency        4.2 msec               4.16 msec
Average Seek Time      8.9 msec               8.5 msec
Sequential Read Speed  58.2 MB/sec            48.7 MB/sec

Table 2. Specifications of the hard disks used in the experiments.

The experimental environment consists of rack-mounted servers for hosting virtual machines and remote storage servers that provide storage services to the VMs.

Virtual Machines: The rack contains more than 20 servers, each running the Xen 3.1.4 hypervisor. The servers have two Intel Xeon 3.40 GHz CPUs and 2 GB of memory, and they communicate over a 1 Gbps link within the rack. Local storage on each server is about 40 GB, and the servers share an NFS-mounted storage space that holds the VM images for all virtual machines. This shared space is used only to hold VM images and does not represent the remote storage servers described earlier. The remote storage service is provided through a separate storage server over TCP. Volumes provided by the storage server are seen as separate block devices within the VM, and all experiments are done on these block devices.

Each rack-mounted server is configured to run three VMs. Each VM has 512 MB of memory and 4 GB of base storage space. The VMs run Linux kernel 2.6.18-8 with the Fedora Live 7 distribution. Each VM is given a 4GB virtual disk.

Remote Storage Server: For the role of the remote storage server, we set up a separate machine located outside the rack, in a different sub-network. It has an AMD Athlon 64 X2 Dual Core Processor 4200+ and 4 GB of memory. Four physical hard disks are attached through the SATA interface, with a combined capacity of 1.2 TB.


Fig. 6. Visualization of the workload patterns (number of READ and WRITE requests over time) for the four workloads used in the evaluation: (a) Filebench webserver, (b) Filebench varmail, (c) Filebench fileserver, (d) Postmark. Each workload has a different proportion of read and write block requests. For all four workloads, the working-set file generation phase at the beginning is removed from the graph; the gcc workload pattern is omitted.

All disks run at 7,200 rpm. During the experiments, we let each physically distinct hard disk serve the workload of a different VM. This way, we avoid the I/O performance degradation caused by interference among mixed workloads. Our intention is to avoid making the storage the bottleneck, which would mask the effect of the network traffic savings from using vStore. This storage server runs the TCP version of NBD (Network Block Device) [20] servers to export a block interface to the clients. An NBD client sees a /dev/nbd0 device upon loading the nbd module and running the nbd-client program. This /dev/nbd0 block device can be accessed with normal operations such as open, read, write, and close. In our setting, Domain-0 of Xen is the NBD client. Domain-0 maintains multiple nbd devices and assigns them to the VMs. For example, a configuration file specifies mapping /dev/nbd0 as virtual disk /dev/sda4 for a VM.

5.2 Benchmarks

To evaluate vStore's performance and effectiveness, we have used various types of workloads that exhibit different intensities and read-write ratios.

• Filebench: A file system benchmark developed by Sun that allows users to define custom workloads through a scripting language. Among its predefined workloads, we have chosen the webserver, varmail, and fileserver workloads to evaluate vStore.

• Postmark: A benchmark developed by NetApp, intended to represent the many small-file operations generated by Internet software. The benchmark starts by creating files of a specified size and then performs open, append, and delete operations in sequence.

• Gcc build: We have also used a build of the gcc-core-3.0 source package for measuring performance.


                               Filebench                        Postmark  Gcc
                               webserver  varmail  fileserver
Hypervisor-side   Request/s    1300       1635     1114         2828      46.4
(Domain-0)        Total Read   39k        64k      15k          600       3.6k
Block-level Info  Total Write  38k        34k      52k          121k      10k
                  RW Ratio     1:1        1.9:1    1:3.5        1:205     1:2.8
Application-side  IO/s         155.8      261.9    354.5        131       N/A
Information       MB/s         3.1        5.1      4.2          26.6      N/A
                  App RW Ratio 10:1       1:1      1:2          N/A       N/A

Table 3. Description of workload characteristics.

Filebench Benchmark
           Webserver                 Varmail                   Fileserver
           op/s    MB/s  Overhead    op/s    MB/s  Overhead    op/s    MB/s  Overhead
AIO        233.42  4.66              974.58  3.36              393.56  4.7
AIO 512    213.44  4.26  8.6%        925.42  3.2   5.0%        374.58  4.5   4.8%
vStore     205.92  4.08  11.8%       911     3.18  5.4%        363.72  4.32  7.6%

Postmark Benchmark                                    Gcc
           Elapsed     Throughput           Overhead  Elapsed     Overhead
           Time (sec)  Read      Write                Time (sec)
AIO        125         9.7 MB/s  11.8 MB/s            214
AIO 512    133.4       9.1 MB/s  11.1 MB/s  6.7%      214         0%
vStore     135.8       8.9 MB/s  10.9 MB/s  8.6%      214         0%

Table 4. Overhead measurements on the Filebench, Postmark, and Gcc benchmarks.

Fig. 6 shows the time series of read and write requests for the Filebench and Postmark benchmarks. Note that each of the four workloads has a distinct read/write ratio. The webserver workload has an equal number of reads and writes, whereas Postmark has many more writes than reads. Varmail is the most read-dominant among them, and fileserver has more writes than reads. The precise ratios are summarized in Table 3. The read/write ratio on the application side differs from that on the hypervisor side. For all Filebench workloads, the write ratio decreases when requests come down to the block level. This may be because writes are absorbed by the buffer cache at the VFS layer. In terms of workload intensity, Postmark is the most intensive, followed by webserver. The gcc workload is about two orders of magnitude smaller than the other workloads.

5.3 vStore Overhead Measurements

For the overhead measurement, we chose the local disk's performance as the baseline for comparison. Using a remote storage server's performance as the baseline is ambiguous, since the remote storage server can have an arbitrary level of performance: depending on that level, vStore might always perform better or always worse. Only when the remote storage server has the same performance as the local disk vStore is running on is the comparison meaningful. Therefore, our primary performance objective for vStore was to achieve a tolerable overhead compared to the local disk's performance.


From our design of the block group layout, which adds a 512-byte trailer to each 4KB block, we expected the overhead to be at most 12.5% of the native case: 512 bytes is 12.5% of 4KB, so reading it from disk and transferring it adds at most that much overhead.

To measure vStore's overhead against the local disk, we attached a virtual disk to a VM using the aio tapdisk on the file that vStore uses as its cache space. Note that we do not use a remote virtual disk for the aio tapdisk here. This way, all disk requests for the virtual disk land on the local disk's vStore cache space, which allows a fair comparison of vStore's performance with the native aio tapdisk, since aio and vStore now use the same disk area. (In the rest of this section, the aio tapdisk uses a remote virtual disk.) Table 4 presents vStore's performance overhead. The row labeled AIO represents the native disk performance, obtained using Xen's aio tapdisk, which uses the Linux asynchronous I/O library [5] for I/O handling. In addition, we compared vStore with a modified version of AIO, labeled AIO 512. For AIO 512, the normal aio tapdisk's reads and writes of 4KB blocks are padded with 512 bytes at the end, making all block requests 4KB+512 bytes in size. This modified version reveals the overhead that comes purely from adding a 512-byte trailer to each 4KB block request. It helps us understand how much of vStore's total overhead is due to the 512-byte trailer and how much is due to vStore's logic, and it also serves as the upper bound of the performance vStore can achieve with a 512-byte trailer in the design. For all five workloads in Table 4, vStore exhibits less than 12% overhead. Compared to AIO 512, we can see that vStore's overhead excluding the trailer overhead is about 2% or less. These overheads manifest when the workload intensity is as high as the Filebench or Postmark benchmarks. For lighter workloads such as gcc, the overhead of vStore is unnoticeable, as shown in the last column.

                     AIO       AIO 512           vStore
                     op/s      op/s (overhead)   op/s (overhead)
Single stream read   38.84     34.12 (12.2%)     33.88 (12.8%)
Multi stream read    13.54     12.64 (6.6%)      12.38 (8.6%)
Single stream write  39.5      35.34 (10.5%)     35.28 (10.7%)
Multi stream write   25.28     23.64 (6.5%)      23.64 (6.5%)
Random read          125.34    118.92 (5.1%)     121.22 (3.3%)
Random write         1242.56   1143.68 (8.0%)    1168.86 (5.9%)

Table 5. Overhead measurements on various workloads.

Table 5 presents additional overhead measurements using sequential and random workloads. It shows the effect of workload characteristics (sequentiality, single- vs. multi-threadedness) on the overhead. Multi-threaded workloads show roughly half the overhead of single-threaded ones for stream reads and writes. This may be because, with multiple threads, more requests are served from the VM's buffer cache, reducing the proportion of block requests that actually reach Domain-0 through the tapdisk mechanism. We also notice that random workloads have much higher op/s and somewhat lower vStore overheads; again, randomness increases the cache hit rate and reduces the overhead.

5.4 Effect on Network Bandwidth

In this subsection we evaluate the effect of vStore on saving network traffic. We used an NBD server to provide virtual disks over a 100 Mbps network; the storage server is located in a separate network. We measured the number of packets and the number of bytes transferred between the VM host and the storage server for all five workloads.


Fig. 7. Comparison of the network traffic (number of packets and number of bytes versus number of I/Os) generated under the AIO and vStore settings for three Filebench workloads: (a) webserver, (b) varmail, (c) fileserver.

Fig. 7 illustrates the time series of generated traffic for the webserver, varmail, and fileserver workloads of Filebench. The amount of traffic generated by vStore is almost always less than the AIO traffic, and we can also see a diminishing trend in vStore's traffic volume for varmail and fileserver. Table 6 gives the actual network traffic savings achieved by vStore. The amount of savings correlates roughly with the read-write ratio of the workloads. Recall that the fileserver and Postmark workloads have a high write proportion. More write traffic implies that more traffic is absorbed by vStore, because writes do not require fetching data over the network. The higher traffic reduction of the fileserver workload (43.4% of bytes) compared with varmail (38.6%) and webserver (28.9%) confirms this. The same relationship holds between Postmark and the others.

            Number of Packets              Number of Bytes
            AIO      vStore   Saving       AIO        vStore   Saving
webserver   667k     291k     56.3%        756 MB     538 MB   28.9%
varmail     593k     204k     65.5%        670 MB     412 MB   38.6%
fileserver  636k     206k     67.6%        650 MB     368 MB   43.4%
postmark    1991k    138      99.9%        1743 MB    3 MB     99.9%
gcc         35k      12k      65.4%        35 MB      22 MB    37.4%

Table 6. Comparison of network traffic savings.

The Postmark benchmark shows 99% network traffic savings according to Table 6. This is because Postmark is highly write-dominant in the early part of the benchmark run: its operations are file creation, append, read, and delete, so the subsequent reads are absorbed by vStore because they target previously created data. This behavior may be present in only a subset of real-world applications, but it shows that vStore benefits more when the workload is more write-oriented. Fig. 8(b) shows the cumulative packets for the gcc workload. In the early part of the gcc run, vStore generates more packets than AIO, because vStore reads remote data at the granularity of block groups, which is 128 KB. Once sufficient blocks are cached, however, the network traffic flattens out and the run finishes with about 65% traffic savings.


Fig. 8. Postmark and Gcc results for the AIO and vStore settings: (a) Postmark (number of packets versus number of block requests), (b) cumulative packets for Gcc (number of packets versus number of block requests).

Fig. 9. Performance degradation (a) under network saturation and (b) under storage saturation (the storage server being the bottleneck). The Postmark result is reported as total run time (seconds), whereas the others are reported as I/Os per second (iops).

The results in this section are drawn from the initial run of each workload against a cold vStore cache; subsequent runs of the workloads generate almost no traffic. In real environments, the virtual disks initially attached to a VM would be empty and would be filled through writes first, so it is unlikely that vStore would encounter a large volume of reads of unseen blocks. In the experiments, however, we disabled vStore during the initial working-set creation stage and re-enabled it during the measurement stage, in order to test vStore under an unfavorable condition.

5.5 Multiple VM Runs

In this subsection we look at the performance degradation caused by network saturation as well as by storage congestion at the storage server, and demonstrate how vStore can alleviate the problem. We consider two scenarios. In the first scenario, a total of four VMs use virtual disks provided by one remote storage server over a 100 Mbps network on a different subnet. Although 100 Mbps is used rather than 1 Gbps or higher, we believe this does not affect the demonstration of network saturation: at higher bandwidth, it is straightforward to saturate the network by simply activating more VMs. In order to make the network the bottleneck, we placed the four virtual disks (one per VM) on physically separate disks. We then ran multiple VMs with workloads and measured the degradation for AIO and vStore. From Fig. 9(a) we can see that as the number of VMs increases and the network becomes saturated, the performance degrades quickly.


For all workloads we tested, the performance dropped to about 20% to 30% of the single-VM runs. The last white bar shows the result of running four VMs together using vStore: even with all four VMs running together, the performance is better than in the 2-VM case. In the second scenario, we placed all four virtual disks on a single physical disk to observe the effect of storage saturation on performance. Fig. 9(b) shows the results, along with vStore's performance. In our experimental environment, the storage bottleneck has more impact than network saturation. Regardless of the bottleneck type, vStore is able to maintain good performance even when multiple VMs access the storage server at the same time.

5.6 Destaging

Fig. 10. Destaging behavior under two configurations, (1) R = 20 ms, γ = 0.2 and (2) R = 25 ms, γ = 0.5 (with a no-destaging baseline in (a)): (a) pollution rate change over time, (b) change in the number of destage I/Os issued, (c) change in the response time for remote storage.

Fig. 10 illustrates the behavior of vStore's destaging implementation. The workload used is Filebench varmail, and destaging is set to start at a 10% pollution threshold. We tested destaging with two configurations: (1) R = 20 ms with γ = 0.2 and (2) R = 25 ms with γ = 0.5. Configuration (1) represents less aggressive destaging. Fig. 10(a) shows the change in pollution for the three cases. For both destaging configurations, destaging starts to reduce the number of dirty blocks at around 20 seconds, with configuration (2) destaging at a faster rate. We also changed the network load on the network attached storage to see the effect of latency changes on destaging: while configuration (2) was running, we performed heavy network copying starting at time 90 and continuing for 30 seconds. Figs. 10(b) and (c) show these perturbations. The response time for configuration (2) suddenly increases at that point; at the same moment its destage count drops sharply to 0 and stays there. After the network load is removed, the destage count recovers.


6 Related Work

Parallax [17] provides storage services to virtual machines, with a focus on fast and efficient snapshots. The authors also briefly touch upon the issue of using the local storage space as a cache, but they treat it as a write-log-style buffer to absorb bursts of write traffic. In vStore, by contrast, we are interested in utilizing the local storage space as a set-associative cache to support normal storage operations as much as possible, and our goal is more about avoiding heavy network traffic when using remote storage space.

The idea of using a disk as a cache for another disk has been explored in the work on Disk Caching Disks (DCD) [13, 21]. DCD uses a small log disk, called the cache-disk, as an extension of a small NVRAM buffer on top of the data-disk. Small write requests are collected in the NVRAM buffer, and when it is full, the contents are dumped to the cache-disk. When the data-disk becomes idle, destaging transfers data from the cache-disk to the data-disk. This has been extended to a scenario in which the two disks are separated over the network using the iSCSI protocol [11]. In these works, the caching space is used to hold write logs, similar to the techniques used in log-structured file systems [23]. We do not adopt this caching technique because we want to support efficient disk operations for both reads and writes: collecting write logs optimizes write operations but would hurt read performance.

Much prior work addresses providing block-level storage through storage servers. FAB [24], Petal [16], and Data ONTAP GX [7] are research efforts toward building storage servers from distributed or clustered physical storage. Their primary goal is to provide a unified view of storage space across multiple storage devices or filers. vStore is complementary to these systems, since it does not restrict the type of storage server as long as a block interface is available to the hypervisor; our focus is on the client-side storage system.

7 Conclusion

Designing a storage system for Cloud Computing is a challenging task. A high degree of virtualization increases the demand for storage space, which requires the use of remote storage. However, uncontrolled access to remote storage from a large number of virtual machines can easily saturate the networking infrastructure and affect all systems sharing the network. We have developed vStore to address these issues. vStore uses local storage space to absorb a large portion of the network accesses generated by VMs. Experiments show that the overhead is less than 20% and that vStore eliminates 60% to 99% of the network traffic.

This work is the first step towards our larger goal of building a full-fledged Cloud storage infrastructure. This infrastructure will include features such as cache block transfer between VM hosts to support fast migration, replication of cache blocks to nearby storage (possibly at a higher level of the hierarchy) on other hosts to support fast restart of the VMs of a failed host, and an intelligent workload-balancing mechanism that chooses between local and remote storage for performance optimization. Our main focus in this work is to lay a foundation for these advanced features by studying the feasibility of the vStore approach and understanding the effect of our design choices on performance.

References

1. M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In SIGCOMM '08: Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, pages 63–74, New York, NY, USA, 2008. ACM.
2. Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2/.
3. P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In SOSP '03: Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 164–177, New York, NY, USA, 2003. ACM.
4. L. A. Barroso and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, Lecture #6.
5. S. Bhattacharya, S. Pratt, B. Pulavarty, and J. Morgan. Asynchronous I/O support in Linux 2.5. In Proceedings of the Ottawa Linux Symposium, 2003.
6. R. Campbell, I. Gupta, M. Heath, S. K. Ko, M. Kozuch, M. Kunze, T. Kwan, K. L. Lai, H. Y. Lee, M. Lyons, D. Milojicic, D. O'Hallaron, and Y. C. Soh. Open Cirrus cloud computing testbed: Federated data centers for open source systems and services research. In HOTCLOUD '09: Proceedings of the 2009 Workshop on Hot Topics in Cloud Computing, 2009.
7. M. Eisler, P. Corbett, M. Kazar, D. S. Nydick, and C. Wagner. Data ONTAP GX: A scalable storage cluster. In FAST '07: Proceedings of the 5th USENIX Conference on File and Storage Technologies, pages 23–23, Berkeley, CA, USA, 2007. USENIX Association.
8. Google AppEngine. http://code.google.com/appengine/.
9. A. Gulati, I. Ahmad, and C. A. Waldspurger. PARDA: Proportional allocation of resources for distributed storage access. In FAST '09: Proceedings of the 7th Conference on File and Storage Technologies, pages 85–98, Berkeley, CA, USA, 2009. USENIX Association.
10. C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: A scalable and fault-tolerant network structure for data centers. In SIGCOMM '08: Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, pages 75–86, New York, NY, USA, 2008. ACM.
11. X. He, Q. Yang, and M. Zhang. A caching strategy to improve iSCSI performance. In LCN '02: Proceedings of the 27th Annual IEEE Conference on Local Computer Networks, page 278, Washington, DC, USA, 2002. IEEE Computer Society.
12. D. Howells. FS-Cache: A network filesystem caching facility. In Proceedings of the Linux Symposium, volume 1, 2006.
13. Y. Hu and Q. Yang. DCD—Disk Caching Disk: A new approach for boosting I/O performance. In ISCA '96: Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 169–178, New York, NY, USA, 1996. ACM.
14. J. J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda file system. ACM Trans. Comput. Syst., 10(1):3–25, February 1992.
15. D. Lazenby. Book review: Managing AFS: Andrew File System. Linux J., 1998, September 1998.
16. E. K. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In ASPLOS-VII: Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 84–92, New York, NY, USA, 1996. ACM.
17. D. T. Meyer, G. Aggarwal, B. Cully, G. Lefebvre, M. J. Feeley, N. C. Hutchinson, and A. Warfield. Parallax: Virtual disks for virtual machines. In EuroSys '08: Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems, pages 41–54, New York, NY, USA, 2008. ACM.
18. Microsoft Azure. http://www.microsoft.com/azure/.

19. M. Nelson, B. Welch, and J. Ousterhout. Caching in the Sprite network file system. In SOSP '87: Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, pages 3–4, New York, NY, USA, 1987. ACM.
20. Network Block Device. http://nbd.sourceforge.net/.
21. T. Nightingale, Y. Hu, and Q. Yang. The design and implementation of a DCD device driver for UNIX. In ATEC '99: Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 22–22, Berkeley, CA, USA, 1999. USENIX Association.
22. D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus open-source cloud-computing system. In CCGRID '09: Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 124–131, Washington, DC, USA, 2009. IEEE Computer Society.
23. M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst., 10(1):26–52, 1992.
24. Y. Saito, S. Frolund, A. Veitch, A. Merchant, and S. Spence. FAB: Building distributed enterprise disk arrays from commodity components. In ASPLOS-XI: Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 48–58, New York, NY, USA, 2004. ACM.
25. R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and implementation of the Sun Network Filesystem. In Innovations in Internetworking, pages 379–390. Artech House, Inc., Norwood, MA, USA, 1988.
26. E. Van Hensbergen and M. Zhao. Dynamic policy disk caching for storage networking. IBM Research Report RC24123, 2006.
27. D. X. Wei, C. Jin, S. H. Low, and S. Hegde. FAST TCP: Motivation, architecture, algorithms, performance. IEEE/ACM Trans. Netw., 14(6):1246–1259, 2006.
28. Xen blktap Overview. http://wiki.xensource.com/xenwiki/blktap.