Bidirectional data verification for cloud storage


Mohammad Iftekhar Husain a,*, Steven Y. Ko b, Steve Uurtamo b, Atri Rudra b, Ramalingam Sridhar b

a Department of Computer Science, California State Polytechnic University, Pomona, CA, United States
b Department of Computer Science & Engineering, University at Buffalo, The State University of New York, NY, United States
* Corresponding author at: Department of Computer Science, Cal Poly Pomona, Pomona, CA. Tel.: 909 869 2022. E-mail address: [email protected] (M.I. Husain).

Article info

Article history: Received 13 September 2013; Received in revised form 25 February 2014; Accepted 6 July 2014; Available online 23 July 2014

Keywords: Storage enforcement; Proof of retrievability; Cloud storage; Proof of data possession; Proof of ownership

Abstract

This paper presents a storage enforcing remote verification scheme, PGV (Pretty Good Verification), as a bidirectional data integrity checking mechanism for cloud storage. At its core, PGV relies on the well-known polynomial hash; we show that the polynomial hash provably possesses the storage enforcement property and is also efficient in terms of performance. In addition to the traditional application of a client verifying the storage content at a remote server, PGV can also be applied to de-duplication scenarios where the server wants to verify whether the client possesses a significant amount of information about a file (and not just partial knowledge/a fingerprint of the file) before granting access to an existing file.

While existing schemes are often developed to handle a malicious adversarial model, we argue that such a model is often too strong an assumption, resulting in over-engineered, resource-intensive mechanisms. Instead, the storage enforcement property of PGV aims at removing a practical incentive for a storage server to cheat in order to save on storage space in a covert adversarial model.

We theoretically prove the power of PGV by combining Kolmogorov complexity and list decoding, and experimentally show the simplicity and low overhead of PGV by comparing it with existing schemes. Altogether, PGV provides a good, practical way to perform storage enforcing remote verification.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

A general structure of the cloud storage data integrity verification problem is the following: a verifier (client) (i) uploads its data to a remote prover (cloud storage), (ii) deletes the local copy of the data (to save on storage), and (iii) at some later point, tries to verify that the prover is storing the data correctly, i.e. assurance of integrity, and that the data can be retrieved when necessary, i.e. retrievability. This verification should take place without retrieving the complete data from the remote storage (to save on bandwidth). Existing hash functions such as MD5 and SHA1 can provide integrity but fail to provide retrievability. This motivated some interesting contributions to this domain such as proof of data possession (PDP) and proof of retrievability (POR) (Ateniese et al., 2007; Bowers et al., 2009a; Golle et al., 2002; Juels and BSK, 2007; Schwarz and Miller, 2006; Wang C et al., 2009; Wang Q et al., 2009).

Existing PDP or POR schemes consider a malicious adversarial model (Canetti, 2006), which translates to assuming that legitimate cloud storage providers such as Amazon will behave arbitrarily to tamper with clients' data. In order to satisfy this strict adversarial model, existing schemes are complex (including modification of the original data; Bowers et al., 2009a; Wang C et al., 2009; Golle et al., 2002) and inefficient in performance (in both time and space requirements; Ateniese et al., 2007; Golle et al., 2002; Juels and BSK, 2007; Wang Q et al., 2009).

The primary motivation for this paper is to show that if we relax this strict adversarial model and focus instead on a more practical adversarial model, we can develop a simpler, far more light-weight verification scheme. Therefore, instead of considering a malicious adversarial model, we consider a covert adversarial model (Aumann and Lindell, 2010). This model assumes that the adversary is willing to cheat if: (i) it has some incentive and (ii) it will not be caught. It nicely captures many real-world scenarios, including remote storage verification, where there is a practical incentive for a provider to cheat in order to save on storage as long as it is not caught. This is more practical since storage is the main commodity for storage providers: a storage provider typically charges its clients primarily for the amount of storage that each client uses. Also, the provider incurs a significant amount of cost in handling and managing the storage for its clients, such as the costs for hard drives, storage area networking, and power consumption (Moore et al., 2007; Allalouf et al., 2009). Thus, there is a practical incentive for a provider to cheat in order to save on storage.



To remove this incentive, a client has to be able to verify that the storage provider is actually committing as much storage space as the amount of data that the client requested to store. In other words, we need a verification scheme with the property of storage enforcement. With this verification, clients can safely assume that they are rightfully paying for the service, i.e., the amount of server-side storage they are getting. In addition, clients should be able to do this without asking the server to return the data it claims to store, to avoid communication overhead.

The main contribution of this paper is a storage enforcing remote verification scheme, PGV (Pretty Good Verification). This means, roughly, that in order to pass our verification, a prover (cloud storage) has to commit as much storage space as the information content of the original data. This removes storage saving as an incentive for the prover to cheat. More specifically, if the prover passes our verification of the original data x with probability ε > 0, then the prover has to store C(x) bits of data, up to a very small additive factor. C(x) is the plain Kolmogorov complexity of x, which is the size of the smallest algorithmic description of x (Li and Vitanyi, 2008; Kolmogorov, 1965). The reason why we enforce C(x) instead of x is that we cannot prevent a prover from compressing x; C(x) is a natural way to represent the amount of information stored in x.

PGV is built upon the polynomial hash (Bierbrauer et al., 1993; Freivalds, 1977). We show that it has the storage enforcement property, i.e., it can be used to verify that a remote party actually stores as much data as it claims to store; in addition, we also show that it has a weaker form of the proof of retrievability property, i.e., it can be used to verify that a remote party has enough information to recreate the data that it claims to store with good probability. Most importantly, the polynomial hash provides these properties with a significant performance benefit compared to existing schemes, as it is just a simple hash function and does not require modifying the original data.

Simply put, PGV works as follows: the client (i) picks a random number (our key) $\beta \ne 0$ from a Galois field (Plank, 2003), (ii) divides the data block into equal-sized symbols $S_0, \ldots, S_{k-1}$, where the symbol size is equal to the field element size, and (iii) computes the polynomial hash $H_c = \sum_{i=0}^{k-1} S_i \beta^i$. The client then stores β and $H_c$ locally, sends the data to the cloud server, and deletes the local copy of the data. For verification of a data block, the client sends β to the server. In return, the server computes and sends back the hash value ($H_s$) of the data block using the polynomial hash function and β. If this $H_s$ is equal to the $H_c$ stored locally by the client, then the client declares that the verification is successful.
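To make this flow concrete, below is a minimal sketch of the challenge–response exchange just described. It is not the paper's implementation: it uses arithmetic modulo a large prime instead of the $GF(2^{32})$ arithmetic used in the actual implementation (Sections 4 and 5), and the names (poly_hash, make_token, Q, SYMBOL_BYTES) are illustrative.

```python
import os
import secrets

Q = 2**61 - 1       # illustrative prime field; the paper's implementation uses GF(2^32)
SYMBOL_BYTES = 4    # 4 B symbols, matching the field element size used in the paper

def poly_hash(block: bytes, beta: int) -> int:
    """H = sum_i S_i * beta^i over F_Q, computed in one pass with Horner's rule."""
    if len(block) % SYMBOL_BYTES:                       # pad the last symbol if needed
        block += b"\x00" * (SYMBOL_BYTES - len(block) % SYMBOL_BYTES)
    symbols = [int.from_bytes(block[i:i + SYMBOL_BYTES], "big")
               for i in range(0, len(block), SYMBOL_BYTES)]
    h = 0
    for s in reversed(symbols):                         # (((S_{k-1}b + S_{k-2})b + ...)b + S_0)
        h = (h * beta + s) % Q
    return h

def make_token(block: bytes):
    """Client side: pick a random non-zero key and store (beta, H_c) before upload."""
    beta = secrets.randbelow(Q - 1) + 1
    return beta, poly_hash(block, beta)

def answer_challenge(stored_block: bytes, beta: int) -> int:
    """Server side: recompute the keyed hash of the stored block."""
    return poly_hash(stored_block, beta)

if __name__ == "__main__":
    block = os.urandom(64 * 1024)                        # one 64 KB data block
    beta, h_c = make_token(block)
    assert answer_challenge(block, beta) == h_c          # honest server passes
    tampered = bytes([block[0] ^ 1]) + block[1:]
    assert answer_challenge(tampered, beta) != h_c       # altered block fails (w.h.p.)
```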

For the clarity of presentation, we have illustrated our storage enforcement scheme with only one scenario, where a client verifies a remote server (or multiple servers). However, our scheme is much more flexible and applicable to many scenarios; as we detail in Section 2, it is applicable in proof of ownership (Halevi et al., 2011) applications where a remote server verifies a client before granting access to its data, resulting in the bidirectional verification property of the PGV scheme.

Our proofs, combining Kolmogorov complexity and list decoding, show that this simple construction can guarantee many of the desired properties for remote storage verification, as follows (details in Section 3):

Simplicity of construction: our scheme does not require expensive primitives or significant pre-processing. As a result, our hash (also known as tag, token, or authenticator) generation and verification incur very little overhead.

Storage enforcement: if a cloud storage stores y that is different from the original data x and is able to pass our verification, the cloud storage has to commit as much storage space as the size of the original data, i.e., $|y| \gtrsim |x|$.

Proof of retrievability: if a cloud storage stores y that is different from the original data x and is able to pass our verification, then it is possible to reconstruct x from y.

Non-transformation of data: we do not transform the original data before we store it. More specifically, a cloud storage does not need to store any extra information, and our verification ability has little effect on normal read/write operations. This property also helps in the extended application of PGV to the proof of ownership problem.

Resource optimality: a client requires a minimum amount of storage to store challenges. Also, a server does not need to store anything other than the original data. In addition, verification requires a minimum amount of communication since both β and $H_c$ are very small in size (usually a few hundred bits).

Computationally unbounded adversary: we allow an adversary to have unbounded computational power, unlike cryptographic assumptions where the security holds only against randomized polynomial-time adversaries. We prove our results against much stronger adversaries that are only required to halt on all inputs (but do not have any pre-specified time bounds).

Repeatability: PGV has practical constructions to support the repeatability (unlimited verification) feature, which is common to existing schemes with asymmetric cryptographic support (Section 3.6).

Complete detection: our scheme provides guarantees over the entire data, unlike many previous schemes (Juels and BSK, 2007; Wang C et al., 2009; Bowers et al., 2009a; Ateniese et al., 2007; Wang Q et al., 2009; Golle et al., 2002) that provide verification guarantees only for some fraction of the data (sampling). This fraction can be large, but it is nevertheless not complete. However, for a fair comparison with existing schemes, PGV also provides a sampling construction with a probabilistic guarantee (Section 3.5).

We have implemented our widely applicable storage enforcing remote verification scheme, PGV, and evaluated it through extensive experiments with data files of sizes 1 MB to 1 GB. Our experimental results demonstrate that PGV is significantly more efficient than existing proof of data possession (PDP; Ateniese et al., 2007) and proof of ownership (PoW; Halevi et al., 2011) schemes.

The rest of the paper is organized as follows. Section 2 discusses the diverse applications of the PGV scheme. A detailed construction of the PGV scheme, with proofs of its different features, is presented in Section 3. Section 4 discusses practical considerations for the PGV implementation. We measure the performance of the PGV scheme and compare it with an existing remote verification scheme, PDP, in Section 5. We also compare the performance of PGV with an existing de-duplication mechanism, PoW, in Section 6. Section 7 provides a comprehensive comparison of the features of the existing mechanisms as well as their drawbacks. Finally, Section 8 concludes the paper.

2. Applications of storage enforcing remote verification

Our storage enforcing verification scheme is potentially applicable in many scenarios in addition to the aforementioned scenario of a client verifying a remote server. An additional application is proof of ownership (Halevi et al., 2011; Zheng and Xu, 2012). Consider the case when a client wants to upload its data x to a storage service (such as Dropbox) that uses 'deduplication'. In particular, if x is already stored on the server, then the server will ask the client not to upload x and give it ownership of x. To save on communication, the server asks the client to send it a hash h(x), and if it matches the hash of the stored x on the server, the client is issued ownership of x. As identified by Halevi et al. (2011), this simple deduplication scheme can be abused for malicious purposes. For example, if a malicious user gets hold of the hashes of the files stored on the server (through server break-in or unintended leakage), it will get access to those files. The simple deduplication scheme can also be exploited as an unintended content distribution network. A malicious user can upload a potentially large file and share the hash of the file with accomplices. Now, all these accomplices can present the hash to the storage server and get access to the large file as if it were a content distribution network. A storage enforcing remote verification scheme such as PGV can address such a situation. Since the prover and the verifier are reversed from the client-verifying-storage-provider scenario, the performance restriction is even more severe. The computation that runs at the client side has to be light-weight because of the limited capacity of client devices such as smartphones. Further, we cannot expect the client to modify the data to aid in the verification process (as is needed in some existing PDP/POR schemes).

There are application scenarios where storage enforcement alone suffices. For example, we can use storage enforcement to provide a remote "kill" switch that wipes a remote untrusted hard drive clean, as we can send a random string of the same size as the hard drive and verify whether the remote end stored the same-size string. (For a random string x it is known that with high probability $C(x) \ge |x| - O(1)$, so this will force the remote hard disk to be wiped clean.) Another example is when a storage provider has different classes of servers, e.g., premium servers and normal servers, where premium servers provide better QoS but are more expensive to operate. In this scenario, storage enforcement alone can remove the incentive for the provider to move a premium client's data from the premium servers to normal servers; the premium servers need to store as much data, and the provider cannot reduce the cost of operation by "cheating".

3. Pretty good verification

This section presents the general construction of our storage enforcing remote verification scheme, PGV, and formal proofs of the properties that our scheme provides. For the clarity of presentation, we illustrate our storage enforcement scheme with only one scenario, where a client verifies a remote server (or multiple servers). However, our scheme is much more flexible and applicable to many scenarios; as we detail in Section 2, it is applicable in proof of ownership (Halevi et al., 2011) applications where a remote server verifies a client before granting access to its data.

3.1. Basic scheme

Our basic scheme is very simple: we use a polynomial hash over a finite field to check whether the storage provider still has the data that we sent it. We generate one or more keys at random, hash the data with the keys, and store the keys and hashes together. When it is time to verify, we send a single key to the remote end, ask it to compute the keyed hash and send it back, and compare the result with what is stored.

To compute a polynomial hash, a client needs to pick system-wide parameters once and perform three steps whenever exporting a data block to a remote storage. There are two important system-wide parameters: the finite field size q (e.g., $2^{32}$, or equivalently 4 B) and the block size (e.g., 64 KB). Picking the right values for these parameters involves considerations of performance and security. We discuss these considerations further in our evaluation in Section 5.

Once the client finishes picking the system-wide parameters, the client can compute a polynomial hash for each data block to be exported: (i) pick a random number (our key) β from the field, (ii) divide the data block into equal-sized symbols $S_0, \ldots, S_{k-1}$, where the symbol size is equal to the field element size (e.g., 4 B), and (iii) compute the polynomial hash $H_c = \sum_{i=0}^{k-1} S_i \beta^i$. This is equivalent to computing a single entry in a Reed–Solomon codeword. The rest of the section proves that this simple hash can provide interesting properties for remote verification. This process is per-block. For a file, there is one additional step that divides the file into blocks; the client then repeats the above three steps for each block.
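The per-file step can be sketched as follows, reusing poly_hash/make_token from the sketch in the Introduction; the block size and the token layout are illustrative choices rather than the paper's fixed parameters.

```python
BLOCK_SIZE = 64 * 1024   # e.g., 64 KB blocks, as in the evaluation section

def tokens_for_file(path: str):
    """Split the file into blocks and return one (block_index, beta, H_c) token per block."""
    tokens = []
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            beta, h_c = make_token(block)   # make_token() from the earlier sketch
            tokens.append((index, beta, h_c))
            index += 1
    return tokens
```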

3.2. Summary of our guarantees and proof roadmap

We can summarize our two main results for remote verification as follows:

• (Proved in Section 3.4.1) If the server can pass our verification protocol with a probability greater than ε (where ε can be made very small, e.g. 1%), then the server must provably store almost as much as the Kolmogorov complexity of the user string (data). Another way to state this is that if, for example, ε = 1% and the server stores slightly less than the Kolmogorov complexity C(x), then the server will be caught with probability at least 99%.

• (Proved in Section 3.4.2) If the server can pass our verification protocol with a probability greater than 1/2 + γ (where γ can be very small, e.g. 1%), then the server has enough information to recreate the user string. Another way to state this is that if, for example, γ = 1% and the server cannot recreate the user data, then it will be caught with probability at least 49%.

In a nutshell, even if a storage provider intends to cheat, we are able to prove that they cannot simultaneously satisfy our requests with high probability and save on space usage. Moreover, if they do satisfy our requests with high enough probability, we could reconstruct our data from their request responses alone.

The proofs take advantage of the fact that a polynomial hash is just a single entry in an appropriately chosen Reed–Solomon codeword. The proofs and the scheme itself in fact generalize to any error-correcting code and corresponding keyed hash, although this is out of the scope of this paper.

Our proofs in the rest of this section present the results in terms of multiple servers, not a single server. The reason is not that PGV requires multiple servers; rather, it is because our proofs are easily generalizable to multiple servers.

3.3. Definitions

We use $\mathbb{F}_q$ to denote the finite field over q elements. We also use [n] to denote the set $\{1, 2, \ldots, n\}$. Given any string $x \in \mathbb{F}_q^n$, we use |x| to denote the length of x in bits. Additionally, all logarithms will be base 2 unless otherwise specified.

We now formally define the different parameters of a verification protocol. We use U to denote the user/client. We assume that U wants to store its data $x \in \mathbb{F}_q^k$ among s service providers $P_1, \ldots, P_s$. In the pre-processing step of our setup, U sends x to $P = \{P_1, \ldots, P_s\}$ by dividing it up equally among the s servers; we will denote the chunk sent to server $i \in [s]$ as $x_i \in \mathbb{F}_q^{n/s}$.¹ Each server is then allowed to apply any computable function² to its chunk and to store a string $y_i \in \mathbb{F}_q^n$. Ideally, we would like $y_i = x_i$. However, since the servers can compress x, we would at the very least like to force $|y_i|$ to be as close to $C(x_i)$ as possible. For notational convenience, for any subset $T \subseteq [s]$, we denote by $y_T$ ($x_T$ resp.) the concatenation of the strings $\{y_i\}_{i \in T}$ ($\{x_i\}_{i \in T}$ resp.).

To enforce the conditions above, we design a protocol. We will be primarily concerned with the amount of storage at the client side and the amount of communication, and want to minimize both simultaneously while giving good verification properties. The following definition captures these notions. (The definition also allows for some of the servers to collude.)

¹ We will assume that s divides n. Our protocols do not need the $x_i$'s to have the same size, only that x can be partitioned into $x_1, \ldots, x_s$. For ease of explication, we ignore this possibility for the rest of the paper.

² A computable function can be computed by an algorithm that halts on all its inputs, though there is no pre-specified bound on its time complexity.



Definition 1. Let $s, c, m \ge 1$ and $0 \le r \le s$ be integers, $0 \le \rho \le 1$ be a real, and $f : [q]^n \to \mathbb{R}_{\ge 0}$ be a function. Then an (s, r)-party verification protocol with resource bound (c, m) and verification guarantee $(\rho, f)$ is a randomized protocol with the following guarantee. For any string $x \in [q]^k$, U stores at most m bits and communicates at most c bits with the s servers. At the end, the protocol outputs either a 1 or a 0. Finally, the following is true for any $T \subseteq [s]$ with $|T| \le r$: if the protocol outputs a 1 with probability at least ρ, then, assuming that every server $i \in [s] \setminus T$ followed the protocol and that every server in T possibly colluded with the others, we have $|y_T| \ge f(x_T)$.

Our protocol requires that we first pick a family of "keyed" hash functions. For its speed and flexibility, we use Reed–Solomon codes for the protocol. The protocol will pick random keys and store the corresponding hash values for x (along with the keys) during the pre-processing step. During the verification step, U sends one or more keys (depending upon the parameters desired in the guarantee) as a challenge to the s servers. Throughout this section, we will assume that each server i has an algorithm $A_{x,i}$ such that on challenge β it returns an answer $A_{x,i}(\beta, y_i)$ to U (but we have no pre-specified bounds on the time complexity of A). The protocol then outputs 1 or 0 by applying a (simple) Boolean function to the answers and the stored hash values.

Next, we state some definitions related to codes that we need in our proofs. An (error-correcting) code H with dimension $k \ge 1$ and block length $n \ge k$ over an alphabet of size q is any function $H : [q]^k \to [q]^n$. A linear code H is any error-correcting code that is a linear function, in which case we identify [q] with $\mathbb{F}_q$. A message of a code H is any element in the domain of H. A codeword in a code H is any element in the range of H.

The Hamming distance $\Delta(x, y)$ of two same-length strings is the number of symbols in which they differ. The relative distance δ of a code is $\min_{x \ne y} \Delta(x, y)/n$, where x and y are any two different codewords in the code.

Definition 2. A $(\rho, L)$ list-decodable code is any error-correcting code such that for every vector e in the ambient space of the code, the set of codewords that are at Hamming distance ρn or less from e has size L or less.

Geometric intuition for a $(\rho, L)$ list-decodable code is that it is one where Hamming balls of radius ρn centered at arbitrary vectors in $[q]^n$ always contain L or fewer codewords.

Next we define Kolmogorov complexity.

Definition 3. The plain Kolmogorov complexity C(x) of a string x is the minimum sum of the sizes of a compressed representation of x, its decoding algorithm D, and a reference universal Turing machine T that runs the decoding algorithm.

Because the reference universal Turing machine size is constant, it is useful to think of C(x) as simply measuring the amount of inherent (i.e. incompressible) information in a string x.

Most strings cannot be compressed beyond a constant number of bits. This is seen by a simple counting argument. C(x) measures the extent to which this is the case for a given string.
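For concreteness, the counting argument can be spelled out as follows (a standard fact, made explicit here rather than taken verbatim from the paper): every string with small complexity has a short description, and there are few short descriptions,

\[
  \#\{x \in \{0,1\}^n : C(x) < n - c\} \;\le\; \sum_{i=0}^{n-c-1} 2^i \;=\; 2^{n-c} - 1 \;<\; 2^{-c} \cdot 2^n ,
\]

so all but a $2^{-c}$ fraction of the $2^n$ strings of length n satisfy $C(x) \ge n - c$ (up to the additive constant contributed by the reference machine).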

3.4. Protocol guarantees

For our hash function, we let $H : \mathbb{F}_q^k \to \mathbb{F}_q^n$ be the Reed–Solomon code, with n = q. Recall that for such a code, given a message $x = (x_0, \ldots, x_{k-1}) \in \mathbb{F}_q^k$, the codeword is given by $H(x) = (P_x(\beta))_{\beta \in \mathbb{F}_q}$, where $P_x(Y) = \sum_{i=0}^{k-1} x_i Y^i$. It is well known that such a code H has distance $n - k + 1$. By the Johnson bound (Guruswami, 2004), H is $(1 - \epsilon, 2q^2)$-list decodable, provided $\epsilon \ge \sqrt{k/q}$.

Note that given any $x \in \mathbb{F}_q^k$ and a random $\beta \in [n]$, $H(x)_\beta$ corresponds to the widely used polynomial hash. Further, $H(x)_\beta$ can be computed in one pass over x with storage of only a constant number of $\mathbb{F}_q$ elements. (Keep in mind that after reading each entry in x, the algorithm just needs to perform one addition and one multiplication over $\mathbb{F}_q$.)

Using such a hash, we obtain the following guarantees:

Theorem 1. Let ε > 0, q be a prime power, and $s \ge 1$ be an integer. Then:

(i) Provided $k \le \epsilon^2 s q$, there exists an (s, s)-party verification protocol with resource bound $((s+1)\log q, 2s\log q)$ and verification guarantee $(\epsilon, f)$, where for any $x \in \mathbb{F}_q^k$, $f(x) = C(x) - O(s\log q)$.

(ii) Let $k \le \epsilon^2 q$. Assuming at most e servers do not respond to challenges, there exists an (r, s)-party verification protocol with resource bound $((2r + e + 1)\log q, 2s\log q)$ and verification guarantee $(\epsilon, f)$, where for any $x \in \mathbb{F}_q^k$, $f(x) = C(x) - O(s + \log q)$.

Further, in both protocols, honest parties can implement their required computation with a one-pass, $O(\log q)$-space (in bits) and $\tilde{O}(\log q)$-update-time data stream algorithm.

The claim on the computation requirements of the honest parties follows from the fact that the hash value $H(x)_\beta$ can be computed in one pass and with the claimed storage and update time.

3.4.1. Verification by storage enforcement

For part (i) of Theorem 1, we implicitly assume the following: (a) we are primarily interested in whether some server was cheating and not in identifying the cheater(s), and (b) we assume that all servers always reply back (possibly with an incorrect answer).

Proof of Theorem 1 (i). For the proof, we will assume that $H : \mathbb{F}_q^{k/s} \to \mathbb{F}_q^q$ is the Reed–Solomon code that is $(1 - \epsilon, L)$-list decodable with $L = 2q^2$. (Note that the assumption on k implies that $\epsilon \ge \sqrt{(k/s)/q}$, as required.) We begin by specifying the protocol. In the pre-processing step, the client U does the following on input $x = (x_1, \ldots, x_s) \in (\mathbb{F}_q^{k/s})^s$:

1. Generate a random $\beta \in \mathbb{F}_q$.
2. Store $(\beta, \gamma_1 = H(x_1)_\beta, \ldots, \gamma_s = H(x_s)_\beta)$ and send $x_i$ to server i for every $i \in [s]$.

Server i, on receiving $x_i$, saves a string $y_i \in [q]^n$. The server is allowed to use any computable function to obtain $y_i$ from $x_i$.

During the verification phase, U does the following:

1. It sends β to all s servers.
2. It receives $a_i \in \mathbb{F}_q$ from server i for every $i \in [s]$. ($a_i$ is supposed to be $H(x_i)_\beta$.)
3. It outputs 1 (i.e. none of the servers "cheated") if $a_i = \gamma_i$ for every $i \in [s]$; else it outputs a 0.

Here we assume that server i, on receiving the challenge, uses some algorithm $A_{x,i} : \mathbb{F}_q \times [q]^n \to \mathbb{F}_q$ to compute $a_i = A_{x,i}(\beta, y_i)$ and sends $a_i$ back to U.

The claim on the resource usage follows immediately from the protocol specification. Next we prove its verification guarantee. Let $T \subseteq [s]$ be the set of colluding servers. We will prove that $y_T$ is large by contradiction: if not, then using the list decodability of H, we will present a description of $x_T$ of size $< C(x_T)$. Consider the following algorithm that uses $y_T$ and an advice string $v \in (\{0,1\}^{\lceil \log L \rceil})^{|T|}$, which is the concatenation of shorter strings $v_i \in \{0,1\}^{\lceil \log L \rceil}$ for each $i \in T$:

1. For every $j \in T$, compute $z_j = (A_{x,j}(\beta, y_j))_{\beta \in \mathbb{F}_q}$.
2. Do the following for every $j \in T$: by cycling through all $x_j \in \mathbb{F}_q^{k/s}$, retain the set $L_j \subseteq \mathbb{F}_q^{k/s}$ such that for every $u \in L_j$, $\Delta(H(u), z_j) \le \rho n$.
3. For each $j \in T$, let $w_j$ be the $v_j$-th string from $L_j$.
4. Output the concatenation of $\{w_j\}_{j \in T}$.

Note that by the definition of ρ, $\Delta(z_j, H(x_j)) \le \rho q$ for every $j \in T$. Next, since H is $(\rho, L)$-list decodable, for every $j \in T$ there exists an advice string $v_j$ such that $w_j = x_j$. Thus, there exists an advice string v such that the algorithm above outputs $x_T$. Further, since H is the Reed–Solomon code, there is an algorithm E that can compute a description of H from k, q and s.³ (Note that using this description, we can generate any codeword H(u) in Step 2.) Thus, we have a description of $x_T$ of size $|y_T| + |v| + \sum_{j \in T}|A_{x,j}| + |E| + (O(\log q + \log k + \log s) + s)$ (where the term in parentheses is for encoding the different parameters and T), which means that if $|y_T| < C(x_T) - |v| - \sum_{j \in T}|A_{x,j}| - |E| - (s + O(\log q + \log k + \log s)) = f(x)$, then we have a description of $x_T$ of size $< C(x_T)$, which is a contradiction. Note that $|v| \le s\lceil \log L \rceil = O(s\log q)$, which implies that we need to have $f(x) = C(x) - O(s\log q)$, as desired. □

3.4.2. Proof of retrievability

We now argue how our protocol in part (i) gives a proof of retrievability. (The argument is already present in the proof above, but we make it explicit here.) For simplicity we focus on the case of s = 1. As observed earlier, the Reed–Solomon code $H : \mathbb{F}_q^k \to \mathbb{F}_q^q$ has distance $q - k + 1$ and hence has relative distance $\delta \ge 1 - k/q \ge 1 - \epsilon^2$ (where the last inequality follows from our bound on k). Let $\rho < 1/2 - \epsilon^2/2 \le \delta/2$. Let y be the string stored by the server for the client string x. Further, assume that for a random $\beta \in \mathbb{F}_q$, the server is able to return the correct answer (i.e. $H(x)_\beta$) with probability at least $1 - \rho$ (using an algorithm $A_x$). Then we claim that the server has enough information to recover x. The recovery algorithm (i.e. the algorithm to retrieve x from y) is the same as the four-step algorithm above, except the list output in Step 2 will only have x in it (and hence Steps 3 and 4 are not required).⁴

In particular, the algorithm will be as follows:

• Compute $z = (A_x(\beta, y))_{\beta \in \mathbb{F}_q}$.

• Run the unique decoding algorithm for the Reed–Solomon code on z and return the computed message x′.

The reason we will have x′ = x is that the string z computed in the first step has H(x) as its unique closest codeword. Further, the second step can be implemented by any efficient unique decoding algorithm for the Reed–Solomon code (e.g. the well-known Berlekamp–Massey algorithm).

One somewhat unsatisfactory part of the argument above is that the server can cheat with probability $1/2 + \epsilon^2/2$. We can decrease this probability at the expense of increasing the client's storage.

3.4.3. Catching the cheaters and handling unresponsive provers

We now observe that since the protocol in the proof of Theorem 1 part (i) checks each answer $a_i$ individually to see if it is the same as $\gamma_i$, it can easily handle the case when some prover does not reply back at all. Additionally, if the protocol outputs a 0, then it knows that at least one of the provers in the colluding set is cheating. (It does not necessarily identify the exact set T.⁵)

At the cost of higher user storage and a stricter bound on the number of colluding provers, we show how to get rid of these shortcomings.

A Reed–Solomon code $RS : \mathbb{F}_q^m \to \mathbb{F}_q^\ell$ can be represented as a systematic code (i.e. the first m symbols in any codeword are exactly the corresponding message) and can correct r errors and e erasures as long as $2r + e \le \ell - m$. Further, one can correct r errors and e erasures in $O(\ell^3)$ time. The main idea in the following result is to follow the same protocol as before, but instead of storing all s hashes, U only stores the parity symbols of the corresponding Reed–Solomon codeword.
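For instance (an illustrative choice of parameters, not taken from the paper), with m = s = 10 provers and a tolerance of r = 2 erroneous plus e = 3 unresponsive provers, the bound above requires

\[
  \ell \;\ge\; 2r + e + s \;=\; 2 \cdot 2 + 3 + 10 \;=\; 17 ,
\]

so U stores only the $\ell - s = 2r + e = 7$ parity symbols (plus the key β) rather than all s = 10 hashes; this is exactly the choice $\ell = 2r + e + s$ made in the proof below.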

Proof of Theorem 1 (ii). Let $H : \mathbb{F}_q^k \to \mathbb{F}_q^q$ be the Reed–Solomon code that is $(1 - \epsilon, L)$-list decodable for $L = 2q^2$. (Note that the assumption on k implies that $\epsilon \ge \sqrt{k/q}$, as required.) We begin by specifying the protocol. First, define $\hat{x}_i$, for $i \in [s]$, to be the string $x_i$ extended to a vector in $\mathbb{F}_q^k$ that has zeros in the positions that do not belong to prover i. Further, for any subset $T \subseteq [s]$, define $\hat{x}_T = \sum_{i \in T} \hat{x}_i$. Finally, let $RS : \mathbb{F}_q^s \to \mathbb{F}_q^\ell$ be a systematic Reed–Solomon code where $\ell = 2r + e + s$.⁶

In the pre-processing step, the verifier U does the following on input $x \in \mathbb{F}_q^k$:

1. Generate a random $\beta \in \mathbb{F}_q$.
2. Compute the vector $v = (H(\hat{x}_1)_\beta, \ldots, H(\hat{x}_s)_\beta) \in \mathbb{F}_q^s$.
3. Store $(\beta, \gamma_1 = RS(v)_{s+1}, \ldots, \gamma_{2r+e} = RS(v)_\ell)$ and send $x_i$ to prover i for every $i \in [s]$.

Prover i, on receiving $x_i$, saves a string $y_i \in [q]^n$. Again, the prover is allowed to use any computable function to obtain $y_i$ from $x_i$.

During the verification phase, U does the following:

1. It sends β to all s provers.
2. For each prover $i \in [s]$, it either receives no response or receives $a_i \in \mathbb{F}_q$. ($a_i$ is supposed to be $H(\hat{x}_i)_\beta$.)
3. It computes the received word $z \in \mathbb{F}_q^\ell$, where for $i \in [s]$, $z_i = ?$ (i.e. an erasure) if the i-th prover does not respond, else $z_i = a_i$; and for $s < i \le \ell$, $z_i = \gamma_{i-s}$.
4. Run the decoding algorithm for RS to compute the set $T' \subseteq [s]$ of error locations. (Note that by Step 2, U already knows the set E of erasures.)

We assume that prover i, on receiving the challenge, uses an algorithm $A_{x,i} : \mathbb{F}_q \times [q]^n \to \mathbb{F}_q$ to compute $a_i = A_{x,i}(\beta, y_i)$ and sends $a_i$ back to U (unless it decides not to respond).

The claim on the resource usage follows immediately from the protocol specification. We now prove the verification guarantee. Let T be the set of colluding provers. We will prove that with probability at least $1 - \rho$, U using the protocol above computes $\emptyset \ne T' \subseteq T$ (and $|y_T|$ is large enough). Fix a $\beta \in [n]$. If for this β, U obtains $T' = \emptyset$, then this implies that for every $i \in [s]$ such that prover i responds, we have $a_i = H(\hat{x}_i)_\beta$. This is because, by our choice of RS, the decoding in Step 4 will return v (which in turn allows us to compute exactly the set $T' \subseteq T$ such that for every $j \in T'$, $a_j \ne H(\hat{x}_j)_\beta$).⁷ Thus, if the protocol outputs a $T' \ne \emptyset$ with probability at least $1 - \rho$ over the random choices of β, this implies that $\Delta(H(\hat{x}_T), (\sum_{j \in T} A_{x,j}(\beta, y_j))_{\beta \in [n]}) \le \rho n$. Using an argument similar to the proof of part (i), this further implies that $|y_T| \ge C(x_T) - s - O(\log(sqL))$. The proof is complete by noting that log L is $O(\log q)$. □

³ In particular, the $k/s \times q$ generator matrix G of H has as its columns the vectors $(1, \alpha, \alpha^2, \ldots, \alpha^{k/s-1})^T$ for every $\alpha \in \mathbb{F}_q$. For every message $x \in \mathbb{F}_q^{k/s}$, the codeword is $H(x) = x \cdot G$.

⁴ One might not have access to the algorithm A; however, one can always send all the challenges $\beta \in \mathbb{F}_q$ to the server. Actually, one might be able to query only $n \le q$ many $\beta \in \mathbb{F}_q$. To make sure the proofs go through, one has to make sure that n is large enough so that $1 - k/n \ge 1 - \epsilon^2$.

⁵ We assume that identifying at least one prover in the colluding set is motivation enough for provers not to collude.

⁶ Note that both H and RS are Reed–Solomon codes, but they are used for different purposes in the proof.

3.5. Sampling

Although PGV prefers, and is able to efficiently handle, complete verification of the data (i.e. all the data blocks in a file), PDP uses sampling to scale with increasing file sizes. To make a fair comparison, we now discuss a probabilistic framework of error detection using PGV. Let us assume a scenario where the server cheats⁸ on t blocks out of an n-block file. Let $r_{PDP}$ and $r_{PGV}$ be the number of sampled blocks during the verification in PDP and PGV, respectively. Let p be the probability that the server can avoid detection of its cheating. For PDP, p is the probability that none of the sampled blocks match the blocks on which the server cheats; i.e., we have $p = (1 - t/n)^{r_{PDP}}$, which leads to $r_{PDP} = (\log(1/p))/(\log(n/(n-t)))$. For PGV, p has two components: one due to the sampling error and another due to the verification error ε (we get the latter from Theorem 1). So, $p = (1 - t/n)^{r_{PGV}} + \epsilon$. If we use a weight $0 \le \alpha \le 1$ as a factor of p for these two components, we have $(1 - \alpha)p = (1 - t/n)^{r_{PGV}}$ and $\alpha p = \epsilon$. The first term gives us $r_{PGV} = (\log(1/((1-\alpha)p)))/(\log(n/(n-t)))$, which is equivalent to $r_{PGV} = r_{PDP} + (\log(1/(1-\alpha)))/(\log(n/(n-t)))$. When t = 1% of n, if we want a 99% detection rate (i.e. p = 1%) and α = 8/10, we get $r_{PGV} = r_{PDP} + 23$ blocks. Now, from Theorem 1, we need $\epsilon \ge \sqrt{k/q}$, i.e. $k \le \epsilon^2 q$, where q is the number of field elements and k is the number of symbols in each block. In turn, this bounds the block size for PGV to be $b \le \epsilon^2 q (\log_2 q)/8$ bytes. When we use $GF(2^{32})$, if we want a 99% detection rate (p = 1%), we get $b = 1677\alpha^2$ KB, since $\alpha p = \epsilon$. For α = 8/10, the block size can be as large as 1.04 MB.
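As a quick numerical check of the block-size bound (arithmetic only, using the parameters just stated and 4 B symbols so that b = 4k bytes):

\[
  b \;\le\; \frac{\epsilon^2 q \log_2 q}{8}
    \;=\; (\alpha p)^2 \cdot 2^{32} \cdot 4 \text{ B}
    \;\approx\; 1677\,\alpha^2 \text{ KB},
\]

which for α = 8/10 gives roughly 1073 KB, i.e. the block size of about 1 MB quoted above.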

3.6. Repeatability

PGV has bounded repeatability, i.e., by itself it does not allow an unlimited number of remote verifications. We show that there are practical ways to overcome this issue since the token generation and verification times are very fast in PGV (Section 5). Also, for the proof of ownership problem, repeatability is not an issue since the server will always have the original data. The client/verifier can keep a buffer of tokens when it sends out the data to the outsourced storage. Now, when a normal read access occurs, the client can check whether it has enough tokens in the buffer. If not, it can regenerate new tokens and keep the buffer up-to-date. Repeating this process, PGV can handle unlimited verification by piggybacking the generation of new tokens on regular block access operations.

To make the discussion more concrete, let us compare PDP (which has unbounded repeatability) and PGV (with the piggybacking scheme). In particular, let $(g_{PDP}, v_{PDP})$ and $(g_{PGV}, v_{PGV})$ be the generation time and verification time for one block for PDP and PGV, respectively. Further define $R_g = g_{PDP}/g_{PGV}$. Let B be the maximum number of tokens that PGV keeps at any point of time per block. Since PDP has unbounded repeatability, it only needs to store one token per block. Next we try to figure out what B should be in order to keep the overall overhead of PGV still smaller than that of PDP.

At the beginning, both PDP and PGV need to generate the tokens. Thus, PGV spends $B \cdot g_{PGV}$ time while PDP spends $g_{PDP}$ time. Thus, to be ahead of PDP at this stage, we need to make sure that

$B \le g_{PDP}/g_{PGV} = R_g$.  (1)

Later on, we will have verifications for the block interspersed with normal reads of the block. Let us assume that C > 0 is the maximum number of verification calls between two normal reads for a block. Now consider any "run" of x consecutive verification calls followed by a normal read (so we have $x \le C$). Note that in this case, for PGV, we (i) need to make sure that $B \ge x$, as we cannot "replenish" the buffer until we get to the normal read, and (ii) need to make sure that the overhead incurred by PGV is smaller than that of PDP. For (ii) we note that the time expended by PDP is $x \cdot v_{PDP}$ while the time expended by PGV is $x(v_{PGV} + g_{PGV})$ (as it needs to perform x verifications and replenish as many tokens in its buffer). Thus, PGV would have a lower overhead if

$v_{PGV} + g_{PGV} \le v_{PDP}$.  (2)

Note that we have ended up with the following constraints on B:

$C \le B \le R_g$.

In our experiments (Section 5), we have found that (2) is satisfied. (In fact, we observe that $g_{PGV} \ge v_{PGV}$. Further, $v_{PDP} \ge 10 \cdot g_{PGV}$, which implies that PDP is at least five times as slow as PGV with B tokens during the verification phase.) Finally, we observe in our experiments that $R_g \ge 10$. So if we pick B = C and if we have C < 10 (the latter is a realistic assumption for the cloud computing applications we envision for PGV), then PGV comes out ahead even in the token generation part.

3.7. Extensions

Some of the existing schemes (Ateniese et al., 2007; Wang Q et al., 2009; Wang C et al., 2009; Bowers et al., 2009a) discuss public verifiability in terms of a trusted third-party auditor's ability to verify the remote storage. When the client is honest, this can be very straightforward: the client sends β and $H(x_i)_\beta$ to the auditor, and the auditor uses server i's algorithm $A_{x,i}$ and compares whether $H(x_i)_\beta$ and $A_{x,i}(\beta, y_i)$ are equal. However, this naïve approach fails in arbitration when the client is dishonest, because it can send a wrong β and $H(x_i)_\beta$ to the auditor in the first place. To the best of our knowledge, none of the existing schemes discuss or handle this scenario. We briefly outline two methods of handling public verifiability when the client is dishonest. The first is to retrieve the stored data from other servers and check whether $y_i$ is consistent as a codeword. This approach, although storage efficient, is intense in communication and computation overhead, since it is almost equivalent to the overhead of reconstructing the original data. The other method is to use locally decodable codes (Yekhanin, 2007) to reconstruct only $y_i$ without using all the blocks. However, this approach requires a considerable amount of extra storage, although it is efficient in terms of communication and computation overhead.

3.8. Summary of properties

We can summarize the properties that our scheme provides as follows:

• The amount of storage for the verifier depends only on ε and the logarithm of the block size (Theorem 1).


• No additional storage for the provers (Theorem 1).

• Constant amount of bandwidth usage for a challenge (Theorem 1).

• Reconstructability of the data from challenge responses alone, provided the prover(s) pass the verification protocol with high enough probability (Section 3.4.2).

• Assurance that provers have stored as much data as there is inherent information in the data sent, provided the prover(s) pass the verification protocol with high enough probability (Section 3.4.1).

• Low computational complexity for the verifier and the prover (Theorem 1).

• The prover can store the original, unmodified data and follow the protocol (since our protocol does not modify the original data).

⁷ We will assume that $T \cap E = \emptyset$. If not, just replace T by $T \setminus E$.

⁸ Cheating in the case of PGV means that the server stores less than the Kolmogorov complexity of the block. This example can handle the case when sufficiently many bits from a block have been erased. For PDP, cheating means that the server alters the block so that it cannot re-create the original block.

4. Considerations for implementation

Basic PGV operations include polynomial evaluation as well as addition and multiplication over a finite field. In practice, there are finite field libraries one can use, such as James Plank's Galois field library (Plank, 2003).

4.1. Polynomial evaluation over finite field

The core of PGV is the evaluation of a polynomial over a finite field for token generation and verification. A naïve way of evaluating a polynomial of degree n, $P(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0$, over a finite field at a point β is to iteratively compute the n powers $\beta^i$ for $i = 1, \ldots, n$. This method is inefficient as it requires $2n - 1$ multiplications and n additions.

However, in PGV, we use the efficient technique proposed by Horner (1819) for polynomial evaluation. This technique iteratively evaluates $P(\beta)$ as $(\cdots((a_n\beta + a_{n-1})\beta + a_{n-2})\beta + \cdots + a_1)\beta + a_0$. It is more efficient as it requires only n multiplications and n additions.
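A minimal sketch contrasting the two evaluation strategies (the prime modulus and the function names are illustrative; the paper's implementation evaluates over $GF(2^w)$ via Plank's library):

```python
Q = 2**31 - 1   # illustrative prime field standing in for GF(2^32)

def eval_naive(coeffs, beta):
    """P(beta) = sum_i a_i * beta^i, rebuilding each power: roughly 2n multiplications."""
    result, power = 0, 1
    for a in coeffs:                   # coeffs[i] = a_i
        result = (result + a * power) % Q
        power = (power * beta) % Q
    return result

def eval_horner(coeffs, beta):
    """Horner's rule: (...((a_n*beta + a_{n-1})*beta + ...) + a_1)*beta + a_0; n multiplications."""
    result = 0
    for a in reversed(coeffs):         # start from the highest-degree coefficient
        result = (result * beta + a) % Q
    return result

# Both strategies agree; Horner's rule simply does less work per coefficient.
assert eval_naive([3, 1, 4, 1, 5], 7) == eval_horner([3, 1, 4, 1, 5], 7)
```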

4.2. Implementation of addition and multiplication of field elements

There are known methods to implement efficient addition and multiplication over a finite field. The "standard" representation of elements in $\mathbb{F}_q$ for $q = p^s$ with prime p is as a polynomial of degree $s - 1$ over $\mathbb{F}_p$, and addition of two elements in $\mathbb{F}_q$ is a simple polynomial addition. When p = 2, which will be of interest to us, one can think of the polynomial representation as a bit string and the addition operation as bit-wise xor. For multiplication, it is convenient to think of every non-zero element of $\mathbb{F}_q$ in its "multiplicative" form, i.e. as a power of a "generator" γ. That is, every element $\alpha \in \mathbb{F}_q \setminus \{0\}$ can be represented as $\alpha = \gamma^j$ for some $j \in \{0, 1, \ldots, q-2\}$. Thus, we have $\gamma^i \cdot \gamma^j = \gamma^{(i+j) \bmod (q-1)}$. This implies that the multiplication of two numbers in the standard representation (which will be our default) can be done by two table lookups into a "log" table, to convert from the standard representation into the multiplicative representation, and one lookup into the "anti-log" table to do the converse. Further, we will need one modular addition. The simplicity of these operations contributes to PGV's overall simplicity and low overhead.
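The log/anti-log technique can be sketched as follows for $GF(2^8)$; the field, the polynomial 0x11b, and the generator 0x03 are chosen only to keep the tables small (the paper's implementation uses $GF(2^{16})$/$GF(2^{32})$ via Plank's library), and the names are illustrative.

```python
GF_POLY, GF_GEN, Q = 0x11b, 0x03, 256   # field polynomial, generator, field size

LOG = [0] * Q              # LOG[a] = j such that a = GF_GEN^j (for a != 0)
ANTILOG = [0] * (Q - 1)    # ANTILOG[j] = GF_GEN^j

def _xtime(a):
    """Multiply by x (i.e., by 2), reducing by the field polynomial when bit 8 is set."""
    a <<= 1
    return a ^ GF_POLY if a & 0x100 else a

def _build_tables():
    a = 1                               # GF_GEN^0
    for j in range(Q - 1):
        LOG[a] = j
        ANTILOG[j] = a
        a = _xtime(a) ^ a               # multiply by the generator 0x03 = x + 1

_build_tables()

def gf_add(a, b):
    """Addition in GF(2^s) is bit-wise xor of the standard representations."""
    return a ^ b

def gf_mult(a, b):
    """Two 'log' look-ups, one modular addition, and one 'anti-log' look-up."""
    if a == 0 or b == 0:
        return 0
    return ANTILOG[(LOG[a] + LOG[b]) % (Q - 1)]

assert gf_mult(0x57, 0x83) == 0xc1      # the classic GF(2^8) multiplication example
```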

5. Client verifying storage provider experiments

This section presents various performance aspects of PGV as well as comparisons with the performance of a crypto-based scheme, PDP (Ateniese et al., 2007). We emphasize that we have only considered the main schemes in PDP, although there are many variants of the main schemes. Since the variants of each main scheme make design tradeoffs on multiple properties, it is difficult for us to consider those variants altogether.

Our main metrics of comparison are token (hash) generation time (Section 5.3), challenge generation and verification time (Section 5.4), and storage overhead (Section 5.5). We demonstrate PGV's low overhead in these metrics compared to other schemes, which leads to the scalability of PGV.

5.1. Experimental platform and parameters

We use open source C libraries on an Intel Centrino Duo machine at 1.73 GHz with 2.5 GB memory running Linux kernel 2.6.32 to compare PGV with existing schemes. We use a single thread in all implementations and measurements. For cryptographic operations, we use OpenSSL 0.9.8k. For field operations and erasure coding, we use James Plank's Galois field library (Plank, 2003) and Jerasure library (Plank, 2008).

5.2. Performance of basic primitives

Although we mainly compare PGV to PDP, we recognize that a direct comparison is inherently prone to unfairness. This is due to the fact that they rely on several different types of primitives/techniques. Thus, we first show the benchmark results of the basic primitives. Table 1 shows a benchmark summary of cryptographic algorithms on the experimental platform. Table 2 shows a benchmark summary of Galois field multiplication and encoding. Table 3 summarizes the basic primitives used by different schemes and their instantiated algorithms, including PoW, which we discuss in Section 6.

On the experimental platform, the measurements were performed on files of different sizes ranging from 1 MB to 1024 MB. Schemes are configured to detect any corruption in the file blocks, as well as with sampling. Also, we report performance both including and excluding I/O to give a clear idea about the complete scheme and the core operation overhead, respectively. When comparing with PDP, we have chosen parameters that are favorable for it: we use $GF(2^{32})$ and a 16 KB block size, since PDP performs best at a 16 KB block size.

Table 1
Performance of different primitives (rate in MB/s).

Algorithm   1024 B block   8192 B block
AES         81             83
RC4         184            184
MD5         231            288
SHA1        148            170

Table 2
Performance of multiplication and encoding.

Field       Multiplication (MB/s)   Encoding (MB/s)
GF(2^16)    269                     110
GF(2^32)    106                     25

Table 3
Basic primitives for different schemes.

Scheme   Primitives               Algorithms used
PDP      Modular exponentiation   RSA
PoW      Hash tree                SHA256
PGV      Multiplication           Poly hash



5.3. Token generation time

Since the constructions and terminologies differ from scheme to scheme, we first briefly discuss what we measure as pre-processing and token generation time. We note that a token is also referred to as a hash, tag, or authenticator in the literature. In all of the schemes, the file was divided into blocks of the same size and then tokens were generated for each block.

PGV's token generation time includes the generation of random numbers as keys, file I/O, and poly hash generation for the file blocks. Figure 1 shows PGV's token generation time for various file sizes (64 MB and 256 MB) and block sizes (4 KB, 8 KB, 16 KB, 32 KB, and 64 KB), including and excluding I/O. Although we have done experiments with file sizes up to 1024 MB, for better visibility we plot the data for a few file sizes. As the block size increases, token generation becomes faster. PGV can generate tokens for a 1 GB file with a 64 KB block size in 36.91 s excluding I/O.

For PDP, the token generation time includes the time required to generate the asymmetric keys, file I/O, and per-block token generation time. Figure 4 shows the token generation time comparison between PDP and PGV. (For both schemes we used a block size of 16 KB.) As shown in the figure, PGV has little overhead as it just evaluates the polynomial hash. PDP has higher overhead as it relies on computationally intensive cryptographic primitives; it uses an asymmetric cryptographic primitive based on modular exponentiation. On average, PGV is approximately 25 times faster than PDP in token generation across the different file sizes.

5.4. Challenge generation, verification, and sampling analysis

Now, we compare the performance in challenge generation and verification of different schemes including PGV. This comparison considers the per-verification overhead holistically (excluding the client–server communication overhead, which is one network round-trip); it includes all of the following if required by a scheme: challenge generation at the client side, proof computation at the server side, and verification at the client side.

PGV's verification time includes the poly hash generation using the stored random keys and file I/O. Figure 2 shows PGV's verification time for various file sizes (64 MB and 256 MB) and block sizes (4 KB, 8 KB, 16 KB, 32 KB, and 64 KB), including and excluding I/O. (The verification time is for all the blocks in a file.) Although we have done experiments using files of size up to 1024 MB, again we show a few in the plots for better visibility. PGV is able to verify a file of size 1 GB with 64 KB blocks in 35.41 s excluding I/O.

Figure 5 compares PDP verification time with PGV, considering both completeness (i.e. in both cases we verify all the blocks) and sampling (Section 3.5, with the following setting of parameters: t/n = 1%, p = 1%, α = 0.8 and $q = 2^{32}$). PDP incurs the most overhead due to asymmetric cryptographic operations in all three phases. The figure also shows the sampling verification time with a 99% detection rate. As discussed earlier, in the probabilistic framework, PDP can guarantee a 99% detection rate by sampling 460 blocks; PGV can guarantee the same detection rate by sampling 483 blocks. As in token generation, PGV outperforms PDP since it involves only a number of simple finite field additions and multiplications. On average, PGV verification is almost 120 times faster than PDP verification.

Fig. 1. Token generation time for PGV (time in seconds vs. block size, 4 KB to 64 KB, for 64 MB and 256 MB files, with and without I/O).

Fig. 2. Challenge generation and verification time for PGV (time in seconds vs. block size, 4 KB to 64 KB, for 64 MB and 256 MB files, with and without I/O).

Fig. 3. Tag storage for PGV for different file sizes (token storage in MB vs. file size, for 4 KB to 64 KB blocks).

Fig. 4. Preprocessing and token generation time comparison for PDP and PGV (time in seconds vs. file size, with and without I/O).

Fig. 5. Challenge generation and verification time comparison for PDP and PGV (time in seconds vs. file size, complete verification and 99%-confidence sampling).


As in token generation, PGV outperforms PDP since it involves only a number of simple finite field additions and multiplications. On average, PGV verification is almost 120 times faster than PDP verification.
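As a sanity check on these sampling figures, the snippet below sizes the sample under the standard assumption that a fraction f of blocks is corrupted and each sampled block independently exposes the corruption; it reproduces the roughly 460-block figure for 99% detection. PGV's slightly larger sample (483 blocks) additionally accounts for the α parameter of Section 3.5, which this back-of-the-envelope sketch does not model.

```python
# Back-of-the-envelope sample size: smallest c with 1 - (1 - f)^c >= P_detect,
# assuming a fraction f of blocks is corrupted and sampled blocks are independent.
import math

def blocks_needed(p_detect: float, corrupted_fraction: float) -> int:
    return math.ceil(math.log(1.0 - p_detect) / math.log(1.0 - corrupted_fraction))

print(blocks_needed(0.99, 0.01))   # 459 blocks, in line with the ~460 quoted for PDP
```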

5.5. Storage overhead

A remote storage verification scheme typically requires additional storage on top of the original data for storing tokens and any other necessary information. As with pre-processing, we do not count keys in this overhead, since their size is negligible compared to the original data. This additional storage overhead can be present at the client side, the server side, or both.

Figure 3 shows the PGV storage overhead for different file sizes and block sizes. A comparison of storage overhead with PDP is shown in Fig. 6. PDP stores its tokens (authenticators) on the server side; these tokens are usually 1024 bits long. Unlike such schemes, PGV has no server-side storage overhead; we only need to store tokens at the client side.
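The client-side token storage in Fig. 3 can be estimated directly from the block count. The sketch below assumes 8 bytes of client state per block (for example, a 32-bit key plus a 32-bit hash when q = 2^32); the exact per-token size is an assumption of ours, but the resulting magnitudes are consistent with the figure.

```python
# Rough client-side token storage estimate, assuming ~8 bytes of state per block
# (the per-token size is an assumption, not taken from the paper).
def token_storage_mb(file_mb: int, block_kb: int, bytes_per_token: int = 8) -> float:
    blocks = (file_mb * 1024) // block_kb
    return blocks * bytes_per_token / (1024 * 1024)

for block_kb in (4, 8, 16, 32, 64):
    print(f"{block_kb} KB blocks: {token_storage_mb(1024, block_kb)} MB for a 1 GB file")
# 4 KB blocks -> 2.0 MB; 64 KB blocks -> 0.125 MB
```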

6. Server verifying client experiments

In this section, we compare PGV with PoW (Halevi et al., 2011) in terms of client time (the time required by the prover to compute the proof of ownership of deduplicated data) (Section 6.1). We also compare both schemes' network time to a naive approach where, instead of deduplication, the complete file is transferred over the network (Section 6.2). Since the results reported in Halevi et al. (2011) were obtained on a Xeon 2.53 GHz machine, we measure PGV performance on a similar platform (Xeon 2.4 GHz) for a fair comparison. Other experimental parameters remain the same as described in Section 5.1. In particular, we use only a single thread for performance measurement; for PoW, it is not mentioned whether the measurements were parallelized. Also, we stress that the results reported for PoW are for their efficient scheme, which gives the weakest security guarantee among their proposed schemes.

6.1. Client time

For PGV, client time corresponds to the token generation time described in Section 5.3: it includes the generation of random numbers as keys, file I/O, and poly hash generation for the file blocks. In PoW, this time includes the time for reading the file from disk, computing its SHA256 value, performing the reduction and mixing phases, and computing the Merkle tree over this output. Figure 7 compares client times for PGV and PoW for different file sizes. As the figure shows, PGV is faster than PoW in client time for the reported file sizes from 32 MB to 512 MB. As the file size increases further, the two schemes follow a similar trend; for instance, PoW and PGV take 15.29 s and 16.98 s, respectively, for a 1024 MB file. Note, however, that PoW is optimized only for proof of ownership, whereas PGV supports two-way verification with proof of storage enforcement.

6.2. Network time

Network time in PGV corresponds to sending the random numbers β to the client and getting the computed hashes back from the client. As discussed in Section 5.4, PGV can guarantee 99% detection by checking 483 blocks, so in total PGV transfers around 30 KB of protocol data on the network. For a similar guarantee, PoW reports a 20 KB data transfer requirement, as it needs to transfer 20 leaves to the provers. Figure 8 shows the overall time (client time plus network time) required by PGV and PoW, compared to a naive scheme which always transfers the complete file over the network without deduplication, for different file sizes. Our results show that PGV performs better than both PoW and the naive approach. Note that both PGV and PoW require additional time to verify the hashes, which is 0.5 s for PGV and 0.6 ms for PoW.
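A rough model of the overall-time comparison in Fig. 8: the naive approach ships the whole file over the 100 Mbps link, while PGV ships only about 30 KB of protocol data on top of its client-side hashing time. The sketch below ignores round-trip latency and TCP overhead, and the client-time figure passed in is illustrative, so the numbers are ballpark only.

```python
# Ballpark overall-time model for Fig. 8 (100 Mbps link; latency and TCP
# overhead ignored; the client hashing time is supplied by the caller).
LINK_MBPS = 100

def transfer_seconds(megabytes: float) -> float:
    return megabytes * 8 / LINK_MBPS

def naive_time(file_mb: float) -> float:
    return transfer_seconds(file_mb)          # ship the whole file

def pgv_time(client_hash_seconds: float, protocol_kb: float = 30.0) -> float:
    return client_hash_seconds + transfer_seconds(protocol_kb / 1024)

print(naive_time(512))    # ~41 s to transfer a 512 MB file
print(pgv_time(8.0))      # ~8 s if client-side hashing takes ~8 s (illustrative)
```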

7. Related work

Previous work on remote storage verification spans two broad domains: coding and cryptography. Since there are many schemes that provide different properties, we briefly survey the design space and compare our scheme to others in this section. Although many previous schemes provide a good set of properties, we emphasize that our scheme does so with a simple construction and low performance overhead.

Table 4 summarizes our comparison. It is by no means a comprehensive list; rather, we have included a few schemes to form a representative set. Table 5 presents the corresponding asymptotic performance comparison.

Fig. 6. Storage overhead comparison of PDP and PGV (token storage in MB, log scale, vs. file size in MB).

Fig. 7. Client time comparison for PoW and PGV (time in seconds, log scale, vs. file size in MB).

Fig. 8. Overall time comparison (client time + network time) for PoW and PGV with a naive approach with no deduplication over a 100 Mbps network (time in seconds, log scale, vs. file size in MB).


Table 4. Summary of the existing schemes compared to the proposed scheme, PGV.

Metric/scheme: SFC (Schwarz and Miller, 2006) | C-POR (Shacham and Waters, 2008) | SDS (Ren et al., 2012) | HAIL (Bowers et al., 2009a) | PDP (Ateniese et al., 2007) | EPV (Wang Q et al., 2009) | SEC (Golle et al., 2002) | PGV
Method: Code | Crypto | Code | Code | Crypto | Crypto | Crypto | Code
Proof of storage enforcement (Section 7.1): No | No | No | No | No | No | Yes | Yes
Proof of retrievability (Section 7.2): No | Yes | No | No | No | Yes | No | Yes
Data transformation (Section 7.3): Yes | No | Yes | Yes | No | No | Yes | No
Detection completeness (Section 7.4): Complete | Partial | Partial | Partial | Partial | Partial | Partial | Complete
Sampling (Section 7.4): No | Yes | Yes | Yes | Yes | Yes | Yes | No
Repeatability (Section 7.5): Yes | Yes | No | Yes | Yes | Yes | Yes | No

Table 5. Asymptotic performance comparison of existing schemes with the proposed scheme. We assume that the data contains n symbols, each of size log q bits, and that it is divided into s equal-sized blocks, with the blocks being distributed among the servers. We compare token generation and verification for all s blocks. ξ is the fraction of symbols checked during each verification for the schemes that check partial data corruption. For cryptographic schemes, we assume E_m is the cost of performing modular exponentiation modulo m. Token generation/verification complexity is based on the number of bit operations; storage and communication complexity is based on the number of bits. PGV-S refers to our proposed simple scheme where we generate one token for each server, and PGV-E is a variation where a single token can verify multiple servers. Additional server storage refers to the amount of data that the server stores in addition to the original data, if any.

Operation/scheme: SFC (Schwarz and Miller, 2006) | POR (Juels and BSK, 2007) | SDS (Ren et al., 2012) | HAIL (Bowers et al., 2009a) | PDP (Ateniese et al., 2007) | EPV (Wang Q et al., 2009) | SEC (Golle et al., 2002) | PGV-S | PGV-E
Token generation: O(n log q) | O(n^2 log q) | O(n log q) | O(n log n log q) | O(n E_m) | O(n E_m) | O(n E_m) | O(n log q) | O((n/s) log q)
Challenge generation: O(n log q) | O((n/ξ) log q) | O((n/ξ) log q) | O((n/ξ) log q) | O((n/ξ) E_m) | O((n/ξ) E_m) | O((n/ξ) E_m) | O(n log q) | O(n log q)
Challenge verification: O(n log n log q) | O(1) | O(n log n log q) | O(n log n log q) | O((n/ξ) E_m) | O((n/ξ) E_m) | O((n/ξ) E_m) | O(1) | O(1)
Client storage: O(1) | O(s log q) | O(s log q) | O(1) | O(1) | O(s log m) | O(s log m) | O(s log q) | O(1)
Additional server storage: 0 | 0 | 0 | O(n log q) | O(s log m) | O(1) | O(s log m) | 0 | 0
Communication complexity: O(s log q) | O((n/ξ) log q) | O((n/ξ) log q) | O((n/ξ) log q) | O(s log m) | O(s log m) | O(s log m) | O(s log q) | O(s log q)


7.1. Storage enforcement

Our scheme provides the property of storage enforcement. This means, roughly, that in order to pass our verification, a cloud storage provider has to commit as much storage space as the size of the original data. This property removes storage saving as an incentive for a storage provider to cheat.

To the best of our knowledge, there is only one previous scheme that provides a similar property. Golle et al. (2002) construct a cryptographic primitive called Storage Enforcing Commitment (SEC) that probabilistically guarantees that a server commits as much storage space as the size of the original data in order to correctly answer the challenges of a client. However, SEC does not provide the proof of retrievability guarantee, and it relies on an expensive cryptographic primitive as well as transformation of the original data.

7.2. Proof of retrievability

Our scheme provides a proof of retrievability. This means, roughly, that if a cloud storage provider can pass our verification, then it is possible to reconstruct the original data from whatever the cloud storage actually stores, even if that differs from the original data. Intuitively, this property arises from the fact that if a client collects enough responses from its cloud storage, then the client can reconstruct the original data from the gathered responses. To the best of our knowledge, our scheme is the first to provably provide both storage enforcement and a proof of retrievability by combining Kolmogorov complexity and list decoding, as we detail in Section 3. (Further, we show that proof of retrievability implies storage enforcement.)

Juels and BSK (2007) and Juels and Oprea (2013) were the first to propose this property. Since their scheme supports only a limited number of verifications (the subject of Section 7.5) and no public verifiability, Shacham and Waters (2008) constructed a scheme that provides both unlimited and public verifiability by integrating cryptographic primitives. Dodis et al. (2009) later studied variants of existing POR schemes, and Bowers et al. (2009b) showed how to tune different parameters to achieve various performance goals.

Other schemes study a similar property called provable data possession (PDP), which provides a probabilistic proof that a third party stores a file. Examples include Ateniese et al. (2007), Curtmola et al. (2008), Wang Q et al. (2009), Erway et al. (2009), and Ateniese et al. (2008). Although the guarantee sounds similar to the proof of retrievability, the two are not equivalent: proof of retrievability guarantees both the existence and the extraction of the data, whereas provable data possession guarantees only the former. In addition, these schemes rely on expensive cryptographic primitives, which makes it difficult for them to scale up to a large amount of data without sacrificing completeness. We discuss this further in Section 7.4.

7.3. Data transformation

Unlike our scheme, some of the previous schemes require data transformation such as encryption, erasure coding, and insertion of extra blocks (Schwarz and Miller, 2006; Dodis et al., 2009; Golle et al., 2002; Juels and BSK, 2007; Juels and Oprea, 2013; Ren et al., 2012; Bowers et al., 2009a). The advantage of these schemes is that they can provide additional properties such as confidentiality with encryption and erasure tolerance with encoding. However, the problem is that data transformation penalizes clients dealing with honest storage providers: a client needs significant pre-processing, resulting in substantial performance overhead for normal read and write operations. Also, the cloud storage might need to spend more space than necessary to store the extra information.

Thus, we argue that these properties should be orthogonal to the core verification properties; in our scheme, we make no distinction between encrypted data and plain data, so a client can choose to encrypt data if so desired. Moreover, we can extend our scheme to efficiently integrate erasure coding, as we discuss in Section 3.

7.4. Overhead, sampling, and completeness

In general, schemes that either rely on cryptographic primitives or require data transformation involve significant overhead during pre-processing or verification (Schwarz and Miller, 2006; Dodis et al., 2009; Golle et al., 2002; Juels and BSK, 2007; Juels and Oprea, 2013; Ren et al., 2012; Bowers et al., 2009a; Ateniese et al., 2007, 2008; Curtmola et al., 2008; Wang Q et al., 2009; Erway et al., 2009). Thus, it is difficult for these schemes to scale up and deal with a large amount of data.

A typical way to mitigate this issue is sampling, which provides a probabilistic guarantee: a server accesses a (possibly small) x% of the data in order to verify the entire data with some (possibly high) probability p. Although this technique reduces the overall verification overhead, it comes at the cost of sacrificing completeness; depending on the verification probability, corruption in a small portion of the data can go undetected under sampling, which in turn weakens the guarantee that proof of retrievability provides.

Our scheme favors completeness; however, we also provide probabilistic guarantees based on sampling, as discussed in Section 3. Moreover, our low verification overhead makes it easier to scale up to a large amount of data. Our evaluation in Section 5 demonstrates this scalability; for example, verification of a 10 MB file is possible in less than one-tenth of a second.

7.5. Repeatability

Repeatability is the ability to perform multiple, possibly a very large number of, verifications of the data without pre-processing the data multiple times. Previous schemes such as Schwarz and Miller (2006), Bowers et al. (2009a), Ateniese et al. (2007), Wang Q et al. (2009), and Golle et al. (2002) provide this property.

The low challenge generation overhead of PGV allows us to generate a large number of challenges quickly. Section 5 shows that, for a data size of 1000 KB, it takes roughly 9 s to generate enough challenges to verify monthly for the next 50 years on a commodity PC (Intel Centrino Duo). We also show how PGV can provide repeatability in an alternate way in Section 3.
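The arithmetic behind this repeatability claim is simple: monthly verification for 50 years requires 12 × 50 = 600 independent challenges, so the client can pre-generate 600 (key, token) sets up front. The sketch below reuses the illustrative generate_tokens function from the earlier token-generation sketch and is, again, an illustration rather than the paper's code.

```python
# Pre-generating challenges for 50 years of monthly verification (12 * 50 = 600),
# reusing the illustrative generate_tokens() sketch from earlier.
VERIFICATIONS = 12 * 50

def pregenerate(path: str, block_size: int = 64 * 1024):
    # One independent (key, token) list per future verification round.
    return [generate_tokens(path, block_size) for _ in range(VERIFICATIONS)]
```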

8. Conclusions

This paper presents PGV (Pretty Good Verification), a multi-purpose storage enforcing remote verification scheme that utilizes the polynomial hash for cloud storage verification. In particular, its simplicity allows for bidirectional verification (i.e., the client can verify that the server is storing its data, and the server can verify that a (new) client has the data it claims to have). Using a novel combination of Kolmogorov complexity and list decoding, we prove that the polynomial hash provides a strong storage enforcement property as well as a proof of retrievability. Our experimental results support our claims of low overhead in pre-processing, token generation, verification, and additional storage space compared to existing schemes. Overall, our results show that PGV is a good, practical choice for bidirectional cloud storage verification.


References

Allalouf M, Arbitman Y, Factor M, Kat RI, Meth K, Naor D. Storage modeling for power estimation. In: Proceedings of SYSTOR 2009: the Israeli experimental systems conference (SYSTOR '09). New York, NY, USA: ACM; 2009. p. 3:1–10.
Ateniese G, Burns R, Curtmola R, Herring J, Kissner L, Peterson Z, et al. Provable data possession at untrusted stores. In: Proceedings of the 14th ACM conference on computer and communications security (CCS 2007); 2007.
Ateniese G, Pietro RD, Mancini LV, Tsudik G. Scalable and efficient provable data possession. In: Proceedings of the 4th international conference on security and privacy in communication networks (SecureComm '08); 2008.
Aumann Y, Lindell Y. Security against covert adversaries: efficient protocols for realistic adversaries. J Cryptol 2010;23(2):281–343.
Bierbrauer J, Johansson T, Kabatianskii G, Smeets BJM. On families of hash functions via geometric codes and concatenation. In: Proceedings of the 13th annual international cryptology conference (CRYPTO); 1993. p. 331–42.
Bowers KD, Juels A, Oprea A. HAIL: a high-availability and integrity layer for cloud storage. In: Proceedings of the 16th ACM conference on computer and communications security (CCS 2009); 2009a.
Bowers KD, Juels A, Oprea A. Proofs of retrievability: theory and implementation. In: Proceedings of the 2009 ACM workshop on cloud computing security (CCSW '09); 2009b.
Canetti R. Security and composition of cryptographic protocols: a tutorial (Part I). SIGACT News 2006;37(3):67–92.
Curtmola R, Khan O, Burns R, Ateniese G. MR-PDP: multiple-replica provable data possession. In: Proceedings of the 28th international conference on distributed computing systems (ICDCS '08); 2008.
Dodis Y, Vadhan SP, Wichs D. Proofs of retrievability via hardness amplification. In: Proceedings of the 6th theory of cryptography conference (TCC); 2009.
Erway C, Küpçü A, Papamanthou C, Tamassia R. Dynamic provable data possession. In: Proceedings of the 16th ACM conference on computer and communications security (CCS 2009); 2009.
Freivalds R. Probabilistic machines can use less running time. In: IFIP congress; 1977. p. 839–42.
Golle P, Jarecki S, Mironov I. Cryptographic primitives enforcing communication and storage complexity. In: Proceedings of the 6th international conference on financial cryptography (FC '02); 2002.
Guruswami V. List decoding of error-correcting codes. Lecture notes in computer science, no. 3282. Springer; 2004. ⟨http://www.springer.com/computer/security+and+cryptology/book/978-3-540-24051-8⟩.
Halevi S, Harnik D, Pinkas B, Shulman-Peleg A. Proofs of ownership in remote storage systems. Cryptology ePrint Archive, Report 2011/207, ⟨http://eprint.iacr.org/⟩; 2011.
Horner WG. A new method of solving numerical equations of all orders, by continuous approximation. Philos Trans R Soc Lond 1819;109:308–35.
Juels A, Kaliski Jr BS. PORs: proofs of retrievability for large files. In: Proceedings of the 14th ACM conference on computer and communications security (CCS '07). New York, NY, USA: ACM; 2007. p. 584–97.
Juels A, Oprea A. New approaches to security and availability for cloud data. Commun ACM 2013;56(2):64–73. http://dx.doi.org/10.1145/2408776.2408793.
Kolmogorov AN. Three approaches to the quantitative definition of information. Probl Inf Transm 1965;1(1):1–7.
Li M, Vitányi PMB. An introduction to Kolmogorov complexity and its applications. Graduate texts in computer science. 3rd ed. New York: Springer; 2008.
Moore RL, D'Aoust J, Mcdonald R, Minor D. Disk and tape storage cost models. In: Archiving 2007; 2007. URL ⟨http://www.imaging.org/conferences/archiving2007/details.cfm?pass=21⟩.
Plank JS. GFLIB: C procedures for Galois field arithmetic and Reed–Solomon coding. Website ⟨http://web.eecs.utk.edu/~plank/plank/gflib/index.html⟩; 2003.
Plank JS. Jerasure: a library in C/C++ facilitating erasure coding for storage applications. Website ⟨http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html⟩; 2008.
Ren K, Wang C, Wang Q, et al. Security challenges for the public cloud. IEEE Internet Comput 2012;16(1):69–73.
Schwarz T, Miller EL. Store, forget, and check: using algebraic signatures to check remotely administered storage. In: Proceedings of the IEEE international conference on distributed computing systems (ICDCS '06); 2006.
Shacham H, Waters B. Compact proofs of retrievability. In: Proceedings of the 14th annual international conference on the theory and application of cryptology and information security (ASIACRYPT 2008); 2008.
Wang C, Wang Q, Ren K, Lou W. Ensuring data storage security in cloud computing. In: Proceedings of the 17th IEEE international workshop on quality of service (IWQoS 2009); 2009.
Wang Q, Wang C, Li J, Ren K, Lou W. Enabling public verifiability and data dynamics for storage security in cloud computing. In: Proceedings of the 14th European symposium on research in computer security (ESORICS 2009); 2009.
Yekhanin S. Locally decodable codes and private information retrieval schemes [Ph.D. thesis]. MIT; 2007.
Zheng Q, Xu S. Secure and efficient proof of storage with deduplication. In: Proceedings of the second ACM conference on data and application security and privacy (CODASPY '12). New York, NY, USA: ACM; 2012. p. 1–12. URL ⟨http://doi.acm.org/10.1145/2133601.2133603⟩.
