



Evaluation of information leakage from cloud database service

Mohammad Ahmadian and Dan C. Marinescu
Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
Email: {ahmadian, dcm}@cs.ucf.edu

Abstract

The popularity of Database as a Service (DBaaS) for large-scale web and mobile applications elevates data security risk, so providing a secure method to protect the data becomes a top priority. The aggregation of a large number of databases in the cloud increases the risk that sensitive information will be compromised, even for encrypted data. We define the concept of cloud information leakage, which results from cross-referencing attacks within the pool of cloud-hosted databases, and propose a mitigation method for NoSQL databases that provides leakage-resilient data outsourcing in the presence of adversaries both inside and outside the cloud. Sensitivity analysis of large-scale databases plays a central role in assessing information leakage, but the excessively long response time of on-line analytical processing seriously restricts its usefulness. We therefore propose an Approximate Query Processing (AQP) technique that scales quickly to databases of any size. Extensive experiments on different datasets, together with theoretical analysis, confirm the effectiveness of the proposed method.

1 Introduction

A large number of enterprises currently use cloud Database as a Service (DBaaS) from major Cloud Service Providers (CSPs). For instance, the number of websites hosted by Amazon Web Services (AWS) increased from 6.8M in September 2012 to 11.6M in May 2013, a 71% upsurge [1]. Furthermore, a 67% annual growth rate is predicted for DBaaS by 2019. Cloud users outsource the storage and management of their databases and expect their sensitive information to remain confidential. Given the cloud-related threats, an efficient security scheme is clearly required for data stored and processed in the cloud.

Typical cloud database services guarantee high availability and scalability, but data confidentiality is still an area that needs further exploration. The database is stored on a third-party server that is not assumed to be trusted. The importance of database security, and its impact on a large number of individuals, is illustrated by the consequences of two major security breaches [2, 3]. First, in November 2013, approximately 40 million records were stolen from an unencrypted database used by Target stores; the compromised information included personally identifiable information (PII) and credit card data. According to an SEC (Securities and Exchange Commission) report, two months later a cyberattack on JP Morgan Chase compromised the PII records of 76 million households and 7 million small businesses.

Encryption is a common practice for protecting the privacy of data and queries, but even encrypted data and queries are vulnerable to information leakage in a cloud platform. A database can be encrypted by the data owner before being outsourced to the cloud in such a way that client queries can still be processed on the transformed data. Ultimately, however, encryption does not hide all information about the encrypted data. For instance, a cloud Malicious Insider (MI) can infer sensitive information from ciphertext by cross-referencing other databases hosted within the same data warehouse. Moreover, the collection name, attribute names (or table and field names in an RDBMS), the number of attributes involved in a query, and the query length often reveal sensitive information about the encrypted data. This type of attack on an encrypted database falls within the information leakage class.

The term information leakage describes circumstances in which sensitive information is inadvertently made available to untrusted entities. In the context of this research, information leakage is the ability of an attacker to infer sensitive information either through multiple database searches or through statistical analysis of cloud database queries. In other words, information leakage can arise when a combination of pieces of low-risk information allows high-risk information to be deduced.

To exemplify how risky information leakage can be, we present a motivating case from August 2006, when AOL¹ released the search logs of over 650,000 of its users for research purposes. The data included all of the users' searches over the course of three months; for anonymity, all user names were replaced with random ID numbers. Nevertheless, analyzing all the searches conducted by a user made him or her uniquely identifiable. Merging the leaked data with publicly available datasets revealed even more information and created further threats to users.

Information leakage can occur not only in unencrypted databases but also in encrypted ones, and can be exploited by both external and insider agents. In recent work, a method was developed that couples an encrypted dataset with a secure proxy construction, SecureNoSQL [4], which guarantees that an MI never obtains the decryption keys. The proxy encrypts the queries from the clients and decrypts the query responses from the server. The process is completely transparent, with the clients never involved in encryption or decryption operations. The proxy construction ensures that the MI cannot explicitly access the sensitive information; however, there remains a risk of information leakage from the ciphered datasets. The MI can exploit the leaked information to organize more extensive attacks that amplify the leakage.

¹ An American global on-line mass media corporation.



In on-premises data management systems, an MI has access to only a few databases as sources for organizing a data inference attack, whereas a cloud server has access to millions of datasets belonging to a large variety of enterprises and individuals. With an initial brute-force inference attack, the adversary can extract the hidden information.

As mentioned earlier, a fundamental concern in using DBaaS is the leakage of sensitive information through correlation between different databases hosted by the same CSP. As a result of an inference attack, information leaked about one secret in a database may leak information about other secrets in other databases. In this study, we investigate information leakage from the viewpoint of the cloud infrastructure. To study information leakage in the DBaaS model, we choose the NoSQL database model with its flexible schema, and we focus on information leakage resulting from cross-references within the cloud's aggregated data pool.

In the NoSQL data model, a database is depicted as a collection of documents D = {d_1, . . . , d_n}, and accordingly a document is modeled as a set of key-value pairs ⟨key_i, value_i⟩, each of which represents an attribute of an object. Enforcing a partial security mechanism that covers only a subset of attributes may not provide comprehensive protection; in particular, the protected information could be inferred from low-risk datasets hosted in the same cloud. The main building block of the proposed construction is the insertion of disinformation documents into the collection. In this work, a new information leakage prevention method for encrypted, outsourced databases is suggested, using disinformation documents together with encryption. A mediator architecture known as a secure proxy is designed, acting as an additional layer between the clients and the DBaaS server. The proxy encrypts and decrypts both queries and responses bi-directionally: it intercepts the client's queries, transforms them into encrypted queries, and passes them to the cloud DBaaS server. The query is processed on the server and a collection of documents is returned to the proxy. The returned collection is a combination of valid and forged documents; the proxy decrypts the collection, filters out the fake documents, and forwards the desired documents to the user's application, as sketched below. To avoid the huge overhead of padding the dataset with disinformation documents, we introduce Selective Disinformation Document Padding (SDDP), in which data analysis is used to compose and insert disinformation selectively. Moreover, to improve query processing time over the augmented ciphertext dataset, we investigate encrypted data indexing.
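To make this round trip concrete, the following minimal sketch (our illustration, not the authors' implementation; the reversible stand-in cipher, the in-memory store, and the plaintext tag marker are simplifying assumptions) traces a query through such a proxy:

    # Sketch of the secure-proxy round trip. A real deployment would use the
    # cryptosystems of Section 2.2 and an eTag-based filter (Section 3.2).
    def enc(s: str) -> str:      # stand-in for a deterministic cipher; NOT secure
        return s[::-1]

    def dec(s: str) -> str:
        return s[::-1]

    def proxy_find(store, plain_query, is_fake):
        """Encrypt the query, match it against the ciphertext store,
        decrypt the hits, and filter out disinformation documents."""
        cq = {enc(k): enc(v) for k, v in plain_query.items()}
        hits = [d for d in store if all(d.get(k) == v for k, v in cq.items())]
        plain = [{dec(k): dec(v) for k, v in d.items()} for d in hits]
        return [d for d in plain if not is_fake(d)]

    # One valid and one forged document, both stored as ciphertext.
    store = [{enc("name"): enc("Kate"), enc("tag"): enc("real")},
             {enc("name"): enc("Kate"), enc("tag"): enc("fake")}]
    print(proxy_find(store, {"name": "Kate"}, lambda d: d["tag"] == "fake"))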

1.1 System deployment ecosystem

The proposed solution consists of two parts: (i) a secure proxy and (ii) a cloud DBaaS offering a NoSQL database service. The scheme involves three interested parties: the data owner, the cloud-based NoSQL database service, and the query-issuing users, who may be business partners of the data owner. We assume all participants interact in a public cloud ecosystem. The proposed scheme is easily adaptable to hybrid and community cloud environments, where the security risk is lower than in the public cloud.



1.2 Problem statement

Information linkability (the property of being linkable) is a distinctive aspect of data that makes it possible to link individual pieces of information from different sources. A cloud database service is a data warehouse consisting of thousands of hosted databases from various organizations; information linkability therefore poses a new type of threat to user privacy. A potential attacker could link low-risk pieces of information to extract sensitive information about an entity. For example, a person's metro card connected to a debit card for auto-refill might expose a correlation that reveals his or her employer through the direct-deposit information. Exposing some bank information could in turn disclose further information, such as health records. Potentially, a link between a metro card and private information could be detected and learned.

In a NoSQL database, all features of an entity are represented by attributes, so information linkability takes the form of attribute correlation across multiple databases. We propose to selectively insert documents with false information into the dataset, guided by statistical analysis, in order to deliberately expand the volume of information leaked to the untrusted database service provider and thereby protect the real sensitive information. At the end, we filter out the fake documents and deliver the correct documents to the corresponding client.

1.3 Contribution of this work

The aim of this research is to extend the knowledge of information leakage in new cloud services such as DBaaS. The main contributions of this paper are as follows:

1. We show a method to quantify information leakage resulting from explicit attribute correlations in the cloud data warehouse.

2. We present a method to quantify information leakage due to implicit correlation between attributes.

3. A selective disinformation document padding method is proposed to manage information leakage with limited overhead.

4. We introduce a fast leakage assessment and parameter extraction algorithm that scales to very large databases by using approximate query processing.

1.4 Related work

Query processing on outsourced databases has been investigated in a large body of research [5, 4, 6]. According to these studies, a common practice is to encrypt the database with cryptosystems before outsourcing it to the service provider; the clients' queries are then encrypted in the same way as the data. For instance, CryptDB is a prominent scheme used for SQL databases [5], and SecureNoSQL has been proposed for processing queries over encrypted NoSQL databases in the public cloud [4]. The key advantage of these approaches, which makes them efficient and plausible options, is that they do not modify the database server. In particular, the encrypted data is processed in the same way as plaintext data. This feature preserves all the functionality of the database technology, such as multi-layer indexing, caching, and file management, supporting fast operation and optimization. However, the major drawback of these approaches is information leakage resulting from the use of property-preserving cryptosystems.

A Homomorphic Encryption (HE) scheme allows computations to be performed on ciphertext, so that the result of an operation on the encrypted data decrypts to the result of the same operation on the corresponding plaintext. In cloud computing, HE enables data owners to encrypt and outsource their private data to the CSP for processing without compromising privacy. An Order-Preserving Encryption (OPE) scheme is a deterministic cryptosystem in which the encryption function preserves the ordering of the plaintext inputs in the ciphertext. In cloud DBaaS, aggregate queries such as comparison, min, and max can be executed on data encrypted with an OPE scheme. While HE is theoretically feasible [7], it is impractical due to the enormous overhead involved; OPE offers less protection and leaks critical information about the plaintext data [8, 9]. An acceptable level of security can be achieved with a searchable encryption method such as Oblivious RAM (ORAM) [10, 11]. However, the major problems of ORAM are its efficiency and high computational cost, as well as the excessive communication cost between clients and server [12].

Inference attacks against CryptDB were first proposed by Naveed et al. [13], who designed a series of attacks on databases encrypted under CryptDB, especially those encrypted with property-preserving cryptosystems. Evidently, deterministic and OPE cryptosystems leak critical information such as the frequency and order of the original data, and therefore enable attackers to extract sensitive information. The literature on managing information leakage in the cloud shows a variety of approaches; two major ones are (i) restrictions on the sequence of queries or the query processing time, and (ii) the insertion of fake documents.

A quantitative characterization of correlation-based information leakage, formulated through the capacity of an (n, q)-leakage channel, uses restrictions on the number of queries or the query processing time [14]. The n-channel capacity is defined as the maximum possible depth of queryable information for any user, and the capacity of an n-leakage channel is the probability of accessing a specific sensitive document in n trials. In this approach, a group of documents forms a chain through a track of key-value pairs, and the attacker can locate the head of the chain; by following the footprints, the rest of the sensitive information can be accessed. The attacker is constrained not only by the number of trials, but also by the time necessary to achieve his objectives.

Data privacy combined with entity resolution has been used to provide a solution for information leakage [15]; this is considered one of the first studies on the insertion of fake documents. As more information about an entity is revealed to different service providers in the on-line world, there is a high risk that attackers will link those pieces of information, posing a significant threat to privacy. Information leakage is measured in terms of how much of the complete information about an entity is available to an adversary. Fake information is inserted among the valid information to mislead the attacker with multiple values for an attribute. This method drastically increases the total number of documents and the communication cost; however, using multi-level indexing, the negative effect on query processing time is neutralized.

In most on-line systems, the huge volume of the operational database and other limitations make it infeasible to analyze the level of information leakage between data elements in real time. Random sampling for approximate measurement is a well-known solution that dramatically cuts analysis time, especially if the sample is small enough to fit in main memory. However, approximate measurements based on sample data deviate from measurements over the real database. A newer approach, Sampling-based Approximate Query Processing (S-AQP), provides guaranteed accuracy by bounding the error caused by sampling [16, 17].

Assessing the leakage from large datasets is the kind of problem that can tolerate some degree of inaccuracy, so we adopt the same approach in this work. Multi-level indexing and sampling-based approximate query processing are used as our first technique, to avoid the high latency of exact query processing. The second technique introduced in this work is selective disinformation document insertion, which inserts disinformation documents based on statistical analysis of the data; this decreases the overhead of fake document insertion along with the corresponding computational cost.
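As a minimal sketch of this sampling idea (our illustration in the spirit of S-AQP [16, 17], assuming a uniform random sample and a normal-approximation error bound; the dataset and predicate are hypothetical):

    import math
    import random

    def approx_count(population, predicate, sample_size, z=1.96):
        """Estimate how many records satisfy `predicate` from a uniform random
        sample, with an approximate 95% confidence bound on the error."""
        sample = random.sample(population, sample_size)
        p_hat = sum(predicate(r) for r in sample) / sample_size
        estimate = p_hat * len(population)
        # standard error of the proportion, scaled to the population size
        error = z * len(population) * math.sqrt(p_hat * (1 - p_hat) / sample_size)
        return estimate, error

    records = [{"age": random.randint(18, 90)} for _ in range(1_000_000)]
    est, err = approx_count(records, lambda r: r["age"] > 60, sample_size=10_000)
    print(f"count ~ {est:.0f} +/- {err:.0f}")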

1.5 Paper organization

The paper is organized as follows. In §2 we review the system model, assumptions, and threat model of the cloud database; we also discuss basic notions of security as well as the strengths and weaknesses of some cryptosystems that allow a set of operations to be performed on ciphertext. Risk factors of an untrusted outsourced database service in a cloud environment are explored in §3, mitigations are introduced for the major factors, and a query integrity verification method is presented. Information leakage due to explicit and implicit correlation of data elements, and the corresponding solutions, are addressed in §4. To tackle the scalability problem of sensitivity analysis in very large database systems, a novel solution based on approximate query processing is presented in §5, together with a detailed description of the experiments. Finally, the paper is concluded in §6.

2 System model and assumptions

A crucial requirement for building a secure system is proactive threat analysis, followed by a systematic conversion of the identified threats into system-wide requirements. This section introduces a rigorous model for utilizing threat modeling in building a secure and leakage-resilient cloud database service.



2.1 Threat model

Threats to cloud computing can be analyzed from multiple viewpoints: adversarial views, system security, and threat determination. This study focuses on the adversarial perspective, a holistic, multifaceted procedure that considers the end-to-end security of the whole system. Adversarial threat analysis starts with thinking like an attacker and continues by preparing the corresponding countermeasures. Two classes of threats, external and internal attackers (defined below), are identified in the model, and both are addressed in our proposed solution.

External attacker: An attacker from outside the cloud environment may be able to obtain unauthorized access to the data by applying techniques or tools that monitor the communication between the clients and the cloud servers. External attackers in most cases face a more complex task, since they must bypass firewalls, intrusion detection systems, and other defensive setups without any authorization.

Cloud malicious insiders (MIs): A major side effect of outsourcing a database to the cloud is the risk of unauthorized access by cloud malicious insiders. An internal attacker has the primary advantages of being inside the cloud's protected perimeter and having access to its resources. More specifically, an employee or contractor of a CSP has access to the servers, software, hardware, and user data, and among such personnel there may be MIs. The data protections provided by the CSP can therefore be bypassed by MIs. An intruder who gains access to the internal computing resources of the cloud by illegally obtaining the credentials of a valid user is also categorized as an MI threat. The major risk factors associated with cloud MIs are violations of data confidentiality and integrity and the exploitation of leaked information. All these factors and the corresponding mitigations are discussed in Section 3.

2.2 Cryptosystems for outsourced data store

Data in the cloud environment can be in one of three states: at rest, in transit, or in process. Therefore, any comprehensive data security mechanism must account for the protection criteria of data in each of these three states. The communication channels can be secured using standard HTTP over the Secure Socket Layer (SSL) protocol. Most CSPs provide a web service API that enables developers to use both standard HTTP and the secure HTTPS protocol. The security requirements of data in transit can be fully satisfied by using HTTPS for communication with the cloud; in addition, the endpoint authentication feature of the SSL protocol makes it possible to ensure that clients are communicating with an authentic cloud server.

The basic idea is to encrypt the data before uploading it to the CSP. However, the data would then have to be decrypted by the cloud server before being processed; in other words, the data owner would have to disclose the decryption key to the server. If the decryption key is compromised, data confidentiality is lost. Therefore, the cloud computing model requires a new set of cryptosystems. A cloud developer is responsible for ensuring that data in cloud storage is protected by authentication based on users' credentials; however, highly sensitive data is still at risk of illegitimate access by an MI. Thus, the data should be encrypted before being uploaded to the cloud. Any type of encryption can be used, since cloud storage imposes no required data format. A brief description of the relevant types of cryptosystems follows.

As mentioned before, encryption schemes that support operations on encrypted data are known as Homomorphic Encryption and have a very wide range of applications in cloud computing. A Fully Homomorphic Encryption (FHE) scheme is a cryptosystem that allows the evaluation of arbitrarily complex operations on encrypted data.

Random (RND): In an RND-type encryption scheme, a message is encrypted with a key k and a random Initialization Vector (IV). The scheme is non-deterministic, since encrypting the same message with the same key yields different ciphertexts; this randomness provides the highest level of security and is achievable with different encryption algorithms. We use the Advanced Encryption Standard (AES) in Cipher Block Chaining (CBC) mode for RND encryption. AES is a symmetric block cipher with a key size of 128, 192, or 256 bits and a block size of 128 bits. RND-type schemes are semantically secure against chosen-plaintext attacks and hide all information about the plaintext; as a result, an RND scheme does not allow any efficient computation on the ciphertext. Equation 1 describes encryption and decryption for a block cipher in CBC mode:

C_1 = E_k(P_1 ⊕ IV),   P_1 = IV ⊕ D_k(C_1);
C_j = E_k(P_j ⊕ C_{j−1}),   P_j = C_{j−1} ⊕ D_k(C_j),   for j = 2, . . . , n.   (1)

Deterministic (DET): A DET encryption scheme is a cryptosystem that always produces the same ciphertext for a given pair of plaintext and key. Block ciphers in Electronic Code Book (ECB) mode, or in a mode with a constant initialization vector, are deterministic. A deterministic scheme preserves equality, so the frequencies of the searched keywords leak to the third party. We use AES in ECB mode for DET encryption over document-oriented NoSQL databases. A DET scheme enables the server to process pipeline aggregation stages such as group, count, retrieving distinct values, and equality match² on the fields within an embedded document; the embedded document can maintain its link with the primary document through the application of DET encryption. Equation 2 describes encryption and decryption in a DET scheme:

C_j = E_k(P_j),   P_j = D_k(C_j),   for j = 1, . . . , n,   (2)

where E_k is the encryption algorithm, D_k is the decryption algorithm, k is the secret key, P is a block of plaintext data, and C is a block of ciphered data.

² Equality matches over specific common fields in an embedded document select the documents in the collection where the embedded document contains the specified fields with the specified values.
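The contrast between RND and DET can be demonstrated in a few lines; the sketch below (ours, using the PyCryptodome package rather than the paper's implementation) encrypts the same plaintext twice under each mode:

    # RND (AES-CBC with a fresh IV) vs. DET (AES-ECB) on equal plaintexts.
    # Requires PyCryptodome: pip install pycryptodome
    from Crypto.Cipher import AES
    from Crypto.Random import get_random_bytes
    from Crypto.Util.Padding import pad

    key = get_random_bytes(16)
    msg = pad(b"same plaintext", AES.block_size)

    def rnd_encrypt(m):
        iv = get_random_bytes(16)                     # fresh IV -> non-deterministic
        return iv + AES.new(key, AES.MODE_CBC, iv).encrypt(m)

    def det_encrypt(m):
        return AES.new(key, AES.MODE_ECB).encrypt(m)  # no IV -> deterministic

    print(rnd_encrypt(msg) == rnd_encrypt(msg))       # False: RND hides equality
    print(det_encrypt(msg) == det_encrypt(msg))       # True: DET leaks equality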



Order-Preserving Encryption (OPE): OPE projects the order relation between plaintext data elements onto their ciphertext values. Because OPE leaks the order of the ciphertext, it supports a lower degree of security; even Modular Order-Preserving Encryption (MOPE) [18], an extension of basic OPE intended to improve security, leaks information. Efficient inequality comparisons on encrypted data elements can be performed with OPE, which supports range queries, comparison, min, and max on the ciphertext. We use the algorithm introduced in [9], as implemented for the cloud environment in [8]. Equation 3 expresses the preservation of the plaintext order in the ciphertext:

∀ x, y ∈ Data Domain:   x < y  ⟹  OPE_k(x) < OPE_k(y),   (3)

where OPE_k is the key-based OPE function.
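The scheme actually used here is the one of [9]; the deliberately naive sketch below only illustrates the property of Equation 3, generating an order-preserving "key" as a random strictly increasing table over a small integer domain:

    import random

    def ope_keygen(domain_size, max_gap=8, seed=42):
        """Toy OPE table: a random strictly increasing map over 0..domain_size-1.
        Illustrates Equation 3 only; not the algorithm of [9]."""
        rng = random.Random(seed)
        table, c = [], 0
        for _ in range(domain_size):
            c += rng.randint(1, max_gap)   # positive gap keeps the map increasing
            table.append(c)
        return table

    ope = ope_keygen(256)
    x, y = 42, 200
    assert x < y and ope[x] < ope[y]       # order is preserved on the ciphertexts
    print(ope[x], ope[y])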

Additive Homomorphic Encryption (AHOM): AHOM is a scheme that allows the server to perform computations on ciphertext, with the final result decrypted at the proxy. In spite of sustained research efforts on Fully Homomorphic Encryption (FHE) [19, 20], there is no efficient FHE except for some limited operations. AHOM is formulated in Equation 4, where m_1, m_2 ∈ Z_n are the messages to be encrypted and r_1, r_2 ∈ Z*_n are randomly selected. In other words, the product of two ciphertexts decrypts to the sum of the corresponding plaintexts:

D_k(E_k(m_1, r_1) × E_k(m_2, r_2) mod n²) = m_1 + m_2 (mod n).   (4)

In this work, whenever summation and multiplication operations are needed over encrypted data, we apply the Paillier scheme [21], an instance of AHOM. Encryption can be applied to datasets at various levels of granularity³, from high-granularity data, such as atomic data elements, to low-granularity data, which is essentially the aggregated form of atomic data items. For data store encryption, it is conceivable to use different encryption granularity levels matching the corresponding data granularity: the higher the encryption granularity, the higher the information leakage. For example, encrypting a single attribute leaks frequency information, while encrypting a whole document or collection as a single unit leaks less information. In the current work, we apply well-suited encryption granularities to the different data granularity levels.
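The additive property of Equation 4 can be checked directly with an existing Paillier implementation; the sketch below uses the python-paillier package (pip install phe), which is our choice of library, not necessarily the paper's:

    from phe import paillier

    public_key, private_key = paillier.generate_paillier_keypair()
    c1 = public_key.encrypt(17)    # E_k(m1, r1)
    c2 = public_key.encrypt(25)    # E_k(m2, r2)
    # The library's '+' multiplies the ciphertexts mod n^2, per Equation 4.
    total = private_key.decrypt(c1 + c2)
    assert total == 17 + 25
    print("sum recovered from ciphertexts:", total)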

2.3 Preliminaries

It is crucial for a data owner to provide query processing over an outsourced encrypted dataset with minimum information leakage. It is essential to mention that in the cloud computing model, optimal security without any leakage is practically infeasible, as it imposes significant poly-logarithmic computation and communication overhead. Clouds serve as a shared platform hosting the databases of large numbers of merchants, service providers, social networks, and organizations. As a result of this aggregation, the risk of information leakage to cloud insiders through cross-referencing attacks becomes significant, and the leaked information can be exploited iteratively to organize more severe attacks that obtain sensitive information.

³ Granularity indicates the level of detail in a data store: the more detail, the higher the granularity level.



Accordingly, violation of information confidentiality in cloud DBaaS is a plausible risk that requires a holistic plan acknowledging the major sources of information leakage in light of the cloud threat model. The required concepts, namely access pattern, query pattern, and attribute classes, are defined as follows.

Definition 1 (Access Pattern). The physical locations of the documents containing the trapdoor keyword in the collection are known as the access pattern. In particular, for a query q over a dataset D on an untrusted server (adversary), any query response leaks access pattern information to the adversary. The goal is to process the query q = ⟨w, D⟩, which searches for keyword w over the dataset of n documents D = {d_1, . . . , d_n}. In an encrypted dataset, the adversary is not able to learn w, but simply finds out which set of documents contains w, and can thereby learn the access pattern. For instance, over a specific period of time, some encrypted documents in the collection may be retrieved more frequently than the rest, which reveals the access pattern to the server.

Definition 2 (Query Pattern). The query pattern is the information a server can obtain from the aggregation of queries over an operational period; e.g., a CSP can identify from the query history which documents are queried most frequently.

Definition 3 (Attribute Classification). The attributes of a document are categorized into three classes: (1) identifier attributes, which uniquely identify a single document in the collection, e.g., social security number, phone number, and email address; (2) semi-identifier attributes, which cannot uniquely identify a document individually but can do so collectively, e.g., the combination of name, department name, gender, and age can identify a person uniquely; (3) feature attributes, which express characteristics of an object and carry sensitive information about it.

3 Risk factors of malicious insiders

The risk posed by a potential MI has been identified as the most devastating threat to the cloud ecosystem. Basically, any user or piece of software with authorized access to cloud resources can be considered a potential MI. To prepare an effective defense mechanism, risk analysis must first characterize the significance and scope of each threat posed by MIs; mitigation solutions are then proposed for each major risk. Among the different risk factors of an MI, violations of data confidentiality and integrity are the most crucial security concerns. Evidently, the security mechanisms provided by a CSP are vulnerable to the risks posed by an MI. Moreover, using classic cryptographic schemes nullifies the benefits of cloud computing, and most of them leak critical information about the underlying data; few initiatives address information leakage from encrypted data. The risk factors, the corresponding solutions, and the advantages and disadvantages of the proposed solutions are summarized in Figure 1.



Risk factor                 Mitigation                     Advantage                      Disadvantage
--------------------------  -----------------------------  -----------------------------  ------------------------------
Confidentiality violation   Cryptosystems: AHOM, OPE,      1. Confidentiality             1. Limited operations
                            DET, RND                       2. Full control over data      2. Computation cost
Integrity violation         Query verification             Data authenticity              1. Computation cost
                                                                                          2. Token overhead
Information leakage         1. Query tracking              1. Leakage limitation          1. Data overhead
                            2. Selective insertion of      2. Indistinguishable info.     2. Filter & verification cost
                               disinformation                 and disinfo.

Figure 1: Risk factors of malicious insiders in the cloud DBaaS and the corresponding solutions, with advantages and disadvantages.

The risk vector of an MI is defined and modeled in the following subsection.

3.1 Risk model

Our approach to modeling an MI starts with identifying the components and the risks associated with an MI's actions. Clearly, an MI in the DBaaS infrastructure has easier access and a greater window of opportunity to do far more severe harm than an external attacker: an MI can bypass all internal protection mechanisms and poses serious risks of confidentiality, integrity, and inference violations. The set of possible actions is denoted A = {C, I, Ψ}, respectively. We assume that n collections are stored in the DBaaS data warehouse: W_DBaaS = {D_1, . . . , D_n}.

An MI has read/write access to the data warehouse and to the activity log files, denoted L. Let R represent the key resources in the DBaaS: R = {W_DBaaS, L}. The MI is interested in sensitive information about an entity T, the target. Information about T is stored in a collection D_i in the form of a document, i.e., a set of ⟨key, value⟩ pairs. The MI holds one initial document stored in one collection D_i, and his goal is to obtain the sensitive attributes of T. We analyze the risk factors of cloud DBaaS associated with the actions of an MI.

Confidentiality violation risk

An MI misuses existing privileges to gain further access to sensitive information without leaving any trace of intrusion. Internal attacks are very hard to detect because the users are authenticated on the domain.

Mitigation: Cryptographic schemes should be employed by the data owner to encrypt the data before outsourcing it to the cloud. Meanwhile, processing encrypted data without decryption is an essential requirement that restricts the selection of cryptographic schemes.



An MI can extract an encrypted sensitive attribute A = ⟨E_k(key), E_k(value)⟩ about T stored in collection D_i by iteratively applying the correlation function to the remaining collections, Ψ(D_i, D_j). A single data element can be classified at different levels of importance by different organizations, and an MI in the DBaaS can exploit this difference across the aggregated data pool to violate the confidentiality of sensitive attributes. Two collections D_i and D_j that initially do not appear to share any documents may later turn out to contain documents describing the same entity T. The MI therefore maintains a list of explored collections, and we obtain a binomial distribution with parameters n ∈ N, the number of collections, and p ∈ [0, 1], the success probability for each collection. The problem is one of sampling with replacement: we seek the probability of finding k sensitive documents in the n collections of the cloud warehouse W_DBaaS. The reason for sampling with replacement is that even if Ψ(D_r, D_s) == false, the discovery of a new correlation Ψ(D_s, D_t) == true can yield extracted attributes that change Ψ(D_r, D_s) to true. The discrete probability distribution of finding k documents when exploring n collections is:

f(k; n, p) = C(n, k) p^k (1 − p)^(n−k),   (5)

where C(n, k) denotes the binomial coefficient.

The mean and variance of f(k; n, p) are

μ = n · p,   σ² = n · p (1 − p).   (6)

The probability of extracting exactly one sensitive attribute from the n collections is

f(1; n, p) = n · p · (1 − p)^(n−1).   (7)

The probability of extracting zero documents related to T from the entire W_DBaaS is

f(0; n, p) = (1 − p)^n.   (8)
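Equations 5-8 are straightforward to evaluate; the following sketch (with hypothetical values for the warehouse size n and the per-collection success probability p) computes them:

    from math import comb

    def leakage_pmf(k, n, p):
        """Equation 5: probability of finding k sensitive documents in n collections."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    n, p = 1000, 0.002                        # hypothetical parameters
    print(leakage_pmf(0, n, p))               # Eq. 8: nothing is extracted
    print(leakage_pmf(1, n, p))               # Eq. 7: exactly one attribute leaks
    print(n * p, n * p * (1 - p))             # Eq. 6: mean and variance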

Integrity violation (active attack) risk

An integrity violation can happen in a cloud data store through unauthorized modification of data.

Mitigation: An integrity verification mechanism ensures that data is modified only by authorized, valid users, and identifies integrity violations committed by an MI. We introduce a tamper-resistant query integrity verification algorithm built on Message Authentication Codes (MACs): a new attribute, denoted eTag, is appended to each document and contains a keyed hash value of the whole document. The details of this method are discussed in Subsection 3.2.

Sensitive documents in the cloud DBaaS are at risk of unauthorized modification by an MI, so maintaining the integrity of the stored collections is a crucial necessity for outsourced datasets. Unauthorized modification can be detected with a Message Authentication Code (MAC) over the sensitive documents. Similar to a digital signature, the MAC appends an authentication eTag to each document; its building block is a keyed hash function used to generate and verify the eTag. The data owner generates ⟨eTag, E_k(d_i)⟩ and appends it as a new attribute to every document; the verification process later guarantees the authenticity and integrity of the documents, as presented in Equation 9.

Verification:   eTag′ = MAC_k(document);
if (eTag == eTag′)  ⟹  the document is valid;
else the document is invalid.   (9)
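A minimal sketch of this eTag mechanism (ours; the key, the JSON serialization, and SHA1 as the underlying hash, cf. Section 3.2, are assumptions) follows:

    import hashlib, hmac, json

    KEY = b"data-owner-secret-key"   # held by the data owner/proxy, never by the CSP

    def compute_etag(document: dict) -> str:
        """eTag = MAC_k(document): keyed hash over the serialized document."""
        payload = json.dumps(document, sort_keys=True).encode()
        return hmac.new(KEY, payload, hashlib.sha1).hexdigest()

    def verify(document: dict, etag: str) -> bool:
        """Equation 9: recompute eTag' and compare in constant time."""
        return hmac.compare_digest(etag, compute_etag(document))

    doc = {"name": "Kate Jones", "ssn": 777}
    tag = compute_etag(doc)
    assert verify(doc, tag)
    doc["ssn"] = 123                 # unauthorized modification by an MI
    assert not verify(doc, tag)      # the tampering is detected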

Information leakage risk

Encrypted databases are prone to information leakage; for instance, deterministic and OPE cryptographic schemes always leak the frequency and the order of the plaintext data, respectively. Consequently, with a simple frequency analysis or sorting attack, MIs are able to obtain sensitive information. With a subsequent inference attack, they can combine the leaked data with information from other databases in the same cloud DBaaS, or from the public domain, to compromise the database.

Mitigation: We propose Selective Disinformation Document Padding (SDDP) to hinder information extraction while avoiding the overhead of indiscriminate disinformation padding. SDDP uses diagnostic metrics such as precision, recall, and statistical analytics to periodically probe the dataset for possible explicit and/or implicit correlations between data elements. The detected leakage is then compared to the maximum acceptable level of leakage (determined by the data owner); if the result exceeds this maximum, SDDP generates a certain number of fake documents and inserts them into the dataset to decrease the amount of leaked information. The number of generated disinformation documents depends on the significance of the leakage metrics.

The flexibility of NoSQL databases, where each document may have a different number of attributes, helps us create light-weight documents that mitigate the data overhead on the DBaaS server. All original and disinformation documents are encrypted and are semantically indistinguishable from the MI's point of view. The ciphertext dataset is processed by a standard cloud DBaaS in the same way as plaintext data. In addition, database indexes are used to improve query processing time by shrinking the search space; correspondingly, indexing the encrypted data elements yields a remarkable improvement in query processing time.

The dynamic nature of a dataset as persistent storage necessitates the basic operations create, read, update, and delete (CRUD). The augmented dataset (with disinformation documents) is naturally subject to CRUD operations. Create and read act on the augmented dataset in the same way as on a normal dataset, while update and delete operations on the original documents must be projected onto the corresponding disinformation documents. For performance, the update/delete operations on the related disinformation documents can be processed on demand or postponed to the next period of data analysis. Furthermore, a garbage collector service run by the proxy could delete any unrelated disinformation from the dataset; the design and development of such a service is beyond the scope of the current work.



The probability of one attribute being extracted from n collections is given by Equation 7. To minimize leakage, V disinformation documents are generated and inserted into the collection at each leakage point; the leakage points are detectable with the two methods introduced in this work. In particular, inserting V disinformation documents creates V branches from each document to the other correlated documents. The result is a full tree-like structure with branching factor V + 1, which requires exponential computation from the MI to extract a number of attributes. The number of branches as a function of the height h is:

N = 1 + V + V² + · · · + V^h = (V^(h+1) − 1) / (V − 1).   (10)

Thus, the new probability distribution function is

f(k; n, p) = ∏_{i=0}^{V} C(N, k) p_i^k (1 − p)^(N−k).   (11)
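Equation 10 makes the growth of the attacker's search space easy to quantify; a small sketch:

    def branch_count(V: int, h: int) -> int:
        """Equation 10: N = 1 + V + V^2 + ... + V^h = (V^(h+1) - 1) / (V - 1)."""
        return (V**(h + 1) - 1) // (V - 1)

    for V in (2, 5, 10):                 # disinformation documents per leakage point
        print(V, branch_count(V, h=6))   # the search space grows exponentially in h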

3.2 Query integrity verification

DBaaS can be subject to passive or active attacks by an MI. In a passive attack, an attacker uses all available tools and credentials to gain information about a specific target without changing the data; passive attacks are designed to violate the confidentiality of the targeted entity. In an active attack, the MI exploits the cloud DBaaS to perform unauthorized data modification on the target; in other words, the goal of an active attack is integrity violation.

To address query integrity verification, we propose a method that effectively identifies and filters out invalid documents from the query response at the proxy side. First, the data owner adds a new attribute, denoted eTag, to the original documents. Next, a single token, i.e., a 32-byte secret keyword, is required to identify valid documents. To achieve data integrity and eliminate the risk of unauthorized data modification, we use a document digest computed by a cryptographic hash function. The data owner then encrypts the concatenation of the token and the digest with the RND cryptosystem for each document and stores the output as the value of eTag. Next, V disinformation documents with minimal precision and recall values are created and inserted into the dataset. For instance, for a collection containing 10⁶ documents and V = 10, we need to insert 10 fake documents per original document. Following the semantic security principle, the original and fake documents must be indistinguishable; thus, the same steps for creating the eTag are performed for the fake documents, with a different token for identification.

In particular, each valid document becomes an equivalence class of V documents that have diverse values for the semi-identifier attributes, preventing identity disclosure; because of this feature, the method is denoted V-anonymity. Finally, the attributes of the real and fake documents are encrypted according to the security plan and forwarded to the DBaaS in the cloud. Figure 2 illustrates the process of eTag calculation for the documents.



[Figure 2 depicts the construction: an information document and a disinformation document, each a set of pairs ⟨Key_1, Value_1⟩, . . . , ⟨Key_n, Value_n⟩, are processed identically. A digest is computed as Digest = Hash(Document) and eTag = E_k(Token ‖ Digest); every key-value pair is then encrypted, E_k(Key_i) : E_k(Value_i), yielding encrypted documents carrying their eTag values.]

Figure 2: High-level diagram of the query integrity verification technique.

The proxy decrypts the eTag value with the decryption key, verifies the tags of the original documents in the query response, and filters out the disinformation documents. The MAC protects against any unauthorized modification of documents: with the digital-signature-like mechanism implemented in the eTag, the proxy verifies the authenticity of the whole document by recomputing the document digest and comparing it with the stored digest. The signature adds a small overhead to each document, but it guarantees that the document can never be modified undetected by cloud insiders. Note that the eTag is not involved in query processing; this attribute is used internally to provide the integrity of the data store.

To minimize information leakage, we propose encrypting low-granularity attributes with RND cryptosystems. Equation 12 outlines an RND (non-deterministic) encryption function constructed from a deterministic one: a unique fixed-length random number r is associated with each document. Together with the eTag, this makes the risk of active attacks and unauthorized third-party modification detectable in the proxy's verification phase.

E_k(x) = E′_k(x ‖ r),   (12)

where k is the encryption key, E′ is a DET encryption function, and r is a random number.

Cryptographic hash functions are the building blocks of the introduced query integrity verification algorithm, so an efficient hash function leads to low-latency integrity verification. We examined the performance of four popular hash functions over a range of document sizes; the results are displayed in Figure 3. Considering the performance and security metrics, we selected SHA1 over the other hash functions (MD5, RIPEMD, and SHA256) for the eTag algorithm; as Figure 3 shows, SHA1 was chosen for its high speed across the different input document sizes.

[Figure 3 plots hashing speed (KB/second, up to roughly 9 · 10⁵) against document size (16 to 8192 bytes) for MD5, SHA1, RIPEMD, and SHA256.]

Figure 3: Performance of four popular cryptographic hash functions with respect to document size.
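Measurements like those in Figure 3 are easy to reproduce with the standard library; the sketch below (ours; the sizes and round counts are arbitrary choices) times each hash over random documents:

    import hashlib, os, time

    def speed_kb_per_s(name: str, size: int, rounds: int = 2000) -> float:
        """Throughput of one hash function over random documents of `size` bytes."""
        data = os.urandom(size)
        start = time.perf_counter()
        for _ in range(rounds):
            hashlib.new(name, data).digest()
        return rounds * size / 1024 / (time.perf_counter() - start)

    for size in (16, 64, 256, 1024, 8192):
        print(size, {n: round(speed_kb_per_s(n, size))
                     for n in ("md5", "sha1", "sha256")})
    # 'ripemd160' can be added via hashlib.new where OpenSSL provides it.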

The information leakage due to attribute correlation is investigated in the next section.

4 DBaaS information leakage management

In databases, there are often unknown correlations among heterogeneous data types beyond the defined dependencies, such as primary- or foreign-key relationships between attributes. From the data security point of view, since analysis of these hidden correlations reveals sensitive information, it is desirable to minimize correlation leakage, especially on an untrusted outsourced server. This is a demanding goal, however, because there is no well-known measure for quantifying attribute correlations.

We propose two methods for quantifying attribute correlation. First, an explicit correlation can be formed by exact-match relations between attributes. The proposed quantification function tracks and matches occurrences of exact key and value matches across different databases and associates a non-negative real value with each correlation; the assigned number represents the level of information leaked by pooling any given pair of documents. We adopt the recall and precision concepts to quantify exact-match correlation leakage.

Second, the statistical properties of attributes (even of heterogeneous data types) within documents generate unintended implicit correlations that leak private information. Detecting implicit information dependencies is complicated and requires deep statistical analysis of attribute properties. For instance, the correlation between energy consumption and the area of a building, or the number of its occupants, often leaks valuable information; as another example, there is a correlation between an illness and the corresponding medication prescribed for an individual. The quantification of correlation arising from statistical properties is addressed in our second approach.



Conceptually, the data warehouse used for DBaaS can be considered a new, higher-layer, coarse-grained storage that operates across multiple enterprises. DBaaS hosts a vast number of databases belonging to different enterprises, and each dataset contains a set of documents. In this work, the DBaaS repository is represented by W_DBaaS, a set of databases, and the information leakage caused by attribute correlation is formulated as follows:

W_DBaaS = ⟨D_1, . . . , D_n⟩.

Each database D_i, 1 ≤ i ≤ n, consists of a set of an arbitrary number m of documents:

D_i = {d_1, . . . , d_m}.

Finally, documents are defined as sets of attributes, which form the fine-grained level of the cloud data warehouse. Each document d_j, 1 ≤ j ≤ m, comprises an arbitrary number l of attributes, each of which is a key-value pair ⟨key, value⟩:

d_j = {A_1, . . . , A_l}.

The major types of attribute correlation leakage are discussed in the following subsectionsand the corresponding sub-optimal solutions to minimize the leakage are proposed.

4.1 Explicit correlation

An attribute correlation between two collections (databases) stored in the same cloud data warehouse is a frequent co-occurrence of attribute values across the collections. A correlation function assigns a non-negative real number as a correlation score to a group of datasets. The correlation function, denoted Ψ(D_P, D_Q), extracts the leakage between two databases D_P and D_Q, indicating the amount of information leaked through the existing correlations. Equation 13 formulates the correlation function:

Ψ(D_P, D_Q):   ∀ d_p ∈ D_P ∧ ∀ d_q ∈ D_Q,   if (μ(d_p, d_q) == True)  ⟹  L = d_p Δ d_q,   (13)

where L is the leakage factor and the operator Δ is defined as d_p Δ d_q = (d_p \ d_q) ∪ (d_q \ d_p).

The feasibility function μ determines whether a given pair of documents can be merged, using the attribute classification. The function returns true for a pair of documents that share any number of identifier attributes (one or more); it also returns true for multiple semi-identifier attributes that can collectively identify an entity within the scope of the dataset. Equation 14 outlines this function:

μ(d_p, d_q) = True   iff   ∃ A_r ∈ d_p ∧ ∃ A_s ∈ d_q such that (A_r.key == A_s.key) ∧ (A_r.value == A_s.value), where A_r, A_s are identifier attributes of d_p, d_q;
μ(d_p, d_q) = False   otherwise.   (14)
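A sketch of Equations 13-14 over dictionary-based documents (ours; the set of identifier keys is an assumption, cf. Definition 3, and only the identifier branch of μ is shown):

    IDENTIFIERS = {"ssn", "phone", "account", "email"}   # assumed identifier keys

    def feasible(dp: dict, dq: dict) -> bool:
        """mu(dp, dq): True iff the documents share an identifier key-value pair."""
        return any(k in dq and dp[k] == dq[k] for k in dp if k in IDENTIFIERS)

    def leakage(dp: dict, dq: dict) -> dict:
        """dp DELTA dq = (dp \ dq) union (dq \ dp): attributes exposed by merging."""
        return dict(set(dp.items()) ^ set(dq.items()))

    d1 = {"zip": 456, "address": "2512 Uni. NY", "phone": 111}
    d4 = {"name": "Mike Smith", "income": "70k", "ssn": 123, "phone": 111}
    if feasible(d1, d4):          # shared identifier: phone == 111
        print(leakage(d1, d4))    # the merged documents leak these attributes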



The leakage factor L represents the volume of information leaked from two arbitrary documents d_p, d_q due to the existence of one or more identifiers in both documents; note that these documents may come from one collection or from two different collections. A fundamental requirement for managing the amount of information leakage is a measurement metric. The first metric we introduce is an improved version of the metric of Whang et al. [15]: the well-known precision and recall concepts are adopted from information retrieval to measure information leakage. Precision is defined as the ratio of the number of attributes in a document d to the number of attributes in the reference document. In Whang et al.'s method, an equal weight is assigned to all attributes; we improve this model by varying the weight of each attribute according to its type.

For accurate quantification of information leakage, we define Ω as a quantitative metric representing the value of the information stored in a document d. A weight 0 ≤ ω_i ≤ 1 is associated with each attribute A_i = ⟨key, value⟩ in the document, and the metric is the summation Ω = Σ_{i=1}^{n} ω_i. A higher ω value reflects a more important attribute, and for any document d the value Ω_d measures the value of all the information in the document. The highest possible weight, ω = 1, is assigned to identifier attributes; each member of a group of m semi-identifier attributes that collectively identify an entity is assigned ω = 1/m; and a feature attribute receives a very small ω_i value. Equation 15 gives the weight calculation for any given document:

Ω_d = α × n + β × m + γ × p,   such that α > β > γ ≥ 0,   (15)

where n is the number of identifier attributes with weight α, m is the number of semi-identifier attributes with weight β, and p is the number of feature attributes with weight γ.

As a result of the dynamic schema employed by the NoSQL data model, documents related to the same object are allowed to have different numbers of attributes. However, a full list of attributes is necessary to create a reference document. Therefore, we define a logical operator δ, the Super Document, that aggregates all attributes related to an entity within the scope of a collection to create the reference document required for precision and recall. The cloud insider clearly has a wide scope of accessible data for the δ function; while a fully comprehensive super document is impractical, δ can be computed over any subset of the selected databases (one or more). To construct the super document of an entity from a set of collections, the super document is first initialized with a document d_i that describes the entity; the other documents d_j are then scanned to extract any undetected attributes of the entity, L(δ_i, d_j). Equation 16 expresses the notion of the super document, and its construction from multiple data collections is described in Algorithm 1 (see Appendix A).

\[
\delta_i = d_i; \qquad \forall d \in D:\ \delta_i = \delta_i \cup L(\delta_i, d)
\tag{16}
\]


Measuring information leakage at the document level with the δ function (super document) and the ω_i values reduces to a straightforward computation of precision and recall. The process of computing the δ function across multiple collections hosted by the same cloud DBaaS is outlined in Algorithm 1. Whenever a new attribute is extracted it is appended to the super document, and a back track is then needed to re-check paths that were already closed. The search space grows exponentially with the number of attributes.

Example 1. Consider five documents from different databases selected from the DBaaS warehouse, belonging to two individuals. The goal is to extract leaked information using attribute correlation. This case exemplifies Algorithm 1 presented in Appendix A; Table 1 lists the attributes for this example with hypothetical weights.

Table 1: Weights of attributes

Attribute   Type             ω
Zip         Semi-identifier  0.3
Address     Semi-identifier  0.5
Phone       Identifier       1.0
Account     Identifier       1.0
Name        Semi-identifier  0.6
Age         Feature          0.1
Income      Feature          0.1
SSN         Identifier       1.0
Email       Identifier       1.0

d1 = {zip: 456, address: “2512 Uni. NY”, phone: 111}
d2 = {ssn: 123, age: 33, account: 222}
d3 = {name: “Kate Jones”, age: 30, address: “abc”, email: “[email protected]”}
d4 = {name: “Mike Smith”, income: 70k, ssn: 123, phone: 111}
d5 = {name: “Kate Jones”, email: “[email protected]”, ssn: 777}

In the above example, δ_{Mike Smith} and δ_{Kate Jones} are formed by applying Algorithm 1. The information extractable through correlation is as follows:

µ(d1, d2) = FALSE    δ = {d1}          L = ∅
µ(d1, d3) = FALSE    δ = {d1}          L = ∅
µ(d1, d4) = TRUE     δ = {d1, d4}      L = {name, income, ssn}

Back track:

µ(δ, d2) = TRUE      δ = {d1, d4, d2}  L = {name, income, ssn, age, account}
µ(δ, d3) = FALSE     δ = {d1, d4, d2}  L = {name, income, ssn, age, account}
µ(δ, d5) = FALSE     δ = {d1, d4, d2}  L = {name, income, ssn, age, account}

δ_{Mike Smith} = {zip, address, phone, name, income, ssn, age, account}


Similarly, the super document for “Kate Jones” is δ_{Kate Jones} = {name, age, address, email, ssn}.

The recall values for these documents can be calculated: R_{d1} = 1.7/5.4 ≈ 0.31, R_{d2} = 2.1/5.4 ≈ 0.39, R_{d3} = 2.2/3.2 ≈ 0.69, R_{d4} = 2.6/5.4 ≈ 0.48, and R_{d5} = 2.5/3.2 ≈ 0.78.

In the example above, the disinformation documents (ρ1, ..., ρ6) with low precision are created as below. After inserting them into the original collections, the cloud insider sees two values for each sensitive attribute; therefore, the real value of an attribute cannot be extracted with high confidence.

ρ1 = {zip: 654, address: “1500 Place AZ”, income: 60k, ssn: 321, phone: 111}
ρ2 = {ssn: 321, age: 43, address: “abc AZ”, phone: 876}
ρ3 = {ssn: 321, account: 444}

Similarly, for “Kate Jones” we have:

ρ4 = {age: 20, address: “efd”, email: “[email protected]”}
ρ5 = {name: “Claire Shepard”, ssn: 543, email: “[email protected]”}
ρ6 = {ssn: 543, email: “[email protected]”}

P_{ρ1} = 1/2.9 ≈ 0.34; P_{ρ2} = 0/2.6 = 0; P_{ρ3} = 0/2 = 0; P_{ρ4} = 1/1.6 ≈ 0.62; P_{ρ5} = 1/2.5 = 0.4; P_{ρ6} = 0/2 = 0.

In short, lower recall and precision values are more desirable, since they reflect more uncertainty and consequently less information leakage. Precision and recall are suitable metrics for quantifying exact-match information leakage at the document level across the collections. Other probabilistic means, however, are required to measure the information leakage due to statistical properties of attributes in very large databases. Under those circumstances, we introduce our second method, which measures leaked information due to statistical correlations.
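A minimal sketch of the two weighted metrics follows, reusing the hypothetical omega() helper sketched earlier; matching an attribute against the reference super document by exact key-value equality is our simplifying assumption.

def recall(d: dict, super_doc: dict) -> float:
    """Weighted share of the super document's information revealed by d."""
    matched = {k: v for k, v in d.items() if super_doc.get(k) == v}
    return omega(matched) / omega(super_doc)

def precision(d: dict, super_doc: dict) -> float:
    """Weighted share of d's attributes that agree with the truth; a good
    disinformation document keeps this value low."""
    matched = {k: v for k, v in d.items() if super_doc.get(k) == v}
    return omega(matched) / omega(d)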

Query latency as a function of database size

Insertion of disinformation documents increases the size of the database, which can negatively affect query execution time. To quantify query latency in a cloud DBaaS, an iterative method is employed to evaluate the latency of several simple queries on databases containing specific numbers of documents. In this way, we focus on a single variable: the size of the database. The benchmark initially removes all documents from all databases and repopulates them to the required dataset size. Subsequently, two different tests are performed, with and without an index. To eliminate cache effects in the tests, query caching is disabled. This process is repeated for all the specified database sizes, and the measurements for the benchmark with sample queries are displayed in Figure 4. Five major query classes, including equality check, comparison, logic, range, and aggregate, are considered for the query processing benchmark shown in Figure 4.

One of the biggest rewards of our approach, which comes from using an unmodified standard database server, is that all database technology features, such as indexing, remain available. Indexing allows more sophisticated searches over the data, such as binary search, which drastically reduces the search cost from O(n) to O(log n) and consequently yields a remarkable improvement in performance. Figure 4b presents the improvement in execution time for the same queries. The chart for a simple query on the non-indexed databases demonstrates that query latency steadily increases as the database grows. In contrast, query processing time on the indexed databases remains steady and shows no significant variation with increasing database size. The indexed attributes guarantee an insignificant change in query processing time, especially for the encrypted databases, which have an augmented size in comparison with the plaintext non-indexed databases.

Indexing over data elements offers fast query processing in the database. This experiment is designed to examine indexing over 10 encrypted databases diluted with disinformation documents. To do so, we measured the database response time for the queries when the data elements are indexed, and then measured query processing time on the indexed encrypted databases. The measurement process in both cases was automated and run under the control of a script that collected the processing times. The results are discussed in the next subsection.
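The following pymongo sketch shows the shape of one benchmark iteration; the connection string, database, collection, and field names are placeholders, and populating the collection to the desired size is assumed to happen elsewhere.

import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
coll = client["benchmark"]["documents"]

def query_latency_ms(query: dict) -> float:
    """Wall-clock latency of a single query, in milliseconds."""
    start = time.perf_counter()
    list(coll.find(query))                    # exhaust the cursor
    return (time.perf_counter() - start) * 1000

latency_scan = query_latency_ms({"ssn": "123"})      # O(n) collection scan
coll.create_index("ssn")                             # B-tree index
latency_indexed = query_latency_ms({"ssn": "123"})   # O(log n) lookup
print(latency_scan, latency_indexed)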

Figure 4: The performance analysis of query processing as a function of database size (number of documents, 10^6 to 10^7; latency in ms) for the equality, comparison, logic, range, and aggregate query classes, over (a) databases without indexes and (b) databases with indexes.


4.2 Implicit correlation

In a database, there may be hidden mutual dependencies between two data elements, meaning that observation of one data item can allow inferring meaningful information about the other. Such a mutual dependency leaks sensitive information about data that was supposed to be confidential. This leakage is even riskier when the data is processed on an outsourced platform such as a cloud DBaaS. We measure the information leakage caused by mutual information. To facilitate the discussion, an example of statistical correlation between attributes is given below.

Example 2. An online stock exchange website stores share information in the cloud DBaaS. A simple analysis indicates a correlation between the number of buyers, the quantity of buy orders, and the quantity of sell orders.⁴ In this scenario the price trend is derived from the statistical properties of two different attributes. Relational database systems benefit from explicit data field correlations as an effective feature for database normalization and for improving query efficiency. However, identifying implicitly and semantically correlated subsets of attributes with different data types is challenging. We use information-theoretic methods to quantify implicit correlations.

Assume Π is the set of relations in the collections of the cloud warehouse, and A the set of all attributes in Π that are involved in any type of relation with each other. Each attribute A ∈ A can be represented by a random variable with a distribution function p(A). Consider α = {A_i}_{i=1}^{m} as a subset of Π that contains more than two attributes, potentially of a variety of data types; p(A_1, ..., A_m) is the joint probability function of all attributes in α. The proposed correlation function ψ associates any such subset of attributes with a non-negative leakage score.

Mutual information measures the amount of information that can be obtained about attribute X by observing attribute Y. It ranges from 0 (if the two random variables are statistically independent) to H(X) (if the random variables are fully dependent). Equation 17 quantifies the dependency between two attributes belonging to any two selected databases, represented by the random variables X and Y; I(X;Y) is the quantified amount of leakage.

\[
I(X;Y) = I(Y;X) = H(X) - H(X|Y)
\tag{17}
\]

In this expression, H(X) is the marginal entropy of X and H(X|Y) is the conditional entropy of X given Y. The time complexity of Algorithm 2 (see Appendix A) depends on the number of documents in the database: for a pair of attributes with n instances it is O(n²). The statistical analysis can then be invoked periodically during the lifetime of a dynamic dataset to evaluate the leakage value. Evidently, there is a maximum tolerable leakage value, which serves as the threshold for initiating disinformation padding. This technique improves on constant disinformation padding, which introduces a huge overhead for the dataset. In other words, instead of maintaining a huge number of disinformation documents, we selectively insert fake documents into the dataset when the leakage value exceeds the threshold. The proposed technique is presented in Algorithm 3 in Appendix A; its time complexity is $O(\binom{n}{2})$, which simplifies to O(n²).

⁴ If there are more buyers than sellers, it signals a price increase; on the other hand, more sellers and high volume indicate a price drop.
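For concreteness, the plug-in estimator below sketches what Algorithm 2 computes: the discrete mutual information, in bits, of two attribute columns xs and ys of equal length. It is a simplification of the algorithm's index arithmetic, not a transcription of it.

from collections import Counter
from math import log2

def mutual_information(xs: list, ys: list) -> float:
    """Estimate I(X;Y) in bits from paired observations (Equation 17)."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
# Yields 0 for independent attributes, up to H(X) when Y fully determines X.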

5 Approximate query processing

On-Line Analytical Processing (OLAP) processes large volumes of data to produce information about the operations of enterprises. Such applications extract data from massive datasets, but the long response time required to search an entire database restricts the usefulness of data analytics. Approximate Query Processing (AQP) is a well-known sampling technique for providing approximate responses to aggregate queries. An aggregate query calls aggregate functions⁵ to return a single computed result with significant meaning from the values of an attribute over a group of documents. Along with the approximate responses, an AQP system supplies confidence intervals indicating the percentage of uncertainty in the approximated answers.

A very large dataset requires a sensitivity analysis to effectively protect its documents from information leakage. One method to protect the documents is disinformation: replicating documents that contain sensitive information, with the sensitive fields altered. A sensitivity analysis determines the number of disinformation documents required for each sensitivity level. Aggregate queries for sensitivity analysis demand examining all the documents in the database; this aggregation is unreasonably slow and infeasible for real-time OLAP applications. Instead of a prolonged full aggregation, we propose using AQP to facilitate fast sensitivity analysis over samples, with inaccuracy within acceptable ranges. Observations in the samples are randomly selected from the original database, and queries are directed against these small samples in parallel.

5.1 Approximate query processing overview

Random sampling for approximate measurement is a well-known solution that dramatically cuts analysis time, especially if the sample is small enough to fit in the main memory of the system. However, measurements based on random sampling only approximate the real database measurements. Sampling-based Approximate Query Processing (S-AQP) with guaranteed accuracy provides bounds on the error caused by sampling [22]. Periodically assessing the information leakage from large datasets tolerates a certain degree of inaccuracy, so AQP is used for sensitivity analysis and for fast leakage parameter extraction with minimal inaccuracy while limiting the query response time.

⁵ Some common aggregate functions are Average, Count, Max, and Min.


5.2 Sampling methods

In the sampling phase, observations can be selected with or without replacement. In sampling with replacement, any two draws are independent, so their covariance is zero. In sampling without replacement (disjoint samples), the draws are dependent: the observations in the first draw affect what can be obtained in the second, and their covariance is not zero. Sampling without replacement complicates the computations, but it leads to a slightly more accurate estimation. The resampling process can be extended to multiple levels to satisfy user constraints such as latency or accuracy; sampling is what allows AQP to meet latency requirements. If a user rejects the approximate results of an AQP run, the next step is to use larger samples and obtain results with lower errors. Figure 5 presents the sampling method implemented for the experiments reported in this paper.

Figure 5: Execution of an aggregate query θ on multiple random samples drawn from database D, each of which is resampled into smaller samples. The levels of resampling can be extended to meet the user's expectations in terms of acceptable error rate and latency.
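A small Python sketch of the two sampling modes and of averaging an aggregate θ over many samples; the sample counts and sizes mirror the experiment below, and θ is any user-supplied aggregate function.

import random

def draw_samples(docs, num_samples=100, size=1000, replacement=False):
    """Yield num_samples random samples of `size` documents each."""
    for _ in range(num_samples):
        if replacement:
            yield random.choices(docs, k=size)   # independent draws
        else:
            yield random.sample(docs, k=size)    # no duplicates per sample

def approximate(theta, docs, **kwargs):
    """AQP estimate: average theta over the random samples."""
    results = [theta(s) for s in draw_samples(docs, **kwargs)]
    return sum(results) / len(results)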

We create four sets of random samples from the original database, each set consisting of 100 random samples with 10^2, 10^3, 10^4, and 10^5 documents per sample. The samples are selected both with and without replacement. Figure 6 shows the error percentile for the different sample sizes under the two sampling modes. The measurement results show that samples without replacement yield slightly more accurate results than samples with replacement. For instance, the average error percentage is 0.22% for the largest samples of 10^5 documents, whereas the error is 5.08% for the smallest sample size of 100 documents.


Figure 6: Error percentage with respect to sample size (10^2, 10^3, 10^4, 10^5 documents per sample). Samples without replacement exhibit slightly more accurate results than samples with replacement.

5.3 Error bounds

A key element of any AQP system is providing error bounds for the approximate results, allowing the user to decide whether the results are acceptable. Confidence intervals (CI), also known as error bars, are ranges of values centered at a known sample mean. We use a closed-form Central Limit Theorem (CLT) argument and two large-deviation inequalities, the Markov and Chebyshev inequalities, to obtain the tightest bounds. The Markov inequality yields a better bound than the trivial bound of 1.0, and additional information about the variance improves the bound given by the Chebyshev inequality. As the number of elements in the sample goes to infinity, the distribution converges to the standard normal distribution N(0, 1). The closed-form CLT approach is shown in Equation 18.

\[
\frac{\sum_{i=1}^{n} X_i - E\!\left[\sum_{i=1}^{n} X_i\right]}{\sqrt{\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right)}}
\;\xrightarrow{\;n \to \infty\;}\; N(0, 1)
\tag{18}
\]

The tightness of the bounds resulting from the three aforementioned approaches is illustrated in Figure 7. Comparing them, the Markov inequality provides larger (looser) deviation bounds than the Chebyshev inequality, and the closed-form CLT provides the tightest bound of the three.

Figure 7: General comparison of the tightness of the bounds resulting from the Markov and Chebyshev inequalities and the closed-form CLT.
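The sketch below makes the comparison concrete for the deviation of the sample mean of n i.i.d. observations; it assumes the mean mu, the variance sigma2, and the tolerance t are known, whereas in practice they are themselves estimated.

from math import sqrt, erf

def markov_bound(mu: float, a: float) -> float:
    """P(X >= a) <= mu/a for a nonnegative variable (uses the mean only)."""
    return min(1.0, mu / a)

def chebyshev_bound(sigma2: float, n: int, t: float) -> float:
    """P(|sample mean - mu| >= t) <= sigma^2 / (n t^2) (uses the variance)."""
    return min(1.0, sigma2 / (n * t * t))

def clt_bound(sigma2: float, n: int, t: float) -> float:
    """Two-sided normal tail 2(1 - Phi(t sqrt(n)/sigma)) from Equation 18."""
    z = t * sqrt(n) / sqrt(sigma2)
    phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal CDF
    return 2.0 * (1.0 - phi)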


5.4 AQP sensitivity analysis

The goal of the sensitivity analysis here is to identify the different document classes (e.g., Top Secret, Confidential), together with their counts and percentages. The result of the sensitivity analysis is used to determine the number of disinformation documents to be added to the database for each level of document sensitivity.

In the experiment we used a database of ten million documents. Table 2 shows the eightsensitivity levels and the number and percentage of documents in each sensitivity class.

Table 2: Document classification

Class          Count      Percentage
Top Secret      782471      7.823%
Secret         1475118     14.751%
Information    3134844     31.348%
Official       1475603     14.756%
Unclassified    783443      7.834%
Clearance       783024      7.830%
Confidential    782698      7.826%
Restricted      782799      7.828%

Total         10000000    100.00%

We use an aggregate query to compute the count and percentage of each class of documents in the collection. Let θ be an aggregate query to be computed over the dataset described in Table 2; for instance, the aggregate query θ in Figure 8 returns the count and percentage of each class of documents based on their security level.

db[collection].aggregate([
  { "$group": { "_id": { "clearance": "$clearance" },
                "count": { "$sum": 1 } } },
  { "$project": { "count": 1,
                  "percentage": { "$concat": [
                      { "$substr": [
                          { "$multiply": [
                              { "$divide": [ "$count", { "$literal": sample_size } ] },
                              100 ] },
                          0, 6 ] },
                      "", "%" ] } } }
]);

Figure 8: Aggregate query for the sensitivity analysis of a collection; sample_size stands for the number of documents in the queried (sample) collection. This query is executed on the original database and on the 100 sample databases.
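Run through pymongo, the same grouping can be executed over each sample collection, with the percentage computed client-side; the "samples" database and "s1" collection names are illustrative placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
sample = client["samples"]["s1"]
sample_size = sample.count_documents({})      # plays the role of $literal above

pipeline = [{"$group": {"_id": "$clearance", "count": {"$sum": 1}}}]
for row in sample.aggregate(pipeline):
    pct = 100.0 * row["count"] / sample_size
    print(f'{row["_id"]}: {row["count"]} ({pct:.3f}%)')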

5.5 Experimental setup

A physical machine with an Intel(R) Core(TM) i7-4510U CPU running Linux kernel version 4.4.0-59-generic was used for the experiments. MongoDB version 3.2.7 served as the unmodified NoSQL database server. The OPE and DHOM cryptosystems were implemented locally, and the other crypto modules came from OpenSSL version 1.0.2g. The measured query latency is the interval between the time the server receives a query and the time it starts to forward the query result. For accurate measurement of query latency, query caching and pre-fetching were disabled, because most database servers keep the most recently used data in main memory, so subsequent matching queries would otherwise be served from memory. Furthermore, MongoDB supports a variety of storage engines designed and optimized for various workloads; the storage engine is responsible for data storage both in memory and on disk. We chose the WiredTiger storage engine, which is well suited for most workloads.

The results show that the average speedup due to AQP is better than linear: a remarkable improvement in processing time and speedup is achieved, and the price paid is an acceptable level of inaccuracy. Both metrics are plotted against sample size in Figure 9. For the largest samples of 10^5 documents the average processing time is 93.3 ms, whereas for the original database (10^7 documents) the processing time of the same query is 1,400 ms, a drastic improvement; the speedup of 150 for this sample versus the original database is remarkable.

Figure 9: The performance analysis of AQP: (a) processing time of the aggregate query over samples of different sizes (0.45, 1.8, 10.8, and 93 ms for samples of 10^2 to 10^5 documents, versus 1,420 ms on the full 10^7-document database); (b) the speedup obtained by using sampling and parallelization in query processing, ranging from 150 for the 10^5-document samples to 31,000 for the 10^2-document samples. In this case 100 random samples with 10^2 to 10^5 documents are examined.

5.6 Discussion

There are two major approaches to managing information leakage. The first approach is built upon the dependency graph of sensitive attributes scattered across the multiple collections stored in the cloud warehouse. The query sequence is mapped onto the dependency graph and, according to the defined policy, at some point further queries must be blocked. More specifically, this method keeps track of queries, and when the accumulated leaked information reaches the maximum level of disclosure, the proxy prevents any further queries from being executed. This method is vulnerable to collaborative attacks, in particular when multiple users pass partial results to the next user to obtain the maximum leakage. Evidently, this method is also not effective against mutual information (MI) leakage [23].

The second approach makes true information extraction very difficult by inserting disinformation documents into the dataset, owing to the variety of fake correlations they create. One major drawback of this method is the increase in query processing time caused by the growing database size [15].

All the benefits of database technology carry over to our solution, since a standard database server processes the encrypted, diluted databases; indexing is one such feature and significantly improves query execution time. Detecting correlated attributes in very large databases demands enormous computational power, which makes this analysis impractical in most cases. However, the S-AQP method, with an acceptable level of error, can help calculate the mutual information of attributes, especially when the chosen sample is small enough to fit in main memory.

6 Conclusions and future work

Typical cloud database services ensure advanced availability and scalability, but data confidentiality and integrity are still areas to be explored further. The problem of secure processing of outsourced datasets with limited leakage was investigated in this work. The sensitive data in a single dataset can be protected using cryptosystems; however, the cloud DBaaS platform is a pool of thousands of datasets, and this aggregation introduces a new source of information leakage. The primary goal of this work is that the untrusted DBaaS should learn minimal information from the accumulation of data belonging to a group of users. The risks associated with an untrusted cloud DBaaS were investigated and a mitigation solution proposed.

For leakage prevention, the insertion of disinformation documents was studied as a way to break the links among the documents of a group of datasets. At the same time, insertion of fake documents introduces two challenges. First, the data overhead grows, causing system performance degradation. Second, the communication cost between the server and the proxy increases, due to the transport of the fake documents.

We introduced two methods to mitigate the negative effects of the data overhead. First, we presented quantitative leakage methods to identify the leakage points in datasets, so that selective disinformation document insertion can minimize leakage at just those points instead of flat document insertion. Second, to accelerate sensitivity analysis, a novel method based on approximate query processing (AQP) was proposed; AQP allows the sensitivity analysis to scale to any database size at interactive speed.

User applications expect to receive valid and accurate information in response to the issued queries, not fake information. Most NoSQL databases perform differently when processing the same query over different database sizes. Our experimental benchmarks demonstrate no significant variation in performance under a linear increase in the size of the database; the small performance penalty is negligible. This can be explained by the multilevel indexing utilized by NoSQL databases to provide fast access times and short latencies for query processing over larger databases. To overcome the second challenge, we propose and analyze an efficient algorithm based on signature schemes to filter out the noisy documents.

Acknowledgment

The authors wish to express their gratitude to an anonymous reviewer for constructive comments. We would also like to extend our gratitude to Victor Shoup of New York University for the NTL C++ library for manipulating arbitrary-length integers.


Appendix A Leakage quantification algorithms

Algorithm 1 Construction of a super document for an entity from multiple data collections

Require: S ⊆ W_DBaaS, ε        ▷ S is a set of collections; ε is an entity with document d_i
 1: procedure SuperDocument(S, ε)
 2:     L_ε ← ∅, δ_ε ← d_i
 3:     for each collection D_j ∈ S do
 4:         for each document d_k ∈ D_j do
 5:             if µ(δ_ε, d_k) = TRUE then
 6:                 L_ε ← L_ε ∪ {new attributes}
 7:                 update δ_ε
 8:                 back track
 9:             else
10:                 continue
11:             end if
12:         end for
13:     end for
14:     return δ_ε        ▷ Super document of entity ε
15: end procedure

Algorithm 2 Quantifying the mutual information of attributes from two data sources

Require: S, T: source and target data vectors; n: length of the non-empty vectors, n > 1
 1: procedure MutualInformation(S, T, n)
 2:     sProbs ← calculateProbability(S, n)        ▷ Probability vector of S
 3:     tProbs ← calculateProbability(T, n)        ▷ Probability vector of T
 4:     mutualInformation ← 0.0
 5:     jointProbVector ← calculateJointProbability(sProbs, tProbs, n)
 6:     for i ← 0 to n² − 1 do
 7:         firstIndex ← i mod n
 8:         secondIndex ← ⌊i / n⌋
 9:         if jointProbVector[i] > 0 and sProbs[firstIndex] > 0 and tProbs[secondIndex] > 0 then
10:             mutualInformation += jointProbVector[i] × log(jointProbVector[i] / (sProbs[firstIndex] × tProbs[secondIndex]))
11:         end if
12:     end for
13:     mutualInformation ← mutualInformation / log(2.0)        ▷ Convert to bits
14:     return mutualInformation
15: end procedure


Algorithm 3 Selective disinformation insertion algorithm

Require: D        ▷ Original dataset
Require: V        ▷ Rate of disinformation to information
Require: T        ▷ Acceptable level of leakage
Output: D        ▷ Dataset diluted with disinformation documents
 1: procedure DiluteDatabase(D, V, T)
 2:     Ψ ← evaluateLeakage(D)        ▷ Statistical analysis
 3:     for each pair (d_i, d_j) in D do
 4:         if Ψ.L(d_i, d_j) > T then
 5:             τ ← V
 6:             while τ > 0 do
 7:                 ρ ← createDisinformation(d_i, d_j)        ▷ New disinformation document ρ
 8:                 insertToDatabase(D, ρ)
 9:                 τ ← τ − 1
10:             end while
11:         end if
12:     end for
13:     return D        ▷ Diluted dataset
14: end procedure

References

[1] Amazon Web Services growth unrelenting (last accessed 3 May 2016). [Online]. Available: http://news.netcraft.com/archives/2013/05/20/amazon-web-services-growth-unrelenting.html

[2] N. E. Weiss and R. S. Miller, “The Target and other financial data breaches: Frequently asked questions,” Congressional Research Service, prepared for Members and Committees of Congress, 2015.

[3] J. Silver-Greenberg, M. Goldstein, and N. Perlroth, “JPMorgan Chase hack affects 76 million households,” New York Times, 2014.

[4] M. Ahmadian, F. Plochan, Z. Roessler, and D. C. Marinescu, “SecureNoSQL: An approach for secure search of encrypted NoSQL databases in the public cloud,” International Journal of Information Management, vol. 37, no. 2, pp. 63–74, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0268401216302262

[5] R. A. Popa, C. Redfield, N. Zeldovich, and H. Balakrishnan, “Cryptdb: protectingconfidentiality with encrypted query processing,” in Proceedings of the Twenty-ThirdACM Symposium on Operating Systems Principles. ACM, 2011, pp. 85–100.


[6] M. Ahmadian, “Secure query processing in cloud NoSQL,” in IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, USA, Jan. 2017.

[7] C. Gentry, “Computing arbitrary functions of encrypted data,” Communications of theACM, vol. 53, no. 3, pp. 97–105, 2010.

[8] M. Ahmadian, A. Paya, and D. Marinescu, “Security of applications involving multiple organizations and order preserving encryption in hybrid cloud environments,” IEEE International Conf. on Parallel Distributed Processing Symposium Workshops (IPDPSW), pp. 894–903, May 2014.

[9] A. Boldyreva, N. Chenette, Y. Lee, and A. O'Neill, “Order-preserving symmetric encryption,” Advances in Cryptology-EUROCRYPT, pp. 224–241, 2009.

[10] C. Liu, L. Zhu, M. Wang, and Y.-a. Tan, “Search pattern leakage in searchable encryption: Attacks and new construction,” Information Sciences, vol. 265, pp. 176–188, 2014.

[11] R. Ostrovsky, “Efficient computation on oblivious rams,” in Proceedings of the twenty-second annual ACM symposium on Theory of computing. ACM, 1990, pp. 514–523.

[12] O. Goldreich and R. Ostrovsky, “Software protection and simulation on oblivious rams,”Journal of the ACM (JACM), vol. 43, no. 3, pp. 431–473, 1996.

[13] M. Naveed, S. Kamara, and C. V. Wright, “Inference attacks on property-preserving encrypted databases,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015, pp. 644–655.

[14] N. Li, T. Li, and S. Venkatasubramanian, “t-closeness: Privacy beyond k-anonymityand l-diversity,” in 2007 IEEE 23rd International Conference on Data Engineering,April 2007, pp. 106–115.

[15] S. E. Whang and H. Garcia-Molina, “Managing information leakage,” 2010.

[16] S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. Jordan, S. Madden, B. Mozafari, and I. Stoica, “Knowing when you're wrong: Building fast and reliable approximate query processing systems,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014, pp. 481–492.

[17] K. Chatzikokolakis, T. Chothia, and A. Guha, “Calculating probabilistic anonymityfrom sampled data,” Manuscript, vol. 200, no. 9, 2009.

[18] C. Mavroforakis, N. Chenette, A. O'Neill, G. Kollios, and R. Canetti, “Modular order-preserving encryption, revisited,” Proc. of the 2015 ACM SIGMOD International Conf. on Management of Data, pp. 763–777, 2015. [Online]. Available: http://doi.acm.org/10.1145/2723372.2749455


[19] C. Gentry et al., “Fully homomorphic encryption using ideal lattices,” in STOC, vol. 9, 2009, pp. 169–178.

[20] Z. Brakerski and V. Vaikuntanathan, “Efficient fully homomorphic encryption from(standard) lwe,” SIAM Journal on Computing, vol. 43, no. 2, pp. 831–871, 2014.

[21] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in Advances in Cryptology - EUROCRYPT '99. Springer, 1999, pp. 223–238.

[22] S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. Jordan, S. Madden, B. Mozafari, and I. Stoica, “Knowing when you're wrong: Building fast and reliable approximate query processing systems,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '14. New York, NY, USA: ACM, 2014, pp. 481–492. [Online]. Available: http://doi.acm.org/10.1145/2588555.2593667

[23] N. Li, T. Li, and S. Venkatasubramanian, “t-closeness: Privacy beyond k-anonymity and l-diversity,” in Proc. IEEE 23rd International Conference on Data Engineering (ICDE 2007). IEEE, 2007, pp. 106–115.
