
    Privacy Preserving Mining of Association Rules

    Alexandre Evfimievski

    Ramakrishnan Srikant Rakesh Agrawal Johannes Gehrke*

    IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA

    ABSTRACT

    We present a framework for mining association rules from transactions consisting of categorical items where the data has been randomized to preserve privacy of individual transactions. While it is feasible to recover association rules and preserve privacy using a straightforward "uniform" randomization, the discovered rules can unfortunately be exploited to find privacy breaches. We analyze the nature of privacy breaches and propose a class of randomization operators that are much more effective than uniform randomization in limiting the breaches. We derive formulae for an unbiased support estimator and its variance, which allow us to recover itemset supports from randomized datasets, and show how to incorporate these formulae into mining algorithms. Finally, we present experimental results that validate the algorithm by applying it on real datasets.

    1. INTRODUCTION

    The explosive progress in networking, storage, and processor technologies is resulting in an unprecedented amount of digitization of information. It is estimated that the amount of information in the world is doubling every 20 months [20]. In concert with this dramatic and escalating increase in digital data, concerns about privacy of personal information have emerged globally [15] [17] [20] [24]. Privacy issues are further exacerbated now that the internet makes it easy for new data to be automatically collected and added to databases [10] [13] [14] [27] [28] [29]. The concerns over massive collection of data are naturally extending to analytic tools applied to data. Data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, is particularly vulnerable to misuse [11] [16] [20] [23].

    * Department of Computer Science, Cornell University, Ithaca, NY 14853, USA

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGKDD '02, Edmonton, Alberta, Canada. Copyright 2002 ACM 1-58113-567-X/02/0007 ...$5.00.

    An interesting new direction for data mining research is the development of techniques that incorporate privacy concerns [3]. The following question was raised in [7]: since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? Specifically, they studied the technical feasibility of building accurate classification models using training data in which the sensitive numeric values in a user's record have been randomized so that the true values cannot be estimated with sufficient precision. Randomization is done using the statistical method of value distortion [12] that returns a value x_i + r instead of x_i, where r is a random value drawn from some distribution. They proposed a Bayesian procedure for correcting perturbed distributions and presented three algorithms for building accurate decision trees [9] [21] that rely on reconstructed distributions.¹ In [2], the authors derived an Expectation Maximization (EM) algorithm for reconstructing distributions and proved that the EM algorithm converged to the maximum likelihood estimate of the original distribution based on the perturbed data. They also pointed out that the EM algorithm was in fact identical to the Bayesian reconstruction procedure in [7], except for an approximation (partitioning values into intervals) that was made by the latter.
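    To make the value-distortion idea concrete, here is a minimal sketch, assuming uniform noise on [-alpha, alpha]. The function name `distort`, the noise width, and the sample values are illustrative assumptions and are not taken from [7] or [12]. It shows only that zero-mean noise hides individual values while leaving an aggregate such as the mean approximately intact; it does not implement the Bayesian or EM reconstruction procedures.

```python
import random


def distort(values, alpha, rng):
    """Return x_i + r for each x_i, with r drawn uniformly from [-alpha, alpha].

    The uniform noise distribution is an assumption made for this sketch.
    """
    return [x + rng.uniform(-alpha, alpha) for x in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sensitive numeric attribute (e.g., ages between 20 and 60).
ages = [rng.randint(20, 60) for _ in range(10_000)]
noisy = distort(ages, alpha=30.0, rng=rng)

true_mean = sum(ages) / len(ages)
noisy_mean = sum(noisy) / len(noisy)

# Individual values are heavily perturbed, but the two means stay close
# because the noise has zero mean and averages out over many records.
print(f"true mean  = {true_mean:.2f}")
print(f"noisy mean = {noisy_mean:.2f}")
```

    Recovering the full distribution, rather than just the mean, is what the reconstruction procedures cited above address.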

    1.1 Contributions of this Paper

    We continue the investigation of the use of randomization in developing privacy-preserving data mining techniques, and extend this line of inquiry along two dimensions:

    • categorical data instead of numerical data, and

    • association rule mining [4] instead of classification.

    We will focus on the task of finding frequent itemsets in association rule mining, which we briefly review next.

    Definition 1. Suppose we have a set $I$ of $n$ items: $I = \{a_1, a_2, \ldots, a_n\}$. Let $T$ be a sequence of $N$ transactions $T = (t_1, t_2, \ldots, t_N)$ where each transaction $t_i$ is a subset of $I$. Given an itemset $A \subseteq I$, its support $\mathrm{supp}^T(A)$ is defined as

    \[ \mathrm{supp}^T(A) := \frac{\#\{t \in T \mid A \subseteq t\}}{N}. \tag{1} \]

    An itemset $A \subseteq I$ is called frequent in $T$ if $\mathrm{supp}^T(A) > \tau$, where $\tau$ is a user-defined parameter.
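    Definition 1 translates directly into code. The following is a minimal sketch on nonrandomized data; the function names `support` and `is_frequent`, the threshold name `tau`, and the toy transactions are assumptions introduced for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset` (Eq. 1)."""
    n = len(transactions)
    if n == 0:
        raise ValueError("support is undefined for an empty transaction sequence")
    target = frozenset(itemset)
    # `target <= t` is the subset test "A is contained in t".
    return sum(1 for t in transactions if target <= t) / n


def is_frequent(transactions, itemset, tau):
    """An itemset is frequent if its support exceeds the user-defined threshold."""
    return support(transactions, itemset) > tau


# Toy sequence of N = 4 transactions over the items {a, b, c}.
T = [frozenset(s) for s in ({"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a"})]

print(support(T, {"a", "b"}))           # 0.5  (2 of 4 transactions)
print(support(T, {"a", "b", "c"}))      # 0.25 (1 of 4 transactions)
print(is_frequent(T, {"a", "b"}, 0.4))  # True
```

    This computes supports on the true transactions only; recovering them from randomized transactions is the subject of the later sections.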

    ¹ Once we have reconstructed distributions, it is straightforward to build classifiers that assume independence between attributes, such as Naive Bayes [19].

    We consider the following setting. Suppose we have a server and many clients. Each client has a set of items (e.g., books or web pages or TV programs). The clients want the server to gather statistical information about associations among items, perhaps in order to provide recommendations to the clients. However, the clients do not want the server to know with certainty who has got which items. When a client sends its set of items to the server, it modifies the set according to some specific randomization policy. The server then gathers statistical information from the modified sets of items (transactions) and recovers from it the actual associations.

    The following are the important results contained in this paper:

    • In Section 2, we show that a straightforward uniform randomization leads to privacy breaches.

    • We formally model and define privacy breaches in Section 3.

    • We present a class of randomization operators in Section 4 that can be tuned for different tradeoffs between discoverability and privacy breaches. We derive formulae for the effect of randomization on support, and show how to recover the original support of an association from the randomized data.

    • We present experimental results on two real datasets in Section 5, as well as graphs showing the relationship between discoverability, privacy, and data characteristics.

    1.2 Related Work

    There has been extensive research in the area of statistical databases motivated by the desire to provide statistical information (sum, count, average, maximum, minimum, pth percentile, etc.) without compromising sensitive information about individuals (see surveys in [1] [22]). The proposed techniques can be broadly classified into query restriction and data perturbation. The query restriction family includes restricting the size of query results, controlling the overlap amongst successive queries, keeping an audit trail of all answered queries and constantly checking for possible compromise, suppression of data cells of small size, and clustering entities into mutually exclusive atomic populations. The perturbation family includes swapping values between records, replacing the original database by a sample from the same distribution, adding noise to the values in the database, adding noise to the results of a query, and sampling the result of a query. There are negative results showing that the proposed techniques cannot satisfy the conflicting objectives of providing high quality statistics and at the same time preventing exact or partial disclosure of individual information [1].

    The most relevant work from the statistical database literature is the work by Warner [26], where he developed the "randomized response" method for survey results. The method deals with a single boolean attribute (e.g., drug addiction). The value of the attribute is retained with probability p and flipped with probability 1 - p. Warner then derived equations for estimating the true value of queries such as COUNT(Age = 42 & Drug Addiction = Yes). The approach we present in Section 2 can be viewed as a generalization of Warner's idea.

    Another related work is [25], where they consider the problem of mining association rules over data that is vertically partitioned across two sources, i.e., for each transaction, some of the items are in one source, and the rest in the other source. They use multi-party computation techniques for scalar products to be able to compute the support of an itemset (when the two subsets that together form the itemset are in different sources), without either source revealing exactly which transactions support a subset of the itemset. In contrast, we focus on preserving privacy when the data is horizontally partitioned, i.e., we want to preserve privacy for individual transactions, rather than between two data sources that each have a vertical slice.

    Related, but not directly relevant to our current work, is the problem of inducing decision trees over horizontally partitioned training data originating from sources who do not trust each other. In [16], each source first builds a local decision tree over its true data, and then swaps values amongst records in a leaf node of the tree to generate randomized training data. Another approach, presented in [18], does not use randomization, but makes use of cryptographic oblivious functions during tree construction to preserve privacy of two data sources.
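    Since Warner's randomized response [26] is the starting point for the next section, a small simulation may help. The sketch below assumes the simplest query, the fraction of respondents whose true answer is "yes", rather than the conjunctive COUNT query mentioned above. If the true fraction is π and each answer is kept with probability p, the expected observed fraction is λ = πp + (1 - π)(1 - p), so π can be estimated as (λ - (1 - p)) / (2p - 1) whenever p ≠ 1/2. The function names and the simulated population are assumptions made for illustration.

```python
import random


def randomize_bit(bit, p, rng):
    """Keep the boolean answer with probability p, flip it with probability 1 - p."""
    return bit if rng.random() < p else not bit


def estimate_true_fraction(observed_fraction, p):
    """Invert lambda = pi*p + (1 - pi)*(1 - p) to estimate the true 'yes' fraction."""
    if p == 0.5:
        raise ValueError("p = 0.5 destroys all information; the estimate is undefined")
    return (observed_fraction - (1 - p)) / (2 * p - 1)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
p = 0.8

# Hypothetical population: 3,000 of 20,000 respondents (15%) truly answer "yes".
true_bits = [i < 3000 for i in range(20_000)]
reported = [randomize_bit(b, p, rng) for b in true_bits]

observed = sum(reported) / len(reported)
estimate = estimate_true_fraction(observed, p)

print(f"observed 'yes' fraction  = {observed:.4f}")  # close to 0.29 in expectation
print(f"estimated true fraction = {estimate:.4f}")   # close to 0.15 in expectation
```

    The server never learns any individual's true answer with certainty, yet the aggregate is recoverable; the estimate becomes noisier as p approaches 1/2.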

    2. UNIFORM RANDOMIZATION

    A straightforward approach for randomizing transactions would be to generalize Warner's "randomized response" method, described in Section 1.2. Before sending a transaction to the server, the client takes each item and with probability p replaces it by a new item not originally present in this transaction. Let us call this process uniform randomization.

    Estimating true (nonrandomized) support of an itemset is nontrivial even for uniform randomization. Randomized support of, say, a 3-itemset depends not only on its true support, but also on the