A framework to preserve the privacy of electronic health data streams

Soohyung Kim (a), Min Kyoung Sung (b), Yon Dohn Chung (b,*)

(a) Department of IT Convergence, Korea University, Anam-dong, Seongbuk-gu, Seoul, Republic of Korea
(b) Department of Computer Science and Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul, Republic of Korea
(*) Corresponding author. Address: Department of Computer Science and Engineering, College of Information and Communications, Korea University, Anam-dong, Seongbuk-gu, Seoul 136-713, Republic of Korea. E-mail addresses: [email protected] (S. Kim), [email protected] (M.K. Sung), [email protected] (Y.D. Chung).

Journal of Biomedical Informatics 50 (2014) 95-106. http://dx.doi.org/10.1016/j.jbi.2014.03.015

Article history: Received 26 August 2013; Accepted 27 March 2014; Available online 4 April 2014.
Keywords: Health data stream; Privacy; Anonymization

Abstract

The anonymization of health data streams is important to protect these data against potential privacy breaches. A large number of research studies aiming at offering privacy in the context of data streams has been recently conducted. However, the techniques that have been proposed in these studies generate a significant delay during the anonymization process, since they concentrate on applying existing privacy models (e.g., k-anonymity and l-diversity) to batches of data extracted from data streams in a period of time. In this paper, we present delay-free anonymization, a framework for preserving the privacy of electronic health data streams. Unlike existing works, our method does not generate an accumulation delay, since input streams are anonymized immediately with counterfeit values. We further devise late validation for increasing the data utility of the anonymization results and managing the counterfeit values. Through experiments, we show the efficiency and effectiveness of the proposed method for the real-time release of data streams.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Recently, data streams of health records have been widely utilized in many applications, such as real-time diagnosis systems, biomedical data analysis, and health monitoring services [1-4]. Diagnosis systems, for example, diagnose patients' diseases by monitoring the behavior of their real-time health data, and biomedical data analysts study the patterns of diseases by using the data streams from bio-signal sensors. In many cases, the data streams are entrusted to third parties who analyze the data in place of the data holders, because of the effective data utilization.

Meanwhile, health data streams typically contain private attributes, such as age, gender, various bio-signals, and diagnoses. Privacy problems can arise in relation to these types of data during the transmission of the data streams to third parties, which can provide paths for a privacy attack. For example, consider a data stream with the schema (tupleID, time, age, sex, diagnosis) released from a hospital for the purpose of a biomedical study. The schema does not contain any direct identifiers, such as SSNs (social security numbers), that can distinguish individuals. However, using some background knowledge, i.e., time, age, and sex, an attacker can identify the diagnosis of the target individual [5,6].

To protect data streams from background knowledge attacks, several privacy-preserving methods have been recently proposed [7-14]. These methods ensure that the released data streams meet the requirements of typical privacy models, such as k-anonymity [5] and l-diversity [6]. Although these methods differ in the way they transform the data streams to offer privacy protection, they are all based on the tuple accumulation strategy. That is, streaming tuples are postponed until they satisfy the given publication condition, i.e., reach a certain privacy level or satisfactory data utility.
We call this type of method the accumulation-based method.

Fig. 1 shows an example of offering privacy protection in data streams by using the accumulation-based method. Assume an input stream of tuples, where a tuple consists of multiple attributes that include personal attributes, i.e., QI (quasi-identifiers) and SI (sensitive information). In the accumulation-based method, the anonymization system continues to accumulate and cluster the input tuples until the given privacy condition, such as k-anonymity and l-diversity, is satisfied. In Fig. 1, tuple 1, which has arrived at time T1 and has been accumulated in Cluster 1, waits for tuple 2. When tuple 2 arrives, it is assigned to Cluster 1, and then the cluster becomes 2-diverse. When a certain cluster satisfies the privacy condition, the tuples in the cluster are generalized and released. Then, the QIs of the tuples in the cluster become indistinguishable from each other according to the privacy condition. Consequently, an adversary cannot identify an exact tuple from the anonymized data by using the QIs (i.e., background knowledge). Thus, an adversary cannot be certain about the target's SI, although he or she can infer the list of possible SIs.

Fig. 1. Privacy protection of a data stream using the accumulation-based method (an example of offering 2-diversity).

In addition to employing accumulation to satisfy the required privacy level, most existing methods aim at accumulating more tuples than necessary, in order to reduce the information loss that is caused by the applied data generalization. For example, tuple 4 in Fig. 1 is accumulated in Cluster 3 instead of Cluster 2, because if it was accumulated in Cluster 2 this would lead to increased information loss.

There are two inevitable delay problems in the accumulation-based methods: accumulation delay and computation delay. The former is directly relevant to the accumulation of the input tuples. For example, in Fig. 1, the release of tuple 3 is delayed until tuple 6 has arrived. The latter is due to the time taken to calculate the information loss. To assign an input tuple to an appropriate cluster that generates the minimum information loss, the information loss of every cluster must be calculated. Computing the information loss of all clusters takes a significant amount of time, which prohibits the release of large data streams in real time.

To solve these problems, in this paper, we propose a DF (delay-free anonymization) framework to preserve the privacy of electronic health data streams in real time. We consider two types of attributes of an input tuple: the quasi-identifier (QI) and the sensitive information (SI). In the DF framework, QIs are released with artificially generated l-diverse SI values. For example, in Fig. 2, tuple 1 is anonymized with ST1, which is the l-diverse set of the SI. We organize the set with the original SI from the input tuple and the artificially generated SI. Then, QI1 has multiple SIs (SIa and SId), which satisfies 2-diversity. Thus, attackers cannot identify the exact SI of the target QI. In this manner, our anonymization process preserves the privacy of the health data streams.

Fig. 2. Privacy protection of a data stream using the delay-free anonymization approach (an example of offering 2-diversity).

We would like to note that there is no accumulation delay in DF because it immediately anonymizes the input tuples and releases them. This is possible because there is no tuple accumulation in our anonymization process.
Furthermore, in comparison with the accumulation-based methods, DF generates remarkably little computation delay, because DF executes only simple operations (i.e., referring to the sensitive table and managing it) instead of complicated operations (e.g., calculating the information loss).

We also considered the data utility of the anonymization results. Our anonymization method, generating l-diverse SI values, generates counterfeit values. For example, in Fig. 2, SId in ST1 is a counterfeit at T1. Counterfeit values function as a protection mechanism to prevent the disclosure of the correct SI values; however, they also decrease data utility. To ameliorate this issue, we propose late validation, which utilizes the SI values of the newly entered tuples to validate the existing counterfeits. In Fig. 2, SId in ST1 is late validated at T4. After tuple 4, whose SI is SId, is anonymized using ST1, SId in ST1 is considered a real value. To measure the data utility, including the utility loss from the counterfeits, we further devise sensitive attribute uncertainty as the quality metric for DF.

The contributions of our delay-free anonymization framework are summarized as follows:

- Immediate release: DF facilitates tuple-by-tuple release with the guarantee of l-diversity. DF is characterized by no accumulation delay and a low computation cost, since it immediately releases the data streams and requires only simple operations. The immediate release is a big advantage for applications that utilize real-time data streams.
- High-level data utility: To anonymize data streams, DF artificially generates l-diverse SI values instead of generalizing QIs. Thus, DF does not generate the typical information loss with respect to QIs. In addition, the counterfeits in the sensitive table are also minimized by late validation.

The rest of this paper is organized as follows. In Section 2, we present the related work in this area. In Section 3, we define the data model and propose DF. In Section 4, we present a late validation scheme for effective anonymization, and the development of the DF algorithm. In Section 5, we discuss data utility issues in the context of our proposed framework. Section 6 presents our experimental evaluation and Section 7 concludes this work.

2. Related work

A large number of research studies in the area of data privacy has been conducted in the last decade. In what follows, we first present methods that have been proposed to offer privacy in the context of static relational data. Following that, we discuss approaches that have been proposed for offering dynamic data privacy and data stream privacy. The latter are directly relevant to our approach.

2.1. Privacy for static, relational data

A significant amount of work has been conducted in the area of privacy for relational data. Several studies (e.g., [5,15,16]) proposed methods that employ k-anonymity to block re-identification attacks, in which individuals in the released microdata are identified based on QIs. To solve this problem, k-anonymity makes k tuples indistinguishable by generalizing the QIs. l-diversity is a model that solves the lack of diversity in k-anonymity [6]. It makes SIs l-diverse with indistinguishable QIs. Anatomy [17] is a technique that also uses the l-diverse tuples similarly to the method proposed in [6]; however, it does not generalize QIs.
Thus, although an attacker knows the exact target QI is included in the dataset, she or he cannot distinguish which SI belongs to the target.

The authors of t-closeness [18] proposed a method that addresses certain limitations of l-diversity. Two possible attacks can be made against l-diversity: a skewness attack and a similarity attack. In order to prevent these attacks, t-closeness requires that the distribution of the sensitive attribute values meets a certain condition.

The above-mentioned studies deal only with relational data and do not consider any insertion, deletion or update operations applied on these data. As a result, these methods are not applicable in the context of data streams.

2.2. Dynamic data privacy

Insertions, deletions, and updates frequently occur in real datasets. Several methods have been proposed to protect dynamically evolving datasets, ensuring that the latest version of the data can be safely published. It is important to note that previous privacy-protection methods, involving static data, are not applicable in the context of dynamic datasets because they assume only a single data release. In dynamic datasets, a new type of privacy problem can arise because of the potential linkage between different data releases. Therefore, several privacy preservation techniques [19-21] protect the privacy of dynamic releases by considering their context or maintaining extra information.

Byun et al. presented an anonymization method for dynamic data; however, this work considered only insertion operations [19]. Xiao et al. in [20] proposed m-invariance, a privacy method which considers both insertion and deletion operations. Bu et al. addressed another privacy problem that occurs by the permanent sensitive values in the dynamic datasets [21].

The methods proposed in these studies deal with multiple releases, which are analogous to data streams. However, they cannot be employed to anonymize data streams since they deal with the anonymization of the entire table and cannot be applied on data that becomes available in real time.

2.3. Data stream privacy

Recently, there have been several studies on preserving the privacy of data streams. Data streams consist of sequences of tuples transmitted in real time, and hence, the assumption concerning the requirements of the privacy preservation technique is different from that in the previous privacy protection methods. To preserve the privacy of data streams, the unit of anonymization must be a tuple, and the anonymization should be performed in real time.

To the best of our knowledge, the first study that dealt with the privacy preservation of data streams is [9], in which the specialization tree for accumulating data streams was introduced. Nodes in the tree are classified into candidate nodes and work nodes, and data streams are generalized by the node structure.

The CASTLE scheme generates clusters in which the tuples of data streams are accumulated [7]. Then, it releases the clusters when the size of the accumulation reaches the delay constraint, d. Basically, CASTLE guarantees k-anonymity, and also provides an algorithm for l-diversity.

The method presented in [8] uses a probability function to determine the release of data streams; it also accumulates data from data streams and decides on the release time according to this function.
The function promotes releases when the accumulated data experience less information loss but a longer delay. For the sake of data utility, the method proposed in this study also makes a prediction on the data that may appear in the streams by considering the data distribution.

The authors of SABRE [14] presented an algorithm for anonymizing microdata to satisfy t-closeness, and also extended the algorithm to anonymize data streams. The SABRE framework maintains sliding windows that function as buffers of the input tuples. When tuples in a certain window are expired due to newly inserted tuples, old tuples should be released. These tuples are anonymized by SABRE and released to an output stream.

The authors of B-CASTLE [10] presented an improved algorithm of CASTLE [7]. The algorithm of B-CASTLE considers the distribution of data in data streams when the input data is clustered, in an effort to improve data utility. The method presented in [12] also considers the density distribution to improve the utility of the data in the anonymized stream. The method constructs a TDS-tree (top-down specialization tree) of QIs, and considers the density distribution of tuples in each leaf of the tree. The authors of SANATOMY [11] employed the bucketization approach (i.e., Anatomy [17]) to anonymize data streams, instead of the generalization approach. SANATOMY preserves the data correlation, due to the bucketization approach. The authors of [13] pointed out that the previous studies had been based on time-consuming algorithms due to the k-anonymity model. The algorithm presented in [13] is an efficient anonymization algorithm, which is cluster-based and supports numerical data streams.

All the studies on data stream privacy preservation thus far have been based on the tuple accumulation strategy. Consequently, the existing methods suffer from several limitations in terms of the real-time release of data and their applicability in a distributed setting, as also mentioned in Section 1.

3. The delay-free anonymization framework

In this section, we present the delay-free anonymization (DF) framework for protecting the privacy of the data in health data streams. First, we define the data model of data streams. Then, we describe our proposed method in detail.

3.1. Data model

We assume real-time health data streams containing personal information. Among the various attributes in a data stream, we focus on only three main attributes in order to simplify the problem: tupleID, QI, and SI. According to the attributes, a data stream can be described as

(tupleID, QIs, SI)

TupleID indicates the unique number of a tuple. It is usually removed before releasing the data stream, because it can be an identifier of the tuple. The QI is an attribute of an individual in the tuple, such as age, sex, nationality, and zipcode. The SI is the private attribute, which should not be directly related to the individual, such as bio-signals and diagnosis. In the following definition of the data stream, we omit the tupleID, since it is consequentially removed in the anonymization process.

Definition 1 (Data stream). Let QI be the quasi-identifier attributes of a tuple, where QI = {qi_1, qi_2, ..., qi_i}, and let si be the sensitive information of a tuple. We define a data stream S as a set of tuples (QI, si).
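As an illustration of this data model, the sketch below shows one possible in-memory representation of a stream tuple in Python. It is our own illustration rather than part of the framework; the attribute values follow the running example of the paper, while the class and function names (StreamTuple, read_stream) are assumptions introduced here.

import random
from typing import NamedTuple, Tuple, Iterator

class StreamTuple(NamedTuple):
    """One record of a health data stream: (tupleID, QI, si) per Definition 1."""
    tuple_id: int     # unique tuple number, removed before release
    qi: Tuple         # quasi-identifier values, e.g. (age, sex)
    si: str           # sensitive information, e.g. a diagnosis

def read_stream(records) -> Iterator[StreamTuple]:
    """Wrap raw (tupleID, QI, si) records arriving in real time."""
    for tid, qi, si in records:
        yield StreamTuple(tid, tuple(qi), si)

# example: a two-tuple stream matching the running example of the paper
stream = read_stream([(1, (24, "male"), "Diag.A"),
                      (2, (32, "female"), "Diag.B")])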
3.2. Delay-free anonymization

The goal of the delay-free anonymization scheme is to organize l-diverse data streams in real time, and guarantee that the probability of guessing the individual's sensitive information is equal to or less than 1/l. In order to anonymize data streams, we adopt the ideas behind the Anatomy technique [17], which preserves privacy by dividing the QI and SI. Then, we inflate each si of the input tuples into l-diverse values. Assume that a tuple t(QI_t, si_t) has arrived. DF generates a QIT (quasi-identifier tuple) and an ST (inflated sensitive tuple) and releases them immediately. The QIT is generated from QI_t in the tuple t, and the ST is an l-diverse set of the artificial si that contains si_t. The QIT and ST can be joined by the join key groupID.

Definition 2 (Delay-free anonymization). Given a tuple t(QI_t, si_t) from a data stream and a domain of the sensitive information DSI, DF generates the QIT and the l-diverse ST. The schema of the QIT is

(groupID, QI)

where groupID is a positive integer and QI is from QI_t. For a groupID j, QIT_j has the schema (j, QI). The schema of the ST is

(groupID, si, count)

where groupID and count are positive constants, and si is an element of DSI. For a groupID j, ST_j has the schema (j, si, count(si)), where count(si) denotes the number of si.

Example 1 (Delay-free anonymization). Let us assume that a data stream ((age, sex), diagnosis) is generated by the hospital and the anonymization result of the data stream must satisfy 2-diversity. Given that a tuple t1((24, male), Diag.A) has arrived, DF generates the QIT (1, (24, male)), where 1 is a randomly generated groupID. Diag.A, the SI of t1, is inflated into an l-diverse ST (1, Diag.A, 1) and (1, Diag.B, 1) with groupID 1 and count 1. The QI of t1 has 2-diverse SIs as a result of applying DF. Thus, an adversary cannot find the exact sensitive value of (24, male) with a probability greater than 1/2. Therefore, the result of the anonymization satisfies 2-diversity.

It is important to mention that Diag.B, the inflated SI, is not a real value. It is counterfeit and invalid, because it is arbitrarily generated from the domain of the sensitive information to preserve privacy. The counterfeits can decrease the utility of anonymized results; thus, in Section 4, we present an effective mechanism for the generation and management of counterfeits, which aims to maintain high data utility.
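To make Definition 2 and Example 1 concrete, the following sketch generates a QIT row and an l-diverse ST for a single input tuple. It is our own illustration, not the authors' implementation; counterfeits are drawn uniformly from a given sensitive-value domain here, whereas Section 4.2 proposes drawing them from past data.

import random

_next_group_id = 1   # simple groupID generator (assumption of this sketch)

def anonymize_tuple(qi, si, si_domain, l):
    """Release one (groupID, QI) row and an l-diverse ST for a single tuple (Definition 2)."""
    global _next_group_id
    gid = _next_group_id
    _next_group_id += 1
    st = {si: 1}                          # the tuple's real sensitive value
    while len(st) < l:                    # add l-1 distinct counterfeit values
        fake = random.choice(si_domain)   # assumes len(si_domain) >= l
        if fake not in st:
            st[fake] = 1
    qit_row = (gid, qi)
    st_rows = [(gid, s, c) for s, c in st.items()]
    return qit_row, st_rows

# Example 1: t1 = ((24, 'male'), 'Diag.A') anonymized under 2-diversity
qit, st = anonymize_tuple((24, "male"), "Diag.A", ["Diag.A", "Diag.B", "Diag.C"], l=2)
# qit -> (1, (24, 'male'));  st -> e.g. [(1, 'Diag.A', 1), (1, 'Diag.B', 1)]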
3.3. Privacy preservation

In order to protect the privacy of data streams, we aim at guaranteeing l-diversity on DF, which allows adversaries to guess the individual's sensitive information with a probability equal to or less than 1/l. Since the aim of l-diversity on DF is slightly different from the traditional l-diversity, in this section, we discuss the privacy preservation achieved by DF and define l-diversity in the context of DF.

First, in order to guarantee the 1/l probability, it is essential that the number of distinct si values in an arbitrary group is equal to or greater than l. Consider an arbitrary group j and its sensitive tuples ST_j. The first condition for l-diversity on ST_j is

l \le |ST_j|    (1)

It should be noted that |ST_j|, which denotes the size of ST_j, also represents the number of distinct values in ST_j, because there is no duplicated si in ST_j (we use the count value to describe repeated sensitive values).

Second, the frequency of a sensitive value should be considered. Let us focus on the count attribute in the ST. In Example 1, the count value of each si is 1. Thus, the ratio of Diag.A to Diag.B is 1:1 for the QI of t1. It is 2-diverse, because the attacker who has background knowledge (the QI of t1) cannot identify an exact sensitive value with more than 1/2 probability. Meanwhile, let us assume the count value of Diag.A is 2 and that of Diag.B is 1. Then, the ratio becomes 2:1 and the probability that the QI of t1 corresponds to Diag.A becomes 2/3, which is greater than 1/2.

We can derive the conclusion that the count values in the ST should be generated such that they guarantee the 1/l probability. Consider an arbitrary group j and its sensitive tuples ST_j. The probability that an arbitrary si_i is the sensitive value of a given qi is P(SI_qi = si_i) = count(si_i) / ||ST_j||, where ||ST_j|| is the sum of all counts in group j. It should be noted that this probability denotes the proportion of the arbitrary sensitive value si_i among all sensitive values in ST_j. In order to protect privacy, the probability of every si in ST_j must be equal to or less than 1/l. Now, we can determine the count value:

P(SI_qi = si_i) \le 1/l
count(si_i) / ||ST_j|| \le 1/l
count(si_i) \le ||ST_j|| / l    (2)

Finally, we define l-diversity on DF, which formalizes the two conditions above.

Definition 3 (l-diversity on DF). Let us assume that an anonymized result by the DF framework consists of QIT (groupID, QI) and ST (groupID, si, count). Consider that ST_j (j, si, count(si)) is an arbitrary subset of the ST whose groupID is j. The anonymized result satisfies l-diversity on DF if and only if l \le |ST_j| and count(si) \le ||ST_j|| / l for every st_j \in ST_j, where |ST_j| is the number of distinct si values in group j and ||ST_j|| is the sum of all counts in group j.
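A direct transcription of Definition 3 into code might look as follows; this checker is our own sketch, with ST_j represented as a dictionary mapping each distinct sensitive value to its count.

def satisfies_l_diversity_on_df(st_j: dict, l: int) -> bool:
    """Check Definition 3 for one group j: st_j maps each distinct si to count(si)."""
    distinct = len(st_j)           # |ST_j|: number of distinct sensitive values
    total = sum(st_j.values())     # ||ST_j||: sum of all counts in the group
    if distinct < l:               # condition (1): l <= |ST_j|
        return False
    # condition (2): every count(si) <= ||ST_j|| / l
    return all(c <= total / l for c in st_j.values())

# the two situations discussed above for Example 1 (l = 2):
print(satisfies_l_diversity_on_df({"Diag.A": 1, "Diag.B": 1}, 2))   # True  (ratio 1:1)
print(satisfies_l_diversity_on_df({"Diag.A": 2, "Diag.B": 1}, 2))   # False (2 > 3/2)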
4. Effective counterfeit management

At least l-1 counterfeit values are generated when anonymizing a tuple by l-diversity. In this section, we propose the late validation method, which reduces the generation of counterfeits by utilizing newly input tuples. Then, we show the development of an algorithm for DF with late validation.

4.1. Sensitive group management

Late validation is a process that replaces an existing counterfeit with a valid sensitive value using an input tuple that will be anonymized. For example, let us assume an input tuple t1 has been anonymized as in Example 1. The anonymized results QIT and ST might be interpreted by applications as

"The sensitive value of (24, M) is related to ST1."

There is only one tuple in QIT1, whereas the sum of the counts in ST1, ||ST1||, is 2, which means there is one counterfeit sensitive value (Diag.B) in addition to the real value (Diag.A) in ST1. Then, assume an input tuple t2 has arrived and its sensitive value is Diag.B. If t2 is anonymized according to DF as described in Section 3.2, QIT2 and ST2, which includes Diag.B as a sensitive value, are generated. Now, we need to focus on the counterfeit value of ST1, Diag.B. Using this value, instead of anonymizing the input tuple, we can relate Diag.B to ST1 as

"The sensitive value of (32, F) is related to ST1."

This means that t2 has been included in group 1 and there are two tuples in QIT1. Since ||ST1|| is 2, applications might interpret the result as both Diag.A and Diag.B being valid values. In this manner, the generation of counterfeits can be reduced and existing counterfeits can be late validated with real values.

Meanwhile, to achieve correct validation, the limit of the count value must be considered while late validating. For example, assume that the sensitive value of t2 is Diag.A, not Diag.B. Since the count of Diag.A in ST1 is 1, t1 occupies the count. Hence, t2 must not be released with QIT1 as in the above case. Thus, the number of tuples that can be late validated must be restricted to the count of the sensitive value. To make this possible, the metadata of released tuples should be maintained in the anonymization system. We define this information as released tuples RT.

Definition 4 (Released tuples). Given groupID and si, the schema of an RT is

(groupID, si, count_r)

where count_r(si) is the number of the sensitive information elements of the tuples that have already been released. RT_j, which denotes the RT with the groupID j, is generated with ST_j. Thus, |RT_j| = |ST_j| always holds. The RT is information that is utilized only in the anonymizing process, but is never released. Now, we can define late validation as follows.

Definition 5 (Late validation). Given QIT_j, ST_j, RT_j, and an input tuple t(tupleID_t, QI_t, si_t), t can be released as (j, QI_t) if and only if count(si_t) and count_r(si_t) exist, count_r(si_t) < count(si_t), and QI_t is not in QIT_j.

In the anonymized counterpart of a data stream, an individual can appear multiple times, unlike in those of relational data. However, when multiple QIs related to an individual appear in the same group, the l-diversity requirement cannot be guaranteed, since the adversaries can infer the sensitive information with a probability greater than 1/l. In order to avoid the above problem, the late validation method prevents multiple, repeated deployment of the same QIs in a group by checking that QI_t is not in QIT_j.
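The sketch below illustrates the late-validation test of Definition 5. The dictionary-based bookkeeping for ST, RT, and the released QIs of each group is our own simplification and not the authors' data structures.

def try_late_validation(groups, qi, si):
    """Release (groupID, qi) against an existing counterfeit if Definition 5 holds.

    groups maps each groupID j to a dict with:
      'st'  : {si: count(si)},  'rt' : {si: count_r(si)},  'qis' : set of released QIs
    Returns the chosen groupID, or None if no counterfeit can be validated.
    """
    for gid, g in groups.items():
        st, rt, qis = g["st"], g["rt"], g["qis"]
        if si in st and rt.get(si, 0) < st[si] and qi not in qis:
            rt[si] = rt.get(si, 0) + 1   # one more real tuple now backs this value
            qis.add(qi)
            return gid                    # release (gid, qi); no new counterfeits needed
    return None

# continuing Example 1: group 1 holds ST1 = {Diag.A:1, Diag.B:1}, RT1 = {Diag.A:1, Diag.B:0}
groups = {1: {"st": {"Diag.A": 1, "Diag.B": 1},
              "rt": {"Diag.A": 1, "Diag.B": 0},
              "qis": {(24, "male")}}}
print(try_late_validation(groups, (32, "female"), "Diag.B"))   # 1    -> Diag.B becomes valid
print(try_late_validation(groups, (40, "male"), "Diag.A"))     # None -> count limit reached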
4.2. Sensitive value generation

In Section 3.2, the l-diverse sensitive values are arbitrarily generated when the ST is organized. The arbitrary generation of sensitive values may have a significant impact on data utility and privacy protection.

The generated values significantly affect the data utility resulting from the process of late validation, because late validation is possible only when a generated sensitive value matches the value of an input tuple. For the sake of data utility, it is important to generate sensitive values that are likely to appear in the input stream. A good prediction of the input data would be the best way to increase late validations. To achieve this, in this work we assume that the future is reflected by the past. More specifically, we utilize past data for generating the sensitive values. First, we prepare a pool of sensitive values from past data streams. An element of the pool consists of (si, count(si)), which is analogous to an element of the ST. Then, by randomly choosing elements from the pool, the l-diverse ST is organized.

It is known that the l-diversity model suffers from two attacks (the skewness attack and the similarity attack) in terms of privacy preservation. Consider a group that has two values: flu and cough. Although it satisfies 2-diversity, the patient's privacy may not be preserved since the values are semantically similar. As another example, consider 3-diverse values of blood pressure (150, 151, and 153). The patient's privacy may not be preserved since the range of the values is too dense. The privacy of l-diverse data is not properly preserved when the sensitive values are similar. In order to block such attacks, Ninghui Li et al. proposed the concept of t-closeness [18], which considers the distribution of the sensitive values. However, this concept is beyond the scope of this paper, since we focus primarily on the l-diversity model.

4.3. The algorithm

In this section we propose an algorithm for delay-free anonymization. The algorithm is separated into two parts: late validation (lines 9-12) and counterfeit generation (lines 14-28). When a tuple has arrived, the algorithm finds an existing counterfeit value that can be validated by the input tuple according to Definition 5 (line 7). If there is a suitable counterfeit, the input tuple is released in the form of a QIT with the groupID of the counterfeit (line 9). Then, the RT is updated with an increased count value (line 12). If there are no counterfeits that can be validated, the algorithm generates l-diverse counterfeit sensitive values. First, the QIs of the tuple are released in the form of a QIT with an assigned groupID (line 14). Then, the function l-diverse_ST_Generation generates l-diverse sensitive values including the sensitive value of the input tuple (line 17). Since this set already satisfies l-diversity under Definition 3, the pairs of the set can be released in the form of an ST with the assigned groupID (line 19). Then, in order to maintain metadata for late validation, the RT is stored in the anonymization system (lines 21-24). The count value of a counterfeit sensitive value is initialized as 0, while the count value of the input tuple's sensitive value is initialized as 1.

Algorithm 1. DELAY-FREE ANONYMIZATION(T, l)
1:  QIT = {}; ST = {}; RT = {}
2:  gcnt = 0; qcnt = 0; scnt = 0
3:  let qit, st, and rt be elements of QIT, ST, and RT, respectively
4:  while the stream T is not empty do (let t be the next tuple)
5:      let QI_t and si_t be t[QI] and t[si], respectively
6:      let QIS_ID be the set of all qit[QI] where qit[groupID] = ID
7:      let st_i be a randomly chosen st where st[si] = si_t, rt_i[count] < st_i[count], and QI_t is not in QIS_st[groupID]
8:      if st_i is not null then
9:          release qit_qcnt(st_i[groupID], QI_t)
10:         insert qit_qcnt into QIT
11:         qcnt = qcnt + 1
12:         update rt_i to (rt_i[groupID], rt_i[si], rt_i[count] + 1)
13:     else
14:         release qit_qcnt(gcnt, QI_t)
15:         insert qit_qcnt into QIT
16:         qcnt = qcnt + 1
17:         let LST be the l-diverse set of (si, count) pairs returned by l-diverse_ST_Generation(si_t, l)
18:         for all lst in LST do
19:             release st_scnt(gcnt, lst[si], lst[count])
20:             insert st_scnt into ST
21:             if lst[si] = si_t then
22:                 insert rt_scnt(gcnt, lst[si], 1) into RT
23:             else
24:                 insert rt_scnt(gcnt, lst[si], 0) into RT
25:             end if
26:             scnt = scnt + 1
27:         end for
28:         gcnt = gcnt + 1
29:     end if
30: end while

Function l-DIVERSE_ST_GENERATION(si_t, l)
1:  let SV be the domain of sensitive values
2:  let LST be a set of (si, count) pairs
3:  insert (si_t, 1) into LST
4:  while |LST| < l or ||LST||/l < lst[count] for some lst in LST do
5:      let sv be a randomly chosen element from SV
6:      let lst_i be the element of LST where lst_i[si] = sv
7:      if lst_i is null then
8:          insert (sv, 1) into LST
9:      else
10:         update lst_i to (lst_i[si], lst_i[count] + 1)
11:     end if
12: end while
13: return LST

l-diverse_ST_Generation is a function that generates an LST (l-diverse ST) according to Definition 3. First, it inserts si_t into the LST (line 3), because the input tuple's sensitive value should be included. Then, counterfeit sensitive values are inserted into the LST (lines 5-10) until it satisfies the conditions of l-diversity for the LST (line 4). The counterfeits are randomly chosen from SV, which is the domain of sensitive values. In this process, if the chosen sensitive value already exists in the LST, the count of the value is increased instead of the sensitive value being inserted into the LST (line 10). We utilize the sampled past dataset for SV, as we mentioned in Section 4.2.
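For readers who prefer executable code, the following Python rendering of l-diverse_ST_Generation follows lines 1-13 of the function above. It is a sketch under the assumption that the sensitive-value pool SV is given as a list sampled from past data (Section 4.2); the function and variable names are ours.

import random

def l_diverse_st_generation(si_t, l, sv_pool):
    """Generate an l-diverse LST of (si, count) pairs containing si_t (Function, lines 1-13)."""
    lst = {si_t: 1}                              # line 3: the input tuple's own value
    # line 4: loop until |LST| >= l and every count <= ||LST|| / l
    while len(lst) < l or max(lst.values()) > sum(lst.values()) / l:
        sv = random.choice(sv_pool)              # line 5: candidate from the SV domain
        if sv not in lst:
            lst[sv] = 1                          # line 8: new counterfeit value
        else:
            lst[sv] += 1                         # line 10: repeated value, bump its count
    return list(lst.items())

# e.g. l = 3 with a pool sampled from past streams (values are illustrative)
pool = ["Diag.A", "Diag.B", "Diag.C", "Diag.D"]
print(l_diverse_st_generation("Diag.A", 3, pool))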
5. Data utility considerations

In this section, we discuss data utility considerations in the context of DF. In Section 5.1, we compare the DF method and accumulation-based methods with respect to information loss. In Section 5.2, we discuss the sensitive attribute uncertainty generated by counterfeit sensitive values and devise a measure to evaluate the uncertainty.

5.1. Discussion of information loss

First, in Section 5.1.1, we present the existing metrics for measuring information loss. Then, in Section 5.1.2, we extend the metric to measure the information loss of the sensitive attribute.

5.1.1. Information loss

Existing information loss metrics are designed to measure the generalization quality of QI attributes [22]. QI attributes are classified into continuous and categorical attributes. They are represented as an interval and a taxonomy tree, respectively. The information loss of a categorical and a continuous attribute is computed as follows:

IL_a(a_cat) = \frac{|M_P| - 1}{|M| - 1}    (3)

IL_a(a_con) = \frac{U_i - L_i}{U - L}    (4)

M denotes the set of leaf nodes in the tree and M_P denotes the set of leaf nodes of the generalized node. U_i and L_i denote the upper bound and the lower bound of the generalized interval, respectively. U and L denote the maximum and the minimum value of the whole domain, respectively.

The information loss of a tuple is defined as the average value of each attribute's information loss. The information loss is

IL = \frac{1}{n} \sum_{i=1}^{n} IL_a(a_i)    (5)

where n is the number of QI attributes.

5.1.2. Information loss considering the sensitive attribute

DF preserves the QI values. Hence, the above information loss metric, which considers only the distortion of QI attributes, is not appropriate for measuring the quality of DF. To verify this, we join QIT and ST, and QIT and RT, by the join key groupID. The schema of QIT joined with ST for a groupID j is

TA_j (QI, si, count(si))

Let us consider the information loss incurred when an input tuple t is anonymized into TA_j by DF. There might be multiple elements in TA_j, with l-diverse sensitive values. Let tA_j be an arbitrary element among TA_j; then, IL(tA_j) is always zero, because DF does not distort QI attributes.

Instead of distorting QI values, DF inflates sensitive values into an l-diverse ST. In other words, as t(QI, si) has been anonymized, the original QI gains multiple sensitive values. This means that the correlation between QI and si diminishes, because counterfeit values are added to the original sensitive value. Therefore, to measure the quality of DF, the metric considers the inflation of sensitive values. We measure the information loss based on the size of the counterfeit values over the size of all sensitive values, including counterfeits and the original value, in the group. Each size corresponds to the number of distinct values in a given set. It should be noted that the size of the counterfeit sensitive values is always |ST_j| - 1, since there is always one original value among all the sensitive values in a group. According to this metric, we define the information loss of the sensitive attribute as

IL_a(a_sens) = \frac{|ST_j| - 1}{|ST_j|}    (6)

Then, to measure the information loss of the tuple considering the sensitive attribute, we apply Eq. (6) to the existing information loss metric (i.e., Eq. (5)):

IL(t) = \frac{1}{n + 1} \left[ \sum_{i=1}^{n} IL_a(a_i) + IL_a(a_sens) \right]    (7)

As compared with Eq. (5), Eq. (7) considers one more attribute (i.e., the sensitive attribute).

Example 2 (Information loss considering the sensitive attribute). Let us revisit Example 1. There is one QI (24, male) that consists of two attributes (i.e., age and sex). There are two sensitive values (Diag.A and Diag.B) in group 1. According to Eq. (6), the information loss of the sensitive attribute is 1/2. Thus, according to Eq. (7), the information loss of the tuple is computed as 1/(2+1) x [(0 + 0) + 1/2] = 1/6.
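The calculation in Example 2 can be reproduced with a few lines of code. The helper below is our own illustration of Eqs. (5)-(7), assuming the per-QI losses are already available (under DF they are all zero).

def information_loss_with_sensitive(qi_losses, st_size):
    """Eq. (7): average loss over the n QI attributes plus the sensitive attribute."""
    il_sens = (st_size - 1) / st_size        # Eq. (6): counterfeits over all values in ST_j
    n = len(qi_losses)
    return (sum(qi_losses) + il_sens) / (n + 1)

# Example 2: two undistorted QI attributes (age, sex) and |ST_1| = 2
print(information_loss_with_sensitive([0.0, 0.0], 2))   # 0.1666... = 1/6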
5.2. Discussion of sensitive attribute uncertainty

As a result of the DF anonymization, counterfeit sensitive values are generated. These counterfeit sensitive values affect data utility since they lower the statistical value of the anonymized data. For example, consider the query "count the records whose blood pressure is higher than 200 for the past one hour". The query result might have some uncertainty due to the counterfeit blood pressure values. In order to evaluate the uncertainty of the sensitive values, we need a new measure. We define the uncertainty as the proportion of counterfeit values among all generated values.

Meanwhile, the objective of late validation is to reduce the number of counterfeits, and thus lower the uncertainty. Since late validation can occur only when a matched input tuple has arrived, the uncertainty varies according to the arrival time of the matched input tuple. Thus, the measure of the uncertainty is dependent on time.

In what follows, we define Sensitive Attribute Uncertainty (SAU), which is the uncertainty of sensitive values in a given time interval T0 to T, where T0 denotes the time of the initial state of the anonymization process and T denotes the current time. SAU is computed as (the number of counterfeit sensitive values from T0 to T) / (the number of all generated sensitive values from T0 to T). The equation can be represented by the count values of RT and ST as follows, where st[count](T) and rt[count](T) denote the count value of st from T0 to T and the count value of rt from T0 to T, respectively:

SAU(T) = \frac{\sum_{st \in ST} st[count](T) - \sum_{rt \in RT} rt[count](T)}{\sum_{st \in ST} st[count](T)}    (8)
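Eq. (8) can be evaluated directly from the released ST and the internal RT. The sketch below is our own reading of the formula, with each table represented as a list of the count values observed up to time T.

def sensitive_attribute_uncertainty(st_counts, rt_counts):
    """Eq. (8): share of counterfeit values among all sensitive values generated so far.

    st_counts: count values of every ST entry generated in [T0, T]
    rt_counts: count_r values of every RT entry in [T0, T] (tuples backed by real values)
    """
    generated = sum(st_counts)
    validated = sum(rt_counts)
    return (generated - validated) / generated

# after Example 1 alone: ST1 has counts [1, 1], RT1 has counts [1, 0] -> SAU = 0.5
print(sensitive_attribute_uncertainty([1, 1], [1, 0]))
# after the late validation of Diag.B (Section 4.1): RT1 becomes [1, 1] -> SAU = 0.0
print(sensitive_attribute_uncertainty([1, 1], [1, 1]))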
6. Experiments

We evaluated the effectiveness and the efficiency of DF through a set of experiments using two datasets: the Adult dataset from the UCI repository, and the NPS dataset from HIRA (the Health Insurance Review and Assessment service in Korea).

The Adult dataset consists of 32,561 tuples and has 14 attributes, of which we utilized the following 9 attributes: age, education, workclass, marital, race, sex, native-country, salary, and occupation. We classified these attributes into continuous, categorical, and sensitive attributes. We considered the first seven attributes to be QIs and the remaining two attributes to be sensitive attributes. In the experiments, we merged the two sensitive attributes, salary and occupation, into a single attribute salary-occupation to enlarge the domain size of the sensitive attribute, so that the value of the attribute is represented as {salary, occupation}. (We do not consider multiple sensitive values in this paper.) The domain size of the sensitive attribute becomes 30.

The NPS (National Patients Sample) dataset consists of electronic medical records of Korean people sampled at a 3% rate. We analyzed 117,047 tuples, which are the records of a single day (September 14, 2011). We utilized the following 6 attributes: age, sex, length of stay, location of the hospital, admission route, and disease. We considered the first five attributes to be QIs and the remaining disease attribute to be a sensitive attribute.

Each of Adult-d and NPS-d denotes a subset of the dataset where the first d attributes among the original attributes are considered as QIs. Table 1 and Table 2 show the type and the number of distinct values of the attributes. The experiments were conducted on a computer equipped with an Intel i3 3.07 GHz CPU and 8 GB RAM.

Table 1. Attributes of the Adult dataset.

  Attribute            Type         No. of distinct values
  Age                  Continuous   74
  Education            Continuous   16
  Workclass            Categorical  8
  Marital              Categorical  7
  Race                 Categorical  5
  Sex                  Categorical  2
  Native-country       Categorical  40
  Salary-occupation    Sensitive    30

Table 2. Attributes of the NPS dataset.

  Attribute                  Type         No. of distinct values
  Age                        Continuous   107
  Length of stay             Continuous   53
  Sex                        Categorical  2
  Location of the hospital   Categorical  16
  Admission route            Categorical  6
  Disease                    Sensitive    4848

6.1. Comparative study

We measured the average processing time per tuple (APTT) and the information loss in order to compare the DF method with accumulation-based methods. The average processing time per tuple is the average elapsed time from the input of the tuples to their release. We implemented an accumulation-based method based on CASTLE [7], which is a typical accumulation-based method. CASTLE has an important parameter d, which represents a delay constraint, i.e., the number of tuples that would be accumulated for the anonymization. The value of d varies from 10 to 10,000, and we represent it as ABM-n, which signifies an accumulation-based method with d = 10^n. For example, ABM-2 indicates the accumulation-based method with d = 10^2. For reference, the default setting in CASTLE is ABM-4, i.e., d = 10^4. The value of l is fixed to 10.

Fig. 3 illustrates the average processing time per tuple of DF and the accumulation-based methods. In the case of the accumulation-based methods, the average processing time per tuple increases as the value of d increases, since the accumulation continues until the number of input tuples reaches d. This proves that the accumulation generates the delay. In the result on the Adult dataset, DF takes about 0.037 ms to anonymize a tuple, regardless of the dimensionality, which is less time than ABM-1 takes (about 0.18 ms), even though ABM-1 accumulates only 10 tuples for offering 10-diversity. In the case of the NPS dataset, the average processing time per tuple of DF is about 0.09 ms, which is similar to that of ABM-1 (about 0.8 ms).

Fig. 3. Average processing time per tuple.

Fig. 4 presents the information loss of each method. DF shows significantly lower values than the accumulation-based methods. The information loss of DF decreases as the number of QIs grows. The reason is that the loss is affected only by the sensitive attributes, and their portion decreases as the dimensionality grows. In contrast, the information loss of the accumulation-based methods tends to increase as the dimensionality grows, because the loss is affected by QIs as well as by sensitive attributes.

Fig. 4. Information loss.

Fig. 5 illustrates the average processing time per tuple of each method for increasingly larger subsets of the datasets. In Fig. 5a and b, the experiments are conducted on Adult-5 and NPS-4, respectively. Both figures show that the processing time is almost stable as the number of records grows.

Fig. 5. Average processing time per tuple by No. of records.

Fig. 6 presents the information loss of each method for increasingly larger subsets of the datasets. The information loss is almost stable despite the variance of the number of records, because the information loss is already stabilized in the initial stage of the anonymization.

Fig. 6. Information loss by No. of records.

6.2. Effects of l

Fig. 7 shows how the average processing time per tuple and the information loss change according to the value of l. We chose 5, 10, 15, 20, and 25 as the various values of l. The average processing time per tuple decreases as l grows, because the computation time is reduced as a larger l generates a smaller number of groups. In contrast, the information loss slightly increases as l grows, because the proportion of counterfeits among all sensitive values increases. This is also explained by Eq. (6).

Fig. 7. Effect of l.
6.3. Sensitive attribute uncertainty

Fig. 8 illustrates the sensitive attribute uncertainty (SAU). The x-axis represents the time, which corresponds to the order of the tuple input. As the l value increases, SAU tends to increase. The reason is that a larger l value leads to the generation of more counterfeit values in the anonymization process. In Fig. 8a, SAU is unstable before T reaches about 7500. This means that sufficient sensitive values are prepared for the late validation method after T reaches about 7500. In Fig. 8b, SAU is stabilized more gradually than in Fig. 8a. The reason is that late validation occurs more frequently where the number of distinct sensitive values is smaller. We would like to note that the NPS dataset has 4848 distinct sensitive values, while the Adult dataset has 30.

Fig. 8. Sensitive attribute uncertainty.

Without the late validation method, rt[count](T) is always 1. Then, we can easily predict that SAU would be nearly 1, because Eq. (8) would change to

\frac{\sum_{st \in ST} st[count](T) - \sum_{rt \in RT} 1}{\sum_{st \in ST} st[count](T)}

However, our experiments show SAU values between 0.05 and 0.2, depending on the l value (after the values become stable). This means that the number of counterfeit sensitive values is significantly reduced by the late validation method.

6.4. Example scenarios

In order to illustrate the usefulness of the anonymized data streams produced by DF, let us present a scenario of data analysis. In the following scenarios, we assume two types of data streams (i.e., a demographic data stream and a medical data stream), which continue for 10 h (9:00-19:00). The data streams were generated from the Adult dataset and the NPS dataset, respectively. We anonymized the data streams using the proposed method (l = 10), then analyzed the anonymized data streams through monitoring queries. The queries were executed every 6 min, thus a total of one hundred monitoring queries were conducted over the 10 h. We further compare the monitoring results of the anonymized data streams with those of the original data streams.

6.4.1. Monitoring occupational information over anonymized demographic data streams

Let us assume that there are researchers who study occupational information from demographic data streams. The researchers execute monitoring queries to analyze the data streams. The SI (i.e., salary-occupation) and two QIs (i.e., sex and marital) are used in the monitoring queries. The queries are described as follows:

- Query D1: SELECT count(*) FROM demographic data stream WHERE salary-occupation = {<=50K, sales} every 6 min.
- Query D2: SELECT count(*) FROM demographic data stream WHERE salary-occupation = {<=50K, sales} and sex = female every 6 min.
- Query D3: SELECT count(*) FROM demographic data stream WHERE salary-occupation = {<=50K, sales} and sex = female and marital = divorced every 6 min.

Fig. 9 shows the results of the analysis queries based on the original data stream and the data stream anonymized by DF. In each figure, the result of the anonymized data stream shows phases similar to those of the original data stream, although there are trivial errors between them. In Fig. 9a-c, the average errors are 2.59, 1.47, and 0.77, respectively.

Fig. 9. Results of the analysis queries on the demographic data streams.
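For completeness, the sketch below shows one plausible way such a monitoring count query can be answered over the anonymized release: join QIT and ST on groupID and weight each matching group by the share of its counts that carry the requested sensitive value. This estimator is our own illustration and is not specified in the paper; the miniature data and value strings are made up for the example.

def estimate_count(qit_rows, st_rows, qi_pred, target_si):
    """Estimate SELECT count(*) ... WHERE <qi_pred> AND si = target_si on anonymized data.

    qit_rows: list of (groupID, QI); st_rows: list of (groupID, si, count).
    Each QIT row matching the QI predicate contributes the fraction of its group's
    counts that belong to target_si (the analyst cannot see which value is real).
    """
    totals, hits = {}, {}
    for gid, si, cnt in st_rows:
        totals[gid] = totals.get(gid, 0) + cnt
        if si == target_si:
            hits[gid] = hits.get(gid, 0) + cnt
    return sum(hits.get(gid, 0) / totals[gid]
               for gid, qi in qit_rows if qi_pred(qi))

# a miniature version of Query D2: female tuples with the target salary-occupation value
qit = [(1, ("female", "divorced")), (2, ("male", "married"))]
st = [(1, "<=50K,sales", 1), (1, ">50K,exec", 1), (2, "<=50K,sales", 2)]
print(estimate_count(qit, st, lambda qi: qi[0] == "female", "<=50K,sales"))   # 0.5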
6.4.2. Monitoring epidemic diseases over anonymized medical data streams

Hospitals generate medical records whenever patients receive medical treatments. The sequence of medical records composes an electronic health data stream. In this environment, let us assume a feasible scenario. For research purposes, the Ministry of Health and Welfare in Korea collects the data streams from hospitals in the entire country. Researchers of the government analyze the data streams to monitor the occurrence of epidemic diseases in real time. The SI (i.e., disease) and two QIs (i.e., location of the hospital and sex) are used in the monitoring queries. The queries for the analysis are described as follows (nncd is the name set of diseases legally designated as epidemics):

- Query M1: SELECT count(*) FROM medical data stream WHERE disease in nncd every 6 min.
- Query M2: SELECT count(*) FROM medical data stream WHERE disease in nncd and location = Seoul every 6 min.
- Query M3: SELECT count(*) FROM medical data stream WHERE disease in nncd and location = Seoul and sex = Male every 6 min.

Fig. 10 shows the results of the analysis queries based on the original data stream and the data stream anonymized by DF. In each figure, the result of the anonymized data stream shows phases similar to those of the original data stream, although there are trivial errors between them. In Fig. 10a-c, the average errors are 0.58, 0.47, and 0.42, respectively.

Fig. 10. Results of the analysis queries on the medical data streams.

7. Conclusion

In this paper, we presented a delay-free anonymization method for preserving the privacy of electronic health data streams. The proposed method does not generate a significant delay during the anonymization process and the release of data streams. Since each QI has l-diverse sensitive values, the privacy of the data is preserved with a probability of 1/l. Late validation significantly decreases the number of counterfeit values, and hence, the anonymization result has high data utility. Our method overcomes the drawbacks of existing accumulation-based methods in real-time data stream anonymization.

Several future studies can be considered as extensions of the study reported in this paper. First, we will devise a method for generating counterfeit and count values. If the results of statistical analysis or past data are considered for generating the values, more reliable anonymization results can be realized. Second, it would be meaningful to investigate the timing of the late validation. In this paper, we focus on matching input tuples and counterfeits for the late validation. The effectiveness can be improved if the elapsed time of the counterfeits is considered. Additionally, we intend to employ the Slicing [23] technique instead of the Anatomy [17] technique for DF anonymization. We predict that the correlation between the QIs and the sensitive attributes will be preserved by the Slicing technique. Finally, we would like to design and study a distributed version of our framework. As the volume of electronic health data streams is rapidly growing, data streams increasingly need to be processed in a distributed environment.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2013-022884).

References

[1] Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM; 2002. p. 1-16.
[2] Abadi DJ, Carney D, Cetintemel U, Cherniack M, Convey C, Lee S, et al. Aurora: a new model and architecture for data stream management. Int J Very Large Data Bases 2003;12(2):120-39.
[3] Gaber MM, Zaslavsky A, Krishnaswamy S. Mining data streams: a review. ACM Sigmod Record 2005;34(2):18-26.
[4] Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavalda R. New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2009. p. 139-48.
[5] Sweeney L. k-anonymity: a model for protecting privacy. Int J Uncertain, Fuzz Knowl-Based Syst 2002;10(05):557-70.
[6] Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M. l-diversity: privacy beyond k-anonymity. ACM Trans Knowl Discov Data (TKDD) 2007;1(1):3.
[7] Cao J, Carminati B, Ferrari E, Tan K-L. Castle: continuously anonymizing data streams. IEEE Trans Dependable Secure Comput 2011;8(3):337-52.
[8] Zhou B, Han Y, Pei J, Jiang B, Tao Y, Jia Y. Continuous privacy preserving publishing of data streams. In: Proceedings of the 12th international conference on extending database technology: advances in database technology. ACM; 2009. p. 648-59.
[9] Li J, Ooi BC, Wang W. Anonymizing streaming data for privacy protection. In: Proceedings of the IEEE 24th international conference on data engineering. IEEE; 2008. p. 1367-9.
[10] Wang P, Lu J, Zhao L, Yang J. B-castle: an efficient publishing algorithm for k-anonymizing data streams. In: Proceedings of the 2010 second WRI global congress on intelligent systems (GCIS), vol. 2; 2010. p. 132-6.
[11] Wang P, Zhao L, Lu J, Yang J. Sanatomy: privacy preserving publishing of data streams via anatomy. In: Proceedings of the 2010 third international symposium on information processing (ISIP). IEEE; 2010. p. 54-7.
[12] Zhang J, Yang J, Zhang J, Yuan Y. Kids: k-anonymization data stream based on sliding window. In: Proceedings of the 2010 2nd international conference on future computer and communication (ICFCC), vol. 2; 2010. p. V2-311-6.
[13] Zakerzadeh H, Osborn SL. Faanst: fast anonymizing algorithm for numerical streaming data. In: Data privacy management and autonomous spontaneous security. Springer; 2011. p. 36-50.
[14] Cao J, Karras P, Kalnis P, Tan K-L. Sabre: a sensitive attribute bucketization and redistribution framework for t-closeness. VLDB J 2011;20(1):59-81.
[15] LeFevre K, DeWitt DJ, Ramakrishnan R. Incognito: efficient full-domain k-anonymity. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data. ACM; 2005. p. 49-60.
[16] Samarati P. Protecting respondents' identities in microdata release. IEEE Trans Knowl Data Eng 2001;13(6):1010-27.
[17] Xiao X, Tao Y. Anatomy: simple and effective privacy preservation. In: Proceedings of the 32nd international conference on very large data bases. VLDB Endowment; 2006. p. 139-50.
[18] Li N, Li T, Venkatasubramanian S. t-closeness: privacy beyond k-anonymity and l-diversity. In: Proceedings of the IEEE international conference on data engineering; 2007. p. 106-15.
[19] Byun J-W, Sohn Y, Bertino E, Li N. Secure anonymization for incremental datasets. In: Secure data management. Springer; 2006. p. 48-63.
[20] Xiao X, Tao Y. M-invariance: towards privacy preserving re-publication of dynamic datasets. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data. ACM; 2007. p. 689-700.
[21] Bu Y, Fu AWC, Wong RCW, Chen L, Li J. Privacy preserving serial data publishing by role composition. Proceedings of the VLDB Endowment 2008;1(1):845-56.
[22] Iyengar VS. Transforming data to satisfy privacy constraints. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2002. p. 279-88.
[23] Li T, Li N, Zhang J, Molloy I. Slicing: a new approach for privacy preserving data publishing. IEEE Trans Knowl Data Eng 2012;24(3):561-74.