Sovereign Information Sharing and Mining in a Connected World
R. Agrawal
Intelligent Information Systems Research
IBM Almaden Research Center, San Jose, CA 95120
Joint Work with: D. Asonov, P. Baliga, A. Evfimievski, L. Liang, B. Porst, R. Srikant
Outline
– Information sharing today
– The new world
– Some solution approaches
– Observations on privacy-preserving data mining
– Musings about the future
R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. SIGMOD 03.
R. Agrawal, D. Asonov, R. Srikant. Enabling Sovereign Information Sharing Using Web Services. SIGMOD 04 (Industrial Track).
R. Agrawal, D. Asonov, P. Baliga, L. Liang, B. Porst, R. Srikant. A Reusable Platform for Building Sovereign Information Sharing Applications. DIVO 04.
Information Sharing Today
Assumption: Information in each database can be freely shared.
[Figure labels: Mediator, Federated, Centralized; each diagram shows a query Q and a result R.]
Need for a new style of information sharing
Compute queries across databases so that no more information than necessary is revealed (without using a trusted third party).
Need is driven by several trends:
– End-to-end integration of information systems across companies (virtual organizations)
– Simultaneously compete and cooperate.
– Security: need-to-know information sharing
Security Application
Security Agency finds those passengers who are in its list of suspects, but not the names of other passengers.
Airline does not find anything.
[Figure: Agency (Suspect List) and Airline (Passenger List).]
http://www.informationweek.com/story/showArticle.jhtml?articleID=184010%79
Epidemiological Research
Validate hypothesis between adverse reaction to a drug and a specific DNA sequence.
Researcher should not learn anything beyond 4 counts:
                    Adverse Reaction    No Adv. Reaction
Sequence Present    ?                   ?
Sequence Absent     ?                   ?
[Figure: Medical Research Inst. querying two databases, DNA Sequences and Drug Reactions.]
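
To make the target output concrete, the four counts form a 2x2 contingency table. The sketch below computes them in the clear from made-up data; the function name, the patient IDs, and both sets are hypothetical. It is not a privacy-preserving protocol, only an illustration of what the researcher is allowed to see at the end.

```python
def four_counts(patients, has_sequence, had_reaction):
    """Return the 2x2 table as a dict keyed by (sequence_present, adverse_reaction)."""
    table = {(True, True): 0, (True, False): 0,
             (False, True): 0, (False, False): 0}
    for pid in patients:
        table[(pid in has_sequence, pid in had_reaction)] += 1
    return table


# Hypothetical toy data: all patient IDs are invented for illustration.
patients = {1, 2, 3, 4, 5, 6}
has_sequence = {1, 2, 3}   # would live in the DNA sequence database
had_reaction = {2, 3, 4}   # would live in the drug reaction database

counts = four_counts(patients, has_sequence, had_reaction)
for key in sorted(counts, reverse=True):
    print(key, counts[key])
```

In the sovereign setting, the two sets sit with different owners, and only the four integers in `counts` should reach the researcher.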
Minimal Necessary Sharing
R = {a, u, v, x}
S = {b, u, v, y}
R ∩ S = {u, v}
– R must not know that S has b & y
– S must not know that R has a & x
Count(R ∩ S): R & S do not learn anything except that the result is 2.
Problem Statement: Minimal Sharing
Given:
– Two parties (honest-but-curious): R (receiver) and S (sender)
– Query Q spanning the tables R and S
– Additional (pre-specified) categories of information I
Compute the answer to Q and return it to R without revealing any additional information to either party, except for the information contained in I
– For example, in the upcoming intersection protocols I = { |R| , |S| }
A Possible Approach
Secure Multi-Party Computation
– Given two parties with inputs x and y, compute f(x,y) such that the parties learn only f(x,y) and nothing else.
– Can be solved by building a combinatorial circuit, and simulating that circuit [Yao86].
Prohibitive cost for database-size problems.
– Intersection of two relations of a million records each would require 144 days (Yao's protocol)
Intersection Protocol
[Figure: R holds the set R and secret key a; S holds the set S and secret key b. S computes f_b(S).]
f_b(S) is shorthand for { f_b(s) | s ∈ S }
Commutative encryption: f_a(f_b(s)) = f_b(f_a(s))
f(s, b, p) = s^b mod p
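
The commutative property of the power function can be checked directly with Python's built-in three-argument `pow`. This is a minimal sketch: the modulus (the Mersenne prime 2^127 - 1), the two keys, and the sample value are all assumptions chosen for illustration, not parameters from the talk.

```python
P = 2**127 - 1  # toy prime modulus, assumed for illustration only


def f(s: int, key: int, p: int = P) -> int:
    """Power function from the slide: f(s, key, p) = s^key mod p."""
    return pow(s, key, p)


a, b = 65537, 257   # hypothetical secret keys of R and S
s = 123456789       # hypothetical value, assumed already mapped to an integer

s_then_r = f(f(s, b), a)   # S encrypts first, then R
r_then_s = f(f(s, a), b)   # R encrypts first, then S
print(s_then_r == r_then_s)
```

Both orders give s^(a*b) mod p, which is why the two parties can compare doubly encrypted values without either of them revealing a key.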
Intersection Protocol
[Figure: S sends f_b(S) to R. R applies its key a to obtain f_a(f_b(S)), which equals f_b(f_a(S)) by the commutative property.]
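Intersection Protocol
[Figure: R sends f_a(R) to S. S returns the pairs {< f_a(r), f_b(f_a(r)) >}. Since R knows < r, f_a(r) >, it obtains < r, f_b(f_a(r)) > and can check each f_b(f_a(r)) against the set f_b(f_a(S)) it already holds: r is in R ∩ S exactly when the value appears there.]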
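
The three protocol slides can be put together in a single-process sketch. Everything below is an illustrative assumption rather than the authors' implementation: the toy 127-bit prime, the SHA-256 based mapping of values to integers, the key generation, and all function names. Both parties run in one function here, whereas in practice each step would run on its owner's machine, and a real deployment would need a carefully chosen group and much larger parameters.

```python
import hashlib
import secrets
from math import gcd

P = 2**127 - 1  # toy prime modulus (assumption; far too small for real use)


def to_group(value: str) -> int:
    """Map a value to an integer in [1, P-1] via SHA-256 (hypothetical choice)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def keygen() -> int:
    """Pick a key coprime to P-1 so that x -> x^key mod P is one-to-one."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if gcd(k, P - 1) == 1:
            return k


def f(x: int, key: int) -> int:
    return pow(x, key, P)


def intersect(r_values, s_values):
    """Return the members of r_values that also occur in s_values."""
    a, b = keygen(), keygen()              # secret keys of R and S

    fb_s = {f(to_group(s), b) for s in s_values}          # S -> R: f_b(S)
    fa_fb_s = {f(y, a) for y in fb_s}                     # R: f_a(f_b(S))

    fa_r = {r: f(to_group(r), a) for r in r_values}       # R keeps <r, f_a(r)>
    pairs = {y: f(y, b) for y in fa_r.values()}           # S -> R: <f_a(r), f_b(f_a(r))>

    # R matches f_b(f_a(r)) against f_a(f_b(S)); equal values mean r is shared.
    return {r for r, y in fa_r.items() if pairs[y] in fa_fb_s}


result = intersect({"a", "u", "v", "x"}, {"b", "u", "v", "y"})
print(sorted(result))
```

Using Python sets for the exchanged values also discards ordering, so the position of an encrypted value carries no information about the underlying record in this sketch.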
Related Work
[Naor & Pinkas 99]: Two protocols for list intersection problem
– Oblivious evaluation of n polynomials of degree n each.
– Oblivious evaluation of n^2 linear polynomials.
[Huberman et al 99]: find people with common preferences, without revealing the preferences.
– Intersection protocols are similar
[Clifton et al, 03]: Secure set union and set intersection
– Similar protocols
Implementation: Grid of Data Services
[Figure: SIS Platform. Components shown: Data Providers 1 to n, each with a DP DB Server, meta data, and an SIS Server; an Application with an SIS Client and Client Metadata; a User and an Application Developer.]
Figure callouts:
– Constructs web service query requests against multiple data providers, and collects responses.
– Mapping information and data provider access information.
– Thin layer on top of the SIS client: invokes the required SIS operations, provides an interface to a SIS user.
– Includes view information to retrieve data from the data provider database, database access information, and context information.
– Provides the necessary functionality on the data provider side to enable sovereign sharing.
– Templates to aid application development
System Issues
How does the application developer find the necessary data sources and their schemas? (resource discovery mechanism)
• Employ a UDDI registry to store and search
– data providers and operations they support
– available schemas for each data provider
How does the application developer link the data between different providers? (schema mapping mechanism)
• Data providers publish schemas in their own vocabularies.
• Developers link the schemas.
How to ensure that only eligible users can carry out the computation? (authentication mechanism)
• Authentication across multiple domains
Implementation Environment
Data resides in DB2 v.8.1 database systems, installed on 2.4GHz/512MB RAM Intel workstations, connected by a 100Mbit LAN.
Web services run on top of the IBM WebSphere Application Server v.5.0 and use the Apache AXIS v.1.1 SOAP library for messaging.
IBM private UDDI registry installed on one of the machines.
Performance
Exponentiation time for one number (Intel P3)
Implementation                       ms
Java program                         32
Java DB2 UDF                         33-34
MS Visual C++ (Crypto++ library)     65
Making Encryption Faster: Software Approaches
The main component of encryption is exponentiation: enc(x, k, p) = x^k mod p
Tried custom implementations of exponentiation that used preprocessing based on
– fixed exponent (k)
– fixed base (x)
Fixed exponent implementation turned out to be slower than the Java native implementation
Fixed base is beneficial if the same value is encrypted multiple times with different keys (not useful for intersection where each value is encrypted once)
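
To illustrate why fixed-base preprocessing only pays off when one value is encrypted under many keys, here is a sketch of one simple form of it: precompute the repeated squares of the base once, then each exponentiation needs only multiplications. The slides do not say which preprocessing method was actually tried, so the technique, the function names, and the parameters are assumptions.

```python
def precompute_powers(base: int, p: int, bits: int) -> list:
    """Table of base^(2^i) mod p for i = 0 .. bits-1, computed once per base."""
    table = []
    current = base % p
    for _ in range(bits):
        table.append(current)
        current = current * current % p
    return table


def fixed_base_pow(table: list, k: int, p: int) -> int:
    """Compute base^k mod p using only multiplications of precomputed squares."""
    if k < 0 or k.bit_length() > len(table):
        raise ValueError("exponent does not fit the precomputed table")
    result = 1
    i = 0
    while k:
        if k & 1:
            result = result * table[i] % p
        k >>= 1
        i += 1
    return result


p = 2**127 - 1                       # toy prime, assumed for illustration
x = 987654321                        # the fixed base (value being encrypted)
table = precompute_powers(x, p, bits=127)

keys = [65537, 2**100 + 7, 123456789012345678901234567890]  # hypothetical keys
ciphertexts = [fixed_base_pow(table, k, p) for k in keys]
print(ciphertexts == [pow(x, k, p) for k in keys])
```

The table costs one squaring per exponent bit, about the same number of squarings as a single ordinary exponentiation, so it is wasted effort when each value is encrypted only once, as in the intersection protocol.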
Making Encryption Faster: Hardware Accelerator
Use SSL card to speed up exponentiation
Multiple threads (100+) must post exponentiation requests simultaneously to the card API to get the advertised speed-up
AEP scheduler distributes exponentiation requests between multiple cards automatically; linear speed-up
Example: AEP SSL CARD Runner 2000 ≈ $2k
Execution time: Encryption UDF
Encryption Engine         Number of rows in the table
                          1,000     5,000     10,000
CPU Intel III 2.0 GHz     34 s      175 s     320 s
AEP Runner 2000           3.5 s     19 s      37 s
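
The speed-up implied by the table can be worked out directly from its two rows. The snippet below only divides the reported timings; the variable names are arbitrary.

```python
rows = [1_000, 5_000, 10_000]
cpu_seconds = [34, 175, 320]      # CPU row of the table
card_seconds = [3.5, 19, 37]      # AEP Runner 2000 row of the table

# Ratio of CPU time to accelerator time for each table size.
speedups = [round(cpu / card, 1) for cpu, card in zip(cpu_seconds, card_seconds)]

for n, s in zip(rows, speedups):
    print(f"{n:>6} rows: {s}x faster on the card")
```

For these three table sizes the ratios are 9.7, 9.2, and 8.6, so the card is roughly nine times faster than the CPU in this measurement.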
Application Performance
Encryption speed is 20K encryptions per minute using one accelerator card ($2K per card)
Airline application: 150,000 (daily) passengers and 1 million people in the watch list:
– 120 minutes with one accelerator card
– 12 minutes with ten accelerator cards
Epidemiological research: 1 million patient records in the hospital and 10 million records in the Genebank:
– 37 hours with one accelerator card
– 3.7 hours with ten accelerator cards
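
A rough consistency check on these figures is possible with simple arithmetic. The slides do not state how many encryptions each application performs, so the sketch below rests on an assumption: that the intersection protocol exponentiates every value twice, once under each party's key, giving 2(|R| + |S|) encryptions in total, all processed at the stated 20K per minute per card with linear scaling across cards. The function names are hypothetical.

```python
RATE_PER_MINUTE = 20_000   # encryptions per minute per card, from the slide


def minutes_needed(encryptions: int, cards: int = 1) -> float:
    """Time to process the encryptions, assuming linear scaling across cards."""
    return encryptions / (RATE_PER_MINUTE * cards)


def implied_encryptions(minutes: float, cards: int = 1) -> float:
    """Number of encryptions a reported running time corresponds to."""
    return minutes * RATE_PER_MINUTE * cards


# Airline application, ASSUMING 2 * (|R| + |S|) encryptions in total.
airline_ops = 2 * (150_000 + 1_000_000)
airline_one_card = minutes_needed(airline_ops)
airline_ten_cards = minutes_needed(airline_ops, cards=10)

# Epidemiological application: work backwards from the reported 37 hours.
epi_ops = implied_encryptions(37 * 60)

print(airline_one_card, airline_ten_cards, epi_ops)
```

Under that assumption the airline case comes to 115 minutes on one card and 11.5 minutes on ten, close to the reported 120 and 12. The reported 37 hours corresponds to about 44.4 million encryptions at the stated rate; the slides do not break that number down.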
Current Work
Use of secure coprocessors to address
– Richer join operations
– Performance
– Semi-dishonesty
Incentive compatibility and auditing to address maliciousness
[Figure: IBM 4764 cryptographic coprocessor]
Privacy Preserving Data Mining: The Randomization Approach
To hide original values x_1, x_2, ..., x_n
– from probability distribution X (unknown)
we use y_1, y_2, ..., y_n
– from probability distribution Y
Problem: Given
– x_1+y_1, x_2+y_2, ..., x_n+y_n
– the probability distribution of Y
Estimate the probability distribution of X.
Use the estimated distribution of X to build the classification model
Extended subsequently to mining association rules while preserving the privacy of individual transactions
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. SIGMOD 00.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Privacy Preserving Mining of Association Rules. SIGKDD 02.
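
The estimation problem stated above can be sketched as an iterative Bayesian update over a discretized domain: start from a uniform guess for X and repeatedly reweight it by how well each bin explains the observed sums. This is a sketch in the spirit of the reconstruction idea, not the algorithm from the cited papers as published. The Gaussian noise, the ten bins, the iteration count, the synthetic data, and all names are assumptions.

```python
import math
import random


def reconstruct(observed, noise_pdf, midpoints, iterations=50):
    """Estimate the probability mass of X in each bin from w_i = x_i + y_i."""
    m = len(midpoints)
    estimate = [1.0 / m] * m                       # uniform starting guess
    # likelihood[i][j] = f_Y(w_i - a_j): noise density if x_i sat at midpoint a_j
    likelihood = [[noise_pdf(w - a) for a in midpoints] for w in observed]
    for _ in range(iterations):
        updated = [0.0] * m
        for row in likelihood:
            weights = [l * e for l, e in zip(row, estimate)]
            total = sum(weights)
            if total == 0.0:                       # observation explained by no bin
                continue
            for j, weight in enumerate(weights):
                updated[j] += weight / total       # posterior share of bin j
        norm = sum(updated)
        estimate = [u / norm for u in updated]
    return estimate


SIGMA = 0.2  # assumed standard deviation of the noise distribution Y


def gaussian_pdf(y):
    return math.exp(-0.5 * (y / SIGMA) ** 2) / (SIGMA * math.sqrt(2 * math.pi))


rng = random.Random(0)
# Synthetic hidden values: two clusters, around 0.2 and around 0.8.
xs = [rng.uniform(0.1, 0.3) if rng.random() < 0.5 else rng.uniform(0.7, 0.9)
      for _ in range(1000)]
observed = [x + rng.gauss(0, SIGMA) for x in xs]   # only these are disclosed

midpoints = [0.05 + 0.1 * j for j in range(10)]    # ten bins covering [0, 1]
estimate = reconstruct(observed, gaussian_pdf, midpoints)
print([round(e, 3) for e in estimate])
```

Only `observed` and the noise density enter `reconstruct`; the hidden `xs` are used solely to generate test data. The returned estimate can be compared against a histogram of `xs` to judge how much of the original shape is recovered.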
Distributed Setting
Application scenario: A central server interested in building a data mining model using data obtained from a large number of clients, while preserving their privacy
– Web-commerce, e.g. recommendation service
Desiderata:
– Must not slow down the speed of client interaction
– Must scale to very large number of clients
During the application phase
– Ship model to the clients
– Use oblivious computations
Implication:
– Action taken to preserve privacy of a record must not depend on other records
– Fast, per-transaction perturbation (potential loss in accuracy)
Inter-Enterprise Setting
A party has access to all the records in its database
– Considerable increase in available options
Cryptographic approaches
– Lindell & Pinkas [Crypto 2000]
– Purdue Toolkit [Clifton et al 2003]
Global approaches (e.g. swapping) from SDC
Model combination and Voting
– Potential for leakage from individual models
Tradeoff between Generality, Performance, Accuracy, and Potential disclosure: Not well understood
Outlook
Three stages of Network era*
– Brochure stage (informational websites)
– Transaction stage (e-commerce, online banking, etc.)
– E-business on demand (integrate business processes within and with external parties; dynamic virtual organizations)
The on demand era is presenting research opportunities for discontinuous thinking
Sovereign information sharing is one such key opportunity, but challenges abound:
– Fast, scalable, and composable protocols
– New framework for thinking about ownership, privacy, and security (zero-leakage model does not scale)
*IBM. Living in an On Demand World. October 2002.