Sovereign Information Sharing, Searching and Mining Rakesh Agrawal IBM Almaden Research Center.


Page 1

Sovereign Information Sharing, Searching and Mining

Rakesh Agrawal

IBM Almaden Research Center

Page 2

Thesis

Organizational boundaries are blurring in the emerging networked economy

– Compete and co-operate simultaneously

– Int’l value chain

Need to rethink information sharing, searching, and mining in the brave new world of virtual organizations

Page 3

Separate databases due to statutory, competitive, or security reasons.

Selective, minimal sharing on need-to-know basis.

Example: Among those who took a particular drug, how many had an adverse reaction and have DNA that contains a specific sequence?

Researchers must not learn anything beyond counts.

Commutative Encryption: E1(E2(T)) = E2(E1(T))
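The commutative property can be checked with a small sketch. The slide does not say which cipher is used; the example assumes modular exponentiation, E_k(t) = t^k mod p, which is one well-known function with this property. The modulus, the keys, and the `encrypt` helper are illustrative assumptions, and the toy parameters are not a secure choice.

```python
# Assumption: E_k(t) = t^k mod P. This commutes because
# (t^k1)^k2 = (t^k2)^k1 = t^(k1*k2) mod P.
P = 2**127 - 1  # a Mersenne prime, used here only for illustration


def encrypt(key: int, value: int) -> int:
    """Hypothetical E_key(value): modular exponentiation."""
    return pow(value, key, P)


k1, k2 = 65537, 257  # toy secret keys of the two parties
t = 123456789        # a value T already mapped to an integer in [1, P-1]

# The order in which the two parties encrypt does not matter.
assert encrypt(k1, encrypt(k2, t)) == encrypt(k2, encrypt(k1, t))
```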

Minimal Necessary Sharing

[Figure: R = {a, u, v, x} and S = {b, u, v, y}, so R ∩ S = {u, v}.]

R must not know that S has b & y.

S must not know that R has a & x.

Count(R ∩ S): R & S do not learn anything except that the result is 2.
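The count in this example can be reproduced with a single-process simulation. This is a sketch under stated assumptions rather than the protocol from the talk: values are hashed with SHA-256, encrypted by modular exponentiation (an assumed commutative cipher), and shuffled before each exchange. The names `to_group`, `encrypt`, and `intersection_size`, the modulus, and the keys are all hypothetical.

```python
import hashlib
import math
import random

P = 2**127 - 1  # toy prime modulus (illustrative, not a secure parameter)


def to_group(value: str) -> int:
    """Hash a value to an integer in [1, P-1] (hypothetical helper)."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(key: int, x: int) -> int:
    """Assumed commutative cipher: x^key mod P."""
    return pow(x, key, P)


def intersection_size(r_values, s_values, key_r, key_s):
    # Keys coprime to P-1 make encryption a bijection, so distinct
    # hashed values never collide after encryption.
    assert math.gcd(key_r, P - 1) == 1 and math.gcd(key_s, P - 1) == 1

    # Each party encrypts its own hashed values and sends them shuffled.
    r_once = [encrypt(key_r, to_group(v)) for v in r_values]
    s_once = [encrypt(key_s, to_group(v)) for v in s_values]
    random.shuffle(r_once)
    random.shuffle(s_once)

    # Each party adds its key to the other side's ciphertexts.
    r_twice = {encrypt(key_s, c) for c in r_once}
    s_twice = {encrypt(key_r, c) for c in s_once}

    # Equal plaintexts yield equal double encryptions, so only the
    # number of matches is revealed by the comparison.
    return len(r_twice & s_twice)


R = ["a", "u", "v", "x"]
S = ["b", "u", "v", "y"]
print(intersection_size(R, S, key_r=65537, key_s=257))
```

In this simulation both key holders live in one process; in a real deployment each side would see only the other side's ciphertexts, which is what keeps a, x, b, and y hidden.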

Sovereign Information Sharing

SIGMOD 00

Page 4

Privacy Preserving Data Mining

[Figure: each record (e.g., 30 | 70K | ... and 50 | 40K | ...) passes through a Randomizer before it is collected (e.g., 65 | 20K | ... and 25 | 60K | ...; the label 30+35 marks an age of 30 with 35 added). The distributions of Age and Salary are reconstructed from the randomized values and passed to Data Mining Algorithms, which produce the Data Mining Model. Labels: Alice’s age, Alice’s salary, Bob’s age.]

[Figure: two charts comparing Original, Randomized, and Reconstructed series; the x-axis of the second chart is Randomization Level.]

Insight: Preserve privacy at the individual level, while still building accurate data mining models at the aggregate level.

Add random noise to individual values to protect privacy.

EM algorithm to estimate original distribution of values given randomized values + randomization function.

Algorithms for building classification models and discovering association rules on top of privacy-preserved data with only small loss of accuracy.
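The randomize-then-reconstruct idea can be sketched as follows. The sketch makes several assumptions that are not on the slide: noise is uniform on [-noise_width, +noise_width], original values are summarized by equal-width bins represented by their midpoints, and the estimate is refined with an iterative Bayes-style update, one common way to write an EM-like estimate for this setting. The function names and the synthetic ages are hypothetical.

```python
import random


def randomize(values, noise_width, rng):
    """Add uniform noise from [-noise_width, +noise_width] to each value."""
    return [v + rng.uniform(-noise_width, noise_width) for v in values]


def reconstruct(randomized, noise_width, midpoints, iterations=30):
    """Estimate the share of original values falling in each bin."""
    def noise_density(delta):
        return 1.0 / (2 * noise_width) if abs(delta) <= noise_width else 0.0

    probs = [1.0 / len(midpoints)] * len(midpoints)  # start from uniform
    for _ in range(iterations):
        updated = [0.0] * len(midpoints)
        for w in randomized:
            # Posterior weight of each bin given the randomized value w.
            weights = [noise_density(w - m) * p for m, p in zip(midpoints, probs)]
            total = sum(weights)
            if total == 0.0:
                continue  # no bin explains w under the binned model
            for j, weight in enumerate(weights):
                updated[j] += weight / total
        norm = sum(updated)
        if norm == 0.0:
            break
        probs = [u / norm for u in updated]
    return probs


rng = random.Random(0)
# Synthetic ages (an assumption for the demo): two clusters.
ages = [rng.uniform(20, 30) for _ in range(500)]
ages += [rng.uniform(60, 70) for _ in range(500)]
noisy_ages = randomize(ages, noise_width=20, rng=rng)

midpoints = [5 + 10 * i for i in range(10)]  # bins 0-10, 10-20, ..., 90-100
estimate = reconstruct(noisy_ages, noise_width=20, midpoints=midpoints)
for m, p in zip(midpoints, estimate):
    print(f"age bin around {m}: estimated share {p:.2f}")
```

Only `noisy_ages` and the noise distribution are used by `reconstruct`; the individual original ages are never consulted, which is the point of the approach.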

SIGMOD 00

Page 5

Finessing Schema Chaos

[Figure: two plots with both axes running from 0 to 50, and a chart of Accuracy (0 to 100) against Query Size (1, 2, 3, 4, 5, 7) with the series Non-Reflectivity and Randomized Non-Reflectivity.]

Use a simple regular expression extractor to get numbers.

Do simple data extraction to get hints.

– Hint for unit: the word following the number.

– Hint for attribute name: the k words following the number.

– Example: 256 MB SDRAM memory. Unit hint: MB. Attribute hints: SDRAM, Memory.

Use only numbers in the queries.

Treat any attribute name in the query also as a hint.

Reflectivity estimates accuracy.

WWW 03
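The extraction step on this slide can be sketched with a short regular-expression tokenizer. The reading used here is an assumption: the first word after a number is taken as the unit hint, and the next k words (stopping at the next number) are taken as attribute hints. The `extract_hints` function and its default k = 2 are hypothetical.

```python
import re

# A token is either a number (optionally with a decimal part) or a word.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+")


def extract_hints(text: str, k: int = 2):
    """Return (number, unit_hint, attribute_hints) for each number in text."""
    tokens = TOKEN.findall(text)
    results = []
    for i, token in enumerate(tokens):
        if not token[0].isdigit():
            continue
        # Collect up to 1 + k words that follow, stopping at the next number.
        following = []
        for nxt in tokens[i + 1 : i + 2 + k]:
            if nxt[0].isdigit():
                break
            following.append(nxt)
        unit = following[0] if following else None
        results.append((float(token), unit, following[1:]))
    return results


print(extract_hints("256 MB SDRAM memory"))
# [(256.0, 'MB', ['SDRAM', 'memory'])]
```

Because the tokenizer splits digits from letters, a string such as "256MB" yields the same number and unit hint as "256 MB".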

Page 6

Privacy Preserving Indexing

A public mapping function that maps a query to a set of providers P that may contain the desired document

P contains false positives

Providers return a document only if the searcher is authorized to access the document
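A toy version of such an index is sketched below. The construction is an assumption made for illustration: providers are placed in groups, and a term maps to every member of any group in which at least one member holds it, so the public index only says which providers may hold a match. Access control is then enforced by each provider. The data layout and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(documents, groups):
    """Map each term to all providers in any group where some member has it."""
    index = defaultdict(set)
    for group in groups:
        group_terms = set()
        for provider in group:
            for terms, _authorized in documents[provider].values():
                group_terms |= terms
        for term in group_terms:
            index[term].update(group)  # the whole group is listed
    return index


def search(index, documents, searcher, term):
    """Ask every candidate provider; each one applies its own access check."""
    candidates = index.get(term, set())
    hits = []
    for provider in sorted(candidates):
        for doc_id, (terms, authorized) in documents[provider].items():
            if term in terms and searcher in authorized:
                hits.append((provider, doc_id))
    return candidates, hits


# provider -> doc id -> (terms in the document, users allowed to read it)
documents = {
    "p1": {"d1": ({"merger", "budget"}, {"alice"})},
    "p2": {"d2": ({"recipe"}, {"alice", "bob"})},
    "p3": {"d3": ({"merger"}, {"bob"})},
    "p4": {"d4": ({"budget"}, {"alice"})},
}
groups = [["p1", "p2"], ["p3", "p4"]]

index = build_index(documents, groups)
candidates, hits = search(index, documents, "alice", "merger")
print(sorted(candidates))  # ['p1', 'p2', 'p3', 'p4']
print(hits)                # [('p1', 'd1')]
```

Here p2 and p4 are listed for "merger" although they hold no such document, so the index alone does not reveal which provider in a group has it, and alice never receives d3 because p3 does not authorize her.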

VLDB 03

Page 7

Some Interesting Topics

Current integration approaches do not scale

– Information integration per se is not interesting

– Static vs. dynamic plumbing

Incentive compatibility

Auditing interactions