Privacy Preserving Indexing of Documents on the Network

Mayank Bawa, Roberto J. Bayardo Jr., Rakesh Agrawal

Sharing Private Content

• Rapid growth in private and semi-private information on the network
– Experimental results of drug tests
– Drafts of research papers, patents, …
– Architectural CAD documents

• Mechanisms to search this information have failed to keep pace
– Public information: Google, Yahoo!
– Private information: ???

Talk Overview

1. Content Privacy issues in sharing access-controlled content

2. Data structure for search on access-controlled content

3. Algorithm for building such a data structure

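Privacy issues in sharing access-controlled content

Provider

• Shares documents
• Enforces access policy

Example: provider P1 shares these documents, each with its access list
– Alzheimer’s Disease (Alice, Bob)
– AIDS (Alice)
– Small-Pox (Alice, Bob, Lisa, …)

[Figure: providers P1, P2, P3, P32, P2026 on the network]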

Searcher

• Wants documents that match her keyword query Q
• Has an identity

[Figure: searcher Alice, her query Q = “Amyloid Peptide”, and providers P1, P2, P3, P32, P2026]

Retrieve a Document

[Figure: Alice sends Q = “Amyloid Peptide” to providers P1, P2, P3, P32, P2026; the contacted provider checks “Alice?”]

Retrieve a Document

[Figure: George sends the same query; the contacted provider checks “George?”]

Search Process (Today)

[Figure: Alice, her query Q = “Amyloid Peptide”, and providers P1, P2, P3, P32, P2026]

Automating Search

A searcher s issues a query q expecting a set of documents d such that

1. d is shared by some provider p

2. d matches the query q

3. d is accessible to s as dictated by p’s access policy
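The three conditions above can be read as a filter over the shared documents. The following is a minimal sketch of that reading, not the system described in the talk: the `Document` class, the `matches` rule (every query keyword must appear in the text), and the sample document texts are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    provider: str        # condition 1: the provider sharing the document
    text: str            # made-up content, for illustration only
    allowed: frozenset   # identities granted access by the provider's policy


def matches(doc, query):
    """Condition 2 (assumed rule): every query keyword occurs in the text."""
    words = set(doc.text.lower().split())
    return all(keyword.lower() in words for keyword in query.split())


def search(searcher, query, documents):
    """Return documents that match the query and are accessible to the searcher."""
    return [d for d in documents
            if matches(d, query) and searcher in d.allowed]  # condition 3


docs = [
    Document("Alzheimer's Disease", "P1", "amyloid peptide plaques",
             frozenset({"Alice", "Bob"})),
    Document("AIDS", "P1", "antiretroviral therapy", frozenset({"Alice"})),
]
alice_hits = [d.title for d in search("Alice", "Amyloid Peptide", docs)]
george_hits = [d.title for d in search("George", "Amyloid Peptide", docs)]
```

Here Alice sees the matching document while George, who is not on its access list, sees nothing.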

Automating Search

[Figure: George issues Q = “Amyloid Peptide” to providers P1, P2, P3, P32, P2026; the result “(Alzheimer’s Disease, Alice)” is marked “???”]

Content Privacy

An adversary A should not be able to deduce, using the search mechanism, that provider P is sharing document d with keywords q unless A has been granted access to d by P

An access-controlled search mechanism with content privacy

Soln #1: Document Index

[Figure: a central inverted index holds provider P1’s documents and access policy; Alice issues Q = “Amyloid Peptide” and the index checks “Alice?”; George issues the same query and the index checks “George?”]

Problem: the index “knows everything”

Soln #2: Keyword Index

[Figure: provider P1 contributes only its keywords to a keyword index; Alice/George issue Q = “Amyloid Peptide” and learn that “P1 has a document with words ‘Amyloid Peptide’”]

Keyword Index

$t_i \mapsto \{\, p : t_i \in d,\ \mathrm{provider}(d) = p \,\}$

Example
Amyloid → {…, P1, …}
Peptide → {…, P1, …}

Problem cause: every term is mapped precisely
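To make the leak concrete, here is a small sketch of a keyword index that maps each term to exactly the providers holding it. The function names and the sample texts are hypothetical; only the term-to-providers mapping comes from the slide.

```python
from collections import defaultdict


def build_keyword_index(shared):
    """shared: provider -> list of document texts. Returns term -> set of providers."""
    index = defaultdict(set)
    for provider, texts in shared.items():
        for text in texts:
            for term in text.lower().split():
                index[term].add(provider)
    return index


def lookup(index, query):
    """Providers that hold every query term (intersection of the posting sets)."""
    postings = [index.get(term.lower(), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


# Made-up contents for two providers.
shared = {"P1": ["amyloid peptide plaques"], "P2": ["smallpox vaccine"]}
index = build_keyword_index(shared)
hits = lookup(index, "Amyloid Peptide")
```

Anyone who can query this index, including George, learns that P1 and only P1 holds the terms, which is the precise mapping the slide identifies as the problem.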

Soln #2: Keyword Index

Intuition: add “false positives”

Example
Amyloid → {…, P1, P2, …}
Peptide → {…, P1, P2, …}

Soln #3: Privacy Preserving Index (PPI)

[Figure: Alice/George issue Q = “Amyloid Peptide” to the privacy preserving index, which lists P1 and P2: “P1 or P2 may have a document with words ‘Amyloid Peptide’”]

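Soln #3: Privacy Preserving Index

Privacy Preserving Index: $t_i \mapsto M \subseteq P$, where

[A] $M = \emptyset$ only if $\neg\exists\, d_j : t_i \in d_j$
[B] $M = P_{true} \cup P_{false}$, with $|P_{false}| \geq |P_{true}|$
[C] $M = P$

Properties: completeness, quantifiable privacy on the Reiter-Rubin scale, loss in selectivity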
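The three cases [A], [B], [C] can be checked mechanically for a single term. The sketch below assumes that [B] requires at least as many false positives as true positives (the comparison symbol is missing from the slide text) and uses a hypothetical helper name, `satisfies_ppi`.

```python
def satisfies_ppi(m, p_true, providers):
    """Check one result set M against cases [A], [B], [C].

    m: providers returned for the term
    p_true: providers that really hold a matching document
    providers: the full provider set P
    """
    m, p_true, providers = set(m), set(p_true), set(providers)
    if not m <= providers:
        return False                  # M must be a subset of P
    if not m:
        return not p_true             # [A] empty only if nothing matches
    if m == providers:
        return True                   # [C] every provider is returned
    p_false = m - p_true
    # [B] all true positives present, hidden among enough false positives
    return p_true <= m and len(p_false) >= len(p_true)


P = {"P1", "P2", "P3", "P32", "P2026"}
ok_padded = satisfies_ppi({"P1", "P2"}, {"P1"}, P)   # one true, one false positive
leaky = satisfies_ppi({"P1"}, {"P1"}, P)             # precise mapping
```

The padded result passes while the precise one fails, mirroring the move from the keyword index to the PPI.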

Consistency of Behavior

1. Results for “Peptide” should tally with results from searches earlier

2. Results for “Amyloid Peptide”, “Amyloid”, and “Peptide” should tally with one another

3. …

Filtering of “noise” impossible

A mechanism for constructing a Privacy Preserving Index (PPI)

Step 1: Content Vectors

[Figure: each provider’s content summarized as a vector of bits]
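One way to picture a content vector is as a fixed-length bit vector in which each term of a provider's documents sets one bit. The slide shows only the bits, so the hashing scheme, the vector length `VECTOR_BITS`, and the function names below are assumptions of this sketch rather than the authors' construction.

```python
import hashlib

VECTOR_BITS = 64  # assumed vector length, chosen only for illustration


def bit_position(term, bits=VECTOR_BITS):
    """Map a term to a bit position with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(term.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % bits


def content_vector(texts, bits=VECTOR_BITS):
    """Set one bit for every term appearing in the provider's documents."""
    vector = [0] * bits
    for text in texts:
        for term in text.split():
            vector[bit_position(term, bits)] = 1
    return vector


def may_contain(vector, query):
    """True if every query term's bit is set; collisions can cause false positives."""
    return all(vector[bit_position(term, len(vector))] == 1
               for term in query.split())


vec = content_vector(["amyloid peptide plaques"])  # made-up content
hit = may_contain(vec, "Amyloid Peptide")
```

A set bit says only that some term hashing to that position may be present, so the vector never yields a false negative for a term the provider actually holds.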

Step 2: Privacy Groups

[Figure: providers partitioned into privacy groups, e.g. Group A, Group F, Group Z]

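Step 3: Group (OR) Vector

Error $\epsilon$: $0 < \epsilon < 1$

$r = \max\left(3,\ -\log\left[1 - \left\{\tfrac{7}{8}(1-\epsilon)\right\}^{\frac{1}{c-1}}\right]\right)$

Theorem: After $r$ rounds, the Group Vector subsumes $V_i$ for all $i \in G$ with prob. $1 - \epsilon$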

Step 4: Global Index

[Figure: the group vectors of Group A, Group F, and Group S form the keyword index (PPI) over providers P1, P2, P3, P32, P2026]

Searches

[Figure: Alice/George issue Q = “Amyloid Peptide” to the keyword index (PPI), which answers with Group F]

Intuition: 3. Group Vector

Group Vector is a logical OR => Members are indistinguishable

Privacy ∝ size of group

Search Cost ∝ size of group

Privacy vs Performance Tradeoff
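The tradeoff can be seen in a toy version of the group vector and the search over it. This sketch computes the OR directly rather than through the randomized rounds of Step 3, and the group memberships, bit patterns, and function names are invented for illustration.

```python
def group_vector(member_vectors):
    """Bitwise OR of the members' content vectors."""
    return [int(any(bits)) for bits in zip(*member_vectors)]


def candidate_groups(global_index, query_bits):
    """Groups whose vector has every query bit set."""
    return [group for group, vector in global_index.items()
            if all(vector[i] for i in query_bits)]


# Hypothetical groups and 4-bit content vectors.
groups = {
    "A": {"P1": [1, 0, 1, 0], "P2": [0, 0, 1, 0]},
    "F": {"P3": [0, 1, 0, 0], "P32": [0, 1, 0, 1]},
}
global_index = {name: group_vector(list(members.values()))
                for name, members in groups.items()}

query_bits = [1, 3]                      # bit positions of the query terms
hits = candidate_groups(global_index, query_bits)
to_contact = sorted(groups["F"])         # the searcher must ask every member
true_holders = [p for p, v in groups["F"].items()
                if all(v[i] for i in query_bits)]
```

The index reveals only that someone in Group F may match, so P3 and P32 are indistinguishable to the searcher, who pays for that by contacting both. A larger group hides each member better and costs more queries.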

Empirical Evaluation

Number of Rounds (Step 3)

$r = \max\left(3,\ -\log\left[1 - \left\{\tfrac{7}{8}(1-\epsilon)\right\}^{\frac{1}{c-1}}\right]\right)$

Evaluation Procedure

• YouServ: Personal web-server deployed within IBM corporate intranet since 2001

• Content from 324 YouServ web-servers

• Partitioned into privacy groups of size C

• Query Set consisting of 100 queries chosen randomly from YouServ query logs

Loss in Recall

Summary

• Searches on access-controlled data
– Privacy Preserving Indexes
– Randomized Construction

• Project Home
– Google: Stanford Peers
– Google: IBM YouServ

The End

Growing Privacy Concerns

• Popular Press
– Economist: The End of Privacy (’99)
– Time: The Death of Privacy (’97)

• Govt. Directives/Commissions
– European Union Directive on Privacy Protection (’98)
– Canadian Personal Information Protection Act (’01)

Context

“The misuse of subpoena process by an adult entertainment company emphasizes the potential for abuse with insufficient privacy protections in the law.”

--- Cindy Cohn (Legal Director, Electronic Frontier Foundation)

Context

“Better support for anonymity and privacy is sorely needed […] amid the RIAA’s campaign to subpoena information about customers.”

--- Wendy Seltzer (Staff Attorney, Electronic Frontier Foundation)

Growing Privacy Concerns

In July 2003, the RIAA began filing DMCA Section 512(h) subpoenas, at a rate of 75 or more per day, to force ISPs to identify file sharers.

DMCA 512(h) subpoenas are issued without prior judicial review […and so…] may be used to obtain identity information in cases where there is no copyright infringement.

Growing Privacy Concerns

• Unfair: Walmart/KMart against a customer who posted their prices at a comparison-shopping site

• Errors: RIAA against Prof. Usher at Penn State Dept. of Astronomy & Astrophysics [+ dozen other cases]

• Vested: A person against ISPs to erase record of his past messages

• Others: Against Internet Archive, …

Automating Search

[Figure: Alice issues Q = “Amyloid Peptide” to providers P1, P2, P3, P32, P2026, with the result (Alzheimer’s Disease, Alice)]

Adversary

Passive (observes sent messages: queries, responses, indexes)

Active (acts deliberately: searcher, provider, indexer)

Global/Local view

Collude/Independent actions

Quantifying Privacy

Probabilistic scale [Rei98]: 0 (Absolute Privacy), 1/2, 1 (Provable Exposure)

Search Methodology

Privacy Preserving Index: $t_i \mapsto M \subseteq P$, where

[A] $M = \emptyset$ only if $\neg\exists\, d_j : t_i \in d_j$
[B] $M = P_{true} \cup P_{false}$, with $|P_{false}| \geq |P_{true}|$
[C] $M = P$

Loss in Selectivity: $|P_{false}| / |P_{true}|$ for [B]; at most 2 for [C]

Correctness: no true positives excluded; provider enforces access control

Privacy: all providers equivalent in [A] and [C]

[Figure: [B] placed on the probabilistic scale 0, 1/2, 1]

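3. Constructing OR Vector

Within a group (Group F in the figure), each provider compares its own bit $b_i$ with the group-vector bit $B_i$:

$$
B_i \leftarrow
\begin{cases}
B_i \ (\text{nop}) & \text{if } b_i = B_i \\
1 \ (0 \to 1) \ \text{with prob. } P_{in} & \text{else if } b_i = 1 \\
0 \ (1 \to 0) \ \text{with prob. } P_{out} & \text{else if } b_i = 0
\end{cases}
$$

[Slide detail lost: the start values of $P_{in}$ and $P_{out}$ and their update in every round, which involve the constants 1/2 and 1]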
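A rough simulation of a randomized OR-vector construction is sketched below. Each provider in turn compares its own bit with the group bit and, where they differ, flips the group bit with probability `p_in` (0 to 1) or `p_out` (1 to 0). The probability schedule is an assumption of this sketch, since the slide's values did not survive extraction: `p_out` starts at 1/2 and halves each round, with `p_in = 1 - p_out`. The random starting vector and the function name are likewise assumptions.

```python
import random


def build_group_vector(member_vectors, rounds, rng):
    """Pass one vector around the group for `rounds` rounds, flipping bits at random.

    Assumed schedule (not taken from the slide): p_out starts at 1/2 and
    halves every round; p_in = 1 - p_out.
    """
    n_bits = len(member_vectors[0])
    group = [rng.randint(0, 1) for _ in range(n_bits)]  # assumed random start
    p_out = 0.5
    for _ in range(rounds):
        p_in = 1.0 - p_out
        for own in member_vectors:
            for i in range(n_bits):
                if own[i] == group[i]:
                    continue                      # bits agree: no-op
                if own[i] == 1:
                    if rng.random() < p_in:
                        group[i] = 1              # 0 -> 1 with probability p_in
                elif rng.random() < p_out:
                    group[i] = 0                  # 1 -> 0 with probability p_out
        p_out /= 2
    return group


members = [[1, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0]]   # made-up content vectors
target_or = [int(any(bits)) for bits in zip(*members)]
result = build_group_vector(members, rounds=30, rng=random.Random(0))
subsumes = all(r >= t for r, t in zip(result, target_or))
```

Under this schedule the late rounds almost never clear a bit and almost always set a missing one, so after many rounds the result covers every member's set bits with overwhelming probability. Bits left over from the random start may occasionally survive as extra noise.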

Construction Properties

Completeness: For any query q, the result set Mq contains all providers that share documents matching q

Correctness: The mapping Mq is expected to be a Privacy Preserving Index

Construction Properties

Privacy: Within a privacy group G, an active adversary can only breach its neighbor’s privacy with probability < 0.71 (Possible Innocence)

0 1/2 1

Data Characteristics

Selectivity of a Term

Related Work

• Private Information Retrieval
– Information theoretic privacy
– Inefficient for keyword searching

• Secure Databases
– Single trusted data host

• Anonymity Channels
– Source of message to be anonymous

• Secure Multi-Party/Coprocessors