Privacy Preserving Record Linkage with PPJoin

Ziad Sehili, Lars Kolb, Christian Borgs, Rainer Schnell, Erhard Rahm
Datenbanksysteme für Business, Technologie und Web (BTW), 2015
Presentation by Mateus Cruz, September 15, 2016


OUTLINE

1 Introduction

2 Preliminaries

3 Proposal

4 Experiments

5 Conclusion


OVERVIEW

- Find pairs of similar records
- Quadratic complexity
  - Scalability problems
- Adapt PPJoin [1] to encrypted data
  - Filtering reduces search space
- Parallelize to improve performance
  - GPUs

[1] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu: “Efficient Similarity Joins for Near Duplicate Detection”, WWW 2008

DATA REPRESENTATION

- Create Bloom filters using MD5 and SHA-1 (deterministic)
  - Similarity preserving
  - Allows length filtering
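
As a rough illustration of the encoding above, the sketch below maps the bigrams of a string to bit positions with MD5 and SHA-1. The double-hashing combination `h1 + i * h2`, the `_` padding, and the function names are assumptions made for this example, not details stated on the slide; the defaults `m=1000` and `k=20` follow the experimental setup.

```python
import hashlib


def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = "_" + value.lower() + "_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_filter(value, m=1000, k=20):
    """Return the set of bit positions that are set to 1 for `value`."""
    positions = set()
    for gram in bigrams(value):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hashlib.md5(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        for i in range(k):
            # Assumed double hashing: the i-th position is (h1 + i * h2) mod m.
            positions.add((h1 + i * h2) % m)
    return positions


def jaccard(a, b):
    """Jaccard similarity of two sets of bit positions."""
    return len(a & b) / len(a | b)


smith = bloom_filter("smith")
smyth = bloom_filter("smyth")
jones = bloom_filter("jones")
```

Because the hashes are deterministic, equal bigrams always set the same bits, so strings that share many bigrams (`smith`, `smyth`) end up with a higher Jaccard similarity than unrelated ones (`smith`, `jones`).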

PPJOIN [2]

- Position Prefix Join
- Signature-based algorithm
- Filtering techniques
  - Length filter
  - Prefix filter
  - Position filter

[2] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu: “Efficient Similarity Joins for Near Duplicate Detection”, WWW 2008

LENGTH FILTER

“If two records are similar, the difference between their lengths cannot be large”

- Sort records by length
- Using Jaccard similarity: $\delta |s| \le |r| \le |s| / \delta$
  - $|s|$: length of $s$
  - $\delta$: similarity threshold
- Group records according to their lengths
  - Prune pairs of records from different groups
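
A minimal sketch of the length filter, assuming records are Python sets of tokens. The helper names `passes_length_filter` and `candidate_pairs` are hypothetical; the grouping by length mirrors the last bullet, and only groups whose lengths are compatible under the Jaccard bound are paired.

```python
from collections import defaultdict


def passes_length_filter(len_r, len_s, delta):
    """True if a Jaccard similarity >= delta is still possible for these lengths."""
    shorter, longer = sorted((len_r, len_s))
    return shorter >= delta * longer


def candidate_pairs(records, delta):
    """Yield index pairs (i, j) with i < j that survive the length filter."""
    groups = defaultdict(list)  # record length -> indexes of records
    for idx, rec in enumerate(records):
        groups[len(rec)].append(idx)
    lengths = sorted(groups)
    for a, len_a in enumerate(lengths):
        for len_b in lengths[a:]:
            if not passes_length_filter(len_a, len_b, delta):
                break  # lengths are sorted, so all longer groups fail as well
            for i in groups[len_a]:
                for j in groups[len_b]:
                    if len_a != len_b or i < j:
                        yield (min(i, j), max(i, j))
```

For example, with $\delta = 0.8$ a record of length 4 is never compared with one of length 6, because $4 < 0.8 \cdot 6$.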

PREFIX FILTER

“If two records are similar, they must share some tokens”

- Sort tokens in each record
  - Alphabetical order, IDF order, etc.
- Select the first $p$ tokens
  - For Jaccard similarity (JS), $p = \lfloor (1 - \delta)|s| \rfloor + 1$
- Prune pairs for which $s_p \cap r_p = \emptyset$
  - $s_p$: prefix of $s$ (containing the first $p$ tokens)
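
The prefix filter can be sketched as follows, assuming alphabetical token order and a threshold passed as a `Fraction` so that the floor is computed exactly. The function names are illustrative only.

```python
import math
from fractions import Fraction


def prefix(tokens, delta):
    """Return the first p tokens of a record in alphabetical order."""
    ordered = sorted(tokens)
    p = math.floor((1 - delta) * len(ordered)) + 1
    return set(ordered[:p])


def passes_prefix_filter(r, s, delta):
    """A pair stays a candidate only if the two prefixes share a token."""
    return bool(prefix(r, delta) & prefix(s, delta))


delta = Fraction(4, 5)  # JS threshold 0.8
r, s = set("BCDEF"), set("ABCDF")
keep = passes_prefix_filter(r, s, delta)
```

Here both records have five tokens, so $p = 2$; the prefixes {B, C} and {A, B} share B and the pair is kept for the next filter.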

POSITION FILTER

“If two records are similar, their maximal overlap cannot be smaller than the minimally needed overlap”

- Minimal overlap
  - $\alpha = \lceil \frac{t}{1+t} \cdot (|r| + |s|) \rceil$, where $t$ is the similarity threshold
- Divide each record into left and right parts
  - lp: tokens already seen
  - rp: unseen tokens
- Prune if $|lp(r) \cap lp(s)| + \min(|rp(r)|, |rp(s)|) < \alpha$
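
A small sketch of this check, assuming records are token lists in the same global order and that a shared token was found at index `i` of `r` and index `j` of `s`, with the left part taken to include that token. The threshold is a `Fraction` to keep the ceiling exact; all names are hypothetical.

```python
import math
from fractions import Fraction


def min_overlap(len_r, len_s, t):
    """alpha = ceil(t / (1 + t) * (|r| + |s|)), the overlap needed for JS >= t."""
    return math.ceil(t / (1 + t) * (len_r + len_s))


def position_filter_prunes(r, s, i, j, t):
    """Return True if the pair cannot reach the minimal overlap.

    r[i] == s[j] is the shared token; everything up to and including it is
    the left part (seen), everything after it is the right part (unseen).
    """
    seen = len(set(r[:i + 1]) & set(s[:j + 1]))
    unseen = min(len(r) - i - 1, len(s) - j - 1)
    return seen + unseen < min_overlap(len(r), len(s), t)


t = Fraction(4, 5)
r, s = list("BCDEF"), list("ABCDF")
pruned = position_filter_prunes(r, s, 0, 1, t)  # shared token B
```

With these records $\alpha = \lceil \frac{0.8}{1.8} \cdot 10 \rceil = 5$, while the best possible overlap is $1 + 3 = 4$, so the pair is pruned.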

PPJOIN PREPROCESSING

PPJOIN INDEX

- Pair (r1, r4) filtered by length filter
  - $|r_1| < \delta \cdot |r_4|$ ($4 < 0.8 \cdot 6$)
- Pair (r3, r2) filtered by position filter

P4JOIN

- “PPJoin for Encrypted Data” (P4Join)
- Records are Bloom filters (BFs) of fixed size
- Consider bit positions as tokens
- Length is the number of 1 bits
  - Called cardinality
- Does not need an inverted index
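
To make the mapping from bit arrays to tokens concrete, here is a minimal sketch that stores a Bloom filter as a Python integer (an assumption for brevity; the function names are illustrative). Set bit positions act as tokens, the bit count is the length, and Jaccard similarity reduces to bitwise AND and OR.

```python
def cardinality(bits):
    """Number of 1 bits, which plays the role of the record length."""
    return bin(bits).count("1")


def tokens(bits):
    """Positions of the 1 bits, i.e. the tokens of the record."""
    return {pos for pos in range(bits.bit_length()) if (bits >> pos) & 1}


def jaccard(a, b):
    """Jaccard similarity of two bit arrays via bitwise AND and OR."""
    return cardinality(a & b) / cardinality(a | b)


a = 0b110110
b = 0b100111
similarity = jaccard(a, b)  # 3 common bits out of 5 set bits overall
```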

P4JOIN PREPROCESSING

- Length is the number of 1 bits (cardinality)
  - Prefixes with same lengths, but different sizes

P4JOIN PROCESSING

- High cost to maintain inverted index
- Original position filter reduces performance
- lmap
  - Lists relevant records based on length filter

P4JOIN LENGTH FILTER

- r1 does not satisfy the length filter
  - $7 < 0.8 \cdot 11$

P4JOIN PREFIX FILTER

- Check overlap by AND operation
  - Prune pair (r4, r2)
    - 000011011 AND 1111 = 000000000
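
The AND-based check can be sketched like this, assuming bit arrays stored as Python integers, a prefix made of the first $p$ set bits counted from bit position 0, and the Jaccard prefix length $p = \lfloor (1 - \delta) \cdot \text{cardinality} \rfloor + 1$. The bit ordering and the helper names are assumptions of this example, not taken from the slide.

```python
import math
from fractions import Fraction


def prefix_mask(bits, delta):
    """Keep only the first p set bits of `bits`, scanning from position 0."""
    card = bin(bits).count("1")
    p = math.floor((1 - delta) * card) + 1
    mask, pos = 0, 0
    while p > 0 and bits >> pos:
        if (bits >> pos) & 1:
            mask |= 1 << pos
            p -= 1
        pos += 1
    return mask


def passes_prefix_filter(r_bits, s_bits, delta):
    """A pair stays a candidate only if the AND of both prefixes is non-zero."""
    return (prefix_mask(r_bits, delta) & prefix_mask(s_bits, delta)) != 0


delta = Fraction(4, 5)
low = 0b0000011111   # 1 bits at positions 0..4
high = 0b1111100000  # 1 bits at positions 5..9
near = 0b0000101111  # shares its first set bits with `low`
```

With five set bits each, the prefixes hold two bits; `low` and `high` have disjoint prefixes and are pruned, while `low` and `near` remain candidates.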

P4JOIN POSITION FILTER

Prune pair (r4,r3)

P4JOIN WITH GPUS

- Bit arrays of type long (64 bits)
- Divide R and S into partitions
  - To fit in the GPU’s memory
- Sort records in partitions
- Check if partitions have candidate pairs
  - Using length filter
  - If no candidates, do not even send to GPU
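
The partition-level check can be illustrated on the CPU side as below. This is only a sketch under assumptions: Python integers stand in for the 64-bit words, records are sorted by cardinality before being cut into partitions, and the names `partition` and `may_contain_candidates` are hypothetical.

```python
def cardinality(bits):
    """Number of 1 bits of a record."""
    return bin(bits).count("1")


def partition(records, max_size):
    """Sort records by cardinality and cut them into chunks of at most max_size."""
    ordered = sorted(records, key=cardinality)
    return [ordered[i:i + max_size] for i in range(0, len(ordered), max_size)]


def may_contain_candidates(part_r, part_s, delta):
    """Length filter applied to two whole partitions (each sorted by cardinality).

    If even the closest cardinalities of the two partitions violate the
    length bound, no record pair can match and the pair of partitions
    does not have to be transferred to the GPU.
    """
    min_r, max_r = cardinality(part_r[0]), cardinality(part_r[-1])
    min_s, max_s = cardinality(part_s[0]), cardinality(part_s[-1])
    return max_r >= delta * min_s and max_s >= delta * min_r


small = [0b1, 0b11, 0b111]                 # cardinalities 1, 2, 3
medium = [0b111, 0b1111]                   # cardinalities 3, 4
large = [0b1111111111, 0b11111111111]      # cardinalities 10, 11
```

The test is conservative: it can let through a partition pair that later yields no matches, but it never skips a pair that contains a true candidate.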

P4JOIN GPU PROCESSING

- One kernel per record of $R_i$
  - Comparing with all records from $S_j$
  - Prune using length and prefix filters
  - Matches are saved in the global memory

SETUP

- Hardware
  - CPU 4-core 2.67 GHz, 4 GB memory
  - GPUs
    - NVIDIA GeForce GT 610 (1 GB memory)
    - NVIDIA GeForce GT 540M (1 GB memory)
- Parameters
  - Bigrams as tokens
  - Bit vector length: 1000
  - JS threshold: 0.8
  - Number of hash functions: k = 20
  - Maximum partition size: 2000

CPU PERFORMANCE

- Most gains from length filter
- Large overhead for prefix filter

GPU PERFORMANCE

- Speedups of 20%
  - Compared to sequential CPU approach

SUMMARY

- Adaptation of PPJoin to PPRL
  - Records are encrypted bit arrays
- Parallelization using GPUs
- Bit arrays reduce effectiveness of filters
  - Due to overheads

EXTRA SLIDES

P4JOIN ALGORITHM

POSITION FILTER

“If two records are similar, the upper bound of their JS cannot be smaller than the threshold δ”

- Compute prefixes $s_p$ and $r_p$
- Calculate the upper bound of their JS ($\Theta$):
  $$\Theta = \frac{|s_p \cap r_p| + \min(|s| - |s_p|, |r| - |r_p|)}{|s_p \cup r_p| + \max(|s| - |s_p|, |r| - |r_p|)}$$
- Prune the pair if $\Theta < \delta$
- Example [a]: $r = \{B, C, D, E, F\}$, $s = \{A, B, C, D, F\}$, $\delta = 0.8$
  $$\Theta = \frac{1 + 3}{3 + 3} = \frac{4}{6} \approx 0.7 \rightarrow \text{prune pair } (r, s)$$

[a] Example from Jiang et al.: “String similarity joins: An experimental evaluation”, VLDB 2014
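
The upper bound from this slide can be reproduced with a few lines, assuming alphabetical token order, the Jaccard prefix length $p = \lfloor (1 - \delta)|s| \rfloor + 1$, and exact arithmetic via `Fraction`. The function names are illustrative.

```python
import math
from fractions import Fraction


def prefix(tokens, delta):
    """Return the first p tokens of a record in alphabetical order."""
    ordered = sorted(tokens)
    p = math.floor((1 - delta) * len(ordered)) + 1
    return set(ordered[:p])


def jaccard_upper_bound(r, s, delta):
    """Upper bound of the Jaccard similarity computed from the prefixes only."""
    rp, sp = prefix(r, delta), prefix(s, delta)
    rest_r, rest_s = len(r) - len(rp), len(s) - len(sp)
    return Fraction(len(sp & rp) + min(rest_s, rest_r),
                    len(sp | rp) + max(rest_s, rest_r))


delta = Fraction(4, 5)
theta = jaccard_upper_bound(set("BCDEF"), set("ABCDF"), delta)
prune = theta < delta
```

For the example records the prefixes are {B, C} and {A, B}, giving $\Theta = 4/6$, which is below the threshold of 0.8, so the pair is pruned.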