Genomic Privacy - Institute for Mathematics and its …...DTC and genomic privacy From the 23andMe...
Genomic Privacy: Limits of Individual Detection in a Pool
Sriram Sankararaman (a), Guillaume Obozinski (b), Michael I. Jordan (c), Eran Halperin (d)
(a) Harvard Medical School, (b) INRIA, (c) UC Berkeley, (d) Tel Aviv University
GWAS: Genome-wide Association Studies
Cases (rows: individuals, columns: SNPs)
0 1 1 0 0 0 1 0 0 1 0 1
0 1 1 1 0 0 0 0 1 0 1 0
0 1 1 1 0 0 1 0 0 1 0 0
1 1 0 0 0 0 1 1 0 0 0 1
Controls (rows: individuals, columns: SNPs)
0 1 0 0 0 0 1 0 0 1 1 0
0 1 0 1 0 0 0 0 1 1 0 0
0 1 0 1 1 1 1 1 0 0 0 1
0 1 0 0 1 0 1 1 0 0 0 1
GWAS facts
Looking for common SNPs
Frequency above 1%
Chosen to be correlated with unobserved causal variants.
Most of these SNPs have low effect sizes.
Testing about a million SNPs
Bottom line: Need a large number of samples to have sufficient power.
GWAS so far
600 studies covering around 150 traits (Manolio, 2010)
Power can be increased by combining data from multiple studies.
Tens to hundreds of thousands of participants are common.
Rheumatoid arthritis (5K cases, 17K controls), Alzheimer's (7K, 14K), lipid levels and cholesterol (~100K).
This has led to the setting up of central data-sharing repositories such as dbGaP, EGP archive, WTCCC.
What about individual privacy?
Some views on privacy and sharing
Give up privacy assurances, e.g. PGP (Personal Genome Project).
Have streamlined procedures to regulate access to data.
The middle ground?
Separate individual-level and summary data.
Make summary data public.
DTC and genomic privacy
From the 23andMe website:
"23andMe may collaborate with external parties. Under this informed consent, external parties will only have access to pooled data stripped of identifying information. 23andMe will never release your individual-level data to any third party without asking for and receiving your explicit authorization to do so."
Do these measures guarantee the privacy of participants?
Individual Detection in a Pool
Cases (rows: individuals, columns: SNPs)
0 1 1 0 0 0 1 0 0 1 0 1
0 1 1 1 0 0 0 0 1 0 1 0
0 1 1 1 0 0 1 0 0 1 0 0
1 1 0 0 0 0 1 1 0 0 0 1
Pooled case allele frequencies:
0.25 1 0.75 0.5 0 ... 0.5 0.25 0.5
Controls (rows: individuals, columns: SNPs)
0 1 0 0 0 0 1 0 0 1 1 0
0 1 0 1 0 0 0 0 1 1 0 0
0 1 0 1 1 1 1 1 0 0 0 1
0 1 0 0 1 0 1 1 0 0 0 1
Pooled control allele frequencies:
0 1 0 0.5 0.5 ... 0.5 0.25 0.5
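The pooled frequencies on this slide are the column means of the genotype matrices. A minimal sketch of that step, assuming NumPy and using the toy matrices from the slide (the function name `pool_frequencies` is hypothetical):

```python
import numpy as np

# Toy genotype matrices copied from the slide
# (rows are individuals, columns are SNPs).
cases = np.array([
    [0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1],
])
controls = np.array([
    [0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1],
])


def pool_frequencies(genotypes):
    """Return the per-SNP allele frequency of a pool (the column means)."""
    return genotypes.mean(axis=0)


case_freq = pool_frequencies(cases)
control_freq = pool_frequencies(controls)
print(case_freq)
print(control_freq)
```

Only these per-SNP summaries would be released in the summary-data setting; the individual rows stay private.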
Query genotype: 0 1 1 1 0 0 0 0 1 0 1 0. Is this individual in the case pool?
High-density SNP arrays can be used to resolve DNA mixtures
Homer et al., PLoS Genetics, 2008
Identification in Pools
NIH and others removed summary data from public access.
Need a mathematical model of privacy.
Forensics vs Privacy
Forensics: Given data, choose a procedure to maximize power.
Privacy: Select data to expose such that the maximum power attained by an adversary is small.
Bounds matter: privacy requires limiting the power of the best possible attack, not just the power of one known attack.
Limits of Individual Detection
Formulate individual detection in a pool as a hypothesis testing problem.
Likelihood-Ratio test (LR-test) is optimal for this hypothesis test (Neyman-Pearson lemma)
$$L(x) = \frac{\Pr(x \mid H_1)}{\Pr(x \mid H_0)} \ge t(\alpha), \qquad \Pr\big(L(x) \ge t(\alpha) \mid H_0\big) = \alpha$$
The power of the LR-test provides an upper bound on the power of any method.
Limits of Individual Detection
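[Figure: graphical models for the null and the alternative hypothesis, relating the tested genotype $x$, the pool genotypes $X_i$, the population frequency $p$, and the pool frequency $\hat{p}$; one panel has a plate of size $n - 1$ and the other a plate of size $n$.]
Likelihood-ratio test
$$L = \sum_{j=1}^{m} \left[ x_j \log\frac{\hat{p}_j}{p_j} + (1 - x_j) \log\frac{1 - \hat{p}_j}{1 - p_j} \right]$$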
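The LR statistic can be sketched in a few lines. This is an illustrative version, assuming binary genotypes, NumPy, and a hypothetical clipping constant `eps` that keeps frequencies away from 0 and 1 so the logarithms stay finite; the clipping is not part of the slides.

```python
import numpy as np


def lr_statistic(x, pool_freq, pop_freq, eps=1e-6):
    """Sum over SNPs of x*log(p_hat/p) + (1-x)*log((1-p_hat)/(1-p)).

    x         : binary genotype of the tested individual, one entry per SNP
    pool_freq : pool allele frequencies (p_hat in the slides)
    pop_freq  : population allele frequencies (p in the slides)
    eps       : assumed clipping constant, only there to avoid log(0)
    """
    x = np.asarray(x, dtype=float)
    p_hat = np.clip(np.asarray(pool_freq, dtype=float), eps, 1 - eps)
    p = np.clip(np.asarray(pop_freq, dtype=float), eps, 1 - eps)
    terms = x * np.log(p_hat / p) + (1 - x) * np.log((1 - p_hat) / (1 - p))
    return float(terms.sum())


# Two SNPs: the pool frequency is 0.5 at both, the population frequency 0.25.
print(lr_statistic([1, 0], [0.5, 0.5], [0.25, 0.25]))
```

Large positive values favor the alternative (the individual is in the pool); when the pool and population frequencies coincide the statistic is zero and carries no information.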
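What happens for large pools?
Need SNPs to be common: $a < p < 1 - a$, $a > 0$.
For a single SNP:
$$
\begin{aligned}
L &= x \log\left(\frac{\hat{p}}{p}\right) + (1 - x) \log\left(\frac{1 - \hat{p}}{1 - p}\right) \\
&\approx \frac{(x - p)(\hat{p} - p)}{p(1 - p)} - \frac{1}{2}\,\frac{(x - p)^2 (\hat{p} - p)^2}{p^2 (1 - p)^2} \\
&\approx \frac{1}{\sqrt{n}}\,\frac{x - p}{\sqrt{p(1 - p)}}\, Z - \frac{1}{2n}\,\frac{(x - p)^2}{p(1 - p)}\, Z^2 ,
\end{aligned}
$$
where $Z = \sqrt{n}\,(\hat{p} - p)/\sqrt{p(1 - p)}$ is the standardized pool frequency.
$$E_0[L] = -\frac{1}{2n}, \quad V_0(L) \approx \frac{1}{n}, \qquad E_1[L] \approx +\frac{1}{2n}, \quad V_1(L) \approx \frac{1}{n}$$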
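These large-pool moments can be checked numerically. The following is a Monte Carlo sketch under assumed settings (binary genotypes, one SNP with $p = 0.3$, pool size $n = 200$, a fixed seed); the function names and parameter values are illustrative, not from the slides.

```python
import numpy as np


def lr_per_snp(x, p_hat, p):
    """Per-SNP log-likelihood ratio x*log(p_hat/p) + (1-x)*log((1-p_hat)/(1-p))."""
    return x * np.log(p_hat / p) + (1 - x) * np.log((1 - p_hat) / (1 - p))


def simulate_moments(n=200, p=0.3, trials=400_000, seed=0):
    """Return (mean, variance) of L under the null and under the alternative."""
    rng = np.random.default_rng(seed)
    tiny = 1e-12  # guard against log(0) if a pool frequency hits 0 or 1

    # Null: the tested individual is a fresh draw, independent of the pool.
    x0 = rng.binomial(1, p, size=trials)
    p_hat0 = rng.binomial(n, p, size=trials) / n

    # Alternative: the tested individual is one of the n pool members.
    x1 = rng.binomial(1, p, size=trials)
    p_hat1 = (x1 + rng.binomial(n - 1, p, size=trials)) / n

    l0 = lr_per_snp(x0, np.clip(p_hat0, tiny, 1 - tiny), p)
    l1 = lr_per_snp(x1, np.clip(p_hat1, tiny, 1 - tiny), p)
    return l0.mean(), l0.var(), l1.mean(), l1.var()


mean0, var0, mean1, var1 = simulate_moments()
print("null:        mean", mean0, "variance", var0)
print("alternative: mean", mean1, "variance", var1)
print("theory:      mean -/+", 1 / (2 * 200), "variance", 1 / 200)
```

The simulated means should sit close to $\mp 1/(2n)$ and both variances close to $1/n$, with the agreement improving as $n$ grows.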
Main Result
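$$z_\alpha + z_{1-\beta} \approx \sqrt{\frac{m}{n}}$$
[Figure: densities of the LR statistic under the null (mean $\mu_0$) and the alternative (mean $\mu_1$), with areas labeled $1 - \alpha$ and $\beta$ and distances labeled $z_\alpha \sigma_0$ and $z_{1-\beta} \sigma_1$.]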
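The step from the per-SNP moments to this relation can be sketched as follows. This is a reconstruction rather than something stated on the slides: it assumes $m$ independent SNPs, a normal approximation to the summed statistic, $\beta$ read as the power, and $z_y$ read as the upper $y$-quantile of the standard normal (the reading that matches the figure labels).

$$
\begin{aligned}
\mu_0 &\approx -\frac{m}{2n}, \qquad \mu_1 \approx +\frac{m}{2n}, \qquad \sigma_0^2 \approx \sigma_1^2 \approx \frac{m}{n} \\
t &= \mu_0 + z_\alpha \sigma_0 && \text{(false positive rate } \alpha\text{)} \\
t &= \mu_1 - z_{1-\beta} \sigma_1 && \text{(power } \beta\text{)} \\
z_\alpha + z_{1-\beta} &\approx \frac{\mu_1 - \mu_0}{\sigma_0} = \frac{m/n}{\sqrt{m/n}} = \sqrt{\frac{m}{n}}
\end{aligned}
$$

Under these assumptions the LR test attains the relation, and by the Neyman-Pearson lemma no other test can reach higher power at the same false positive rate.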
Values of $m/n$ for $\beta = 10\alpha$:
log10 α     m/n
-2.0000     1.0916
-3.0000     0.5835
-4.0000     0.3954
-5.0000     0.2980
-6.0000     0.2387
-7.0000     0.1988
-8.0000     0.1703
-9.0000     0.1488
-10.0000    0.1322
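The tabulated ratios can be recomputed from the main result. This sketch uses only the Python standard library and assumes that $\beta$ is the power, that $z_y$ is the upper $y$-quantile of the standard normal, and that the logarithm is base 10; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def upper_quantile(y):
    """Return z_y such that P(Z > z_y) = y for a standard normal Z."""
    return -_STD_NORMAL.inv_cdf(y)


def lr_power(alpha, m, n):
    """Approximate power of the LR test at false positive rate alpha."""
    return _STD_NORMAL.cdf(sqrt(m / n) - upper_quantile(alpha))


def snps_per_individual(alpha, beta):
    """Ratio m/n at which z_alpha + z_{1-beta} equals sqrt(m/n)."""
    return (upper_quantile(alpha) + upper_quantile(1 - beta)) ** 2


# Recompute the table: power fixed at ten times the false positive rate.
for log_alpha in range(-2, -11, -1):
    alpha = 10.0 ** log_alpha
    print(log_alpha, round(snps_per_individual(alpha, 10 * alpha), 4))
```

Read the other way, `lr_power(alpha, m, n)` gives the approximate power available to an attacker for a given number of SNPs and pool size, which is the quantity a data custodian would want to keep small.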
Can we apply the LR-test in practice?
Use a leave-one-out procedure on a dataset to obtain empirical power estimates.
Requires an estimate of the population allele frequencies.
Use an independent reference dataset.
$$L = \sum_{j=1}^{m} \left[ x_j \log\frac{\hat{p}_j}{p_j} + (1 - x_j) \log\frac{1 - \hat{p}_j}{1 - p_j} \right]$$
Drop in power.
$$z_\alpha + z_{1-\beta} \approx \sqrt{\frac{m}{n}\left(1 - \frac{n}{\tilde{n}}\right)}$$
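An empirical power estimate of this kind can be sketched on simulated data. The code below is a simplified stand-in for the leave-one-out procedure, not the authors' pipeline: it assumes binary genotypes, independent common SNPs, an independent reference sample that supplies the frequency estimates, and fresh individuals that play the role of the null. All names and default sizes are illustrative.

```python
import numpy as np


def lr_scores(genotypes, pool_freq, ref_freq, eps=1e-3):
    """LR statistic for each row of genotypes, summed over SNPs."""
    p_hat = np.clip(pool_freq, eps, 1 - eps)  # pool frequencies
    p_ref = np.clip(ref_freq, eps, 1 - eps)   # stand-in for population frequencies
    terms = (genotypes * np.log(p_hat / p_ref)
             + (1 - genotypes) * np.log((1 - p_hat) / (1 - p_ref)))
    return terms.sum(axis=1)


def empirical_power(n=100, n_ref=1000, n_null=500, m=2000, alpha=0.05, seed=0):
    """Estimate the power of the LR test at false positive rate alpha."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.1, 0.9, size=m)  # common SNPs only

    def draw(k):
        # k independent individuals, one binary genotype per SNP
        return rng.binomial(1, p, size=(k, m))

    pool, reference, outsiders = draw(n), draw(n_ref), draw(n_null)
    pool_freq = pool.mean(axis=0)
    ref_freq = reference.mean(axis=0)

    null_scores = lr_scores(outsiders, pool_freq, ref_freq)  # not in the pool
    alt_scores = lr_scores(pool, pool_freq, ref_freq)        # pool members

    # Calibrate the threshold on the null scores, then count detections.
    threshold = np.quantile(null_scores, 1 - alpha)
    return float((alt_scores > threshold).mean())


print(empirical_power(m=2000))
print(empirical_power(m=10))
```

With many SNPs per pooled individual the estimated power should be high, and with very few it should fall toward the false positive rate, in line with the dependence on $m/n$.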
Analysis and empirical estimates agree for large pools.
[Figure: two panels, "WTCCC" and "Simulated data", plotting power against false positive rate (log base 10) for the LR test, the LR theory, and Homer et al.; curves are labeled m=1000, m=10000, and m=33138.]
Why does our optimal test have lower power than Homer et al?
Alternative hypothesis is the same.
Tested individual is present in pool.
Nulls differ.
Our null: Tested individual is sampled from the population and is not part of the reference dataset.
Null tested in Homer et al.: Tested individual is part of the reference dataset.
Does this difference in the nulls matter?
Population has 10 individuals, of which 5 are in the pool and the rest in the reference.
Easy to detect individual in pool or reference.
Population has 1 million individuals of which 5 are in the pool.
Harder to detect in reference.
Even harder if only 5 out of these 1 million are available in a reference dataset.
The null tested in Homer et al. is more appropriate for forensics. Our null is more appropriate for privacy.
The other null is indeed easier to test.
[Figure: two panels plotting power against false positive rate for Homer et al., the LR test, and the LR theory.]
Our null requires 4 times more independent SNPs to achieve the same power.
Related questions
Dependent SNPs: Slight decrease in power. Haplotype-based test can be more powerful.
Genotyping errors: Reduce power.
Relatives: Require more SNPs.
Population-independent.
1!2
An alternative framework
$$X_i \in \mathcal{X}, \quad \mathcal{X} = \{0, 1\}^n$$
$$(X_1, \ldots, X_m) \to f(X_1, \ldots, X_m)$$
$$f_{\text{sanitized}} = f + \text{noise}$$
Release a noisy version of f.
It must still be useful for a non-attacker.
An attacker cannot use this sanitized f to learn about X.
Differential Privacy
Dwork et al., 2006
$$\frac{\Pr(f(X) \in S)}{\Pr(f(Y) \in S)} \le \exp(\epsilon) \quad \text{for } \delta(X, Y) = 1$$
Relates to the LR test.
Given a test with false positive rate $\alpha$, the power is at most $\exp(\epsilon)\,\alpha$.
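The link to hypothesis testing amounts to a one-line bound. A minimal sketch, assuming $\epsilon$ is the privacy parameter and $\alpha$ the false positive rate (the function name is hypothetical, and the cap at 1 is added only because power is a probability):

```python
import math


def dp_power_bound(alpha, epsilon):
    """Upper bound exp(epsilon) * alpha on the power of any test, capped at 1."""
    return min(1.0, math.exp(epsilon) * alpha)


# How the bound loosens as the privacy parameter grows, at alpha = 0.01.
for epsilon in (0.0, 0.1, 1.0, 5.0):
    print(epsilon, dp_power_bound(0.01, epsilon))
```

At $\epsilon = 0$ the attacker can do no better than guessing at the false positive rate; the bound becomes vacuous once $\exp(\epsilon)\,\alpha$ reaches 1.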
Exponential mechanism
$$\pi(\eta) \propto \exp\left(-\frac{\epsilon\,\lVert \eta \rVert_1}{S(f)}\right)$$
What is f? Say, the mean allele frequencies.
$$S(f) = \sup_{\{x, y \,:\, \delta(x, y) = 1\}} \lVert f(x) - f(y) \rVert_1$$
What is S(f)? O(number of SNPs).
Bad news: The standard deviation of the noise is proportional to the number of SNPs.
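The scaling problem can be made concrete with a Laplace-noise release of pool allele frequencies. This is a sketch under stated assumptions: binary genotypes, neighboring datasets that differ in one individual (so each of the $m$ frequencies moves by at most $1/n$), and independent Laplace noise with scale $S(f)/\epsilon$ per SNP. The function names are hypothetical.

```python
import numpy as np


def l1_sensitivity(m, n):
    """L1 sensitivity of m pool frequencies when one of n individuals changes."""
    return m / n


def noise_std(m, n, epsilon):
    """Standard deviation of Laplace noise with scale S(f) / epsilon."""
    return np.sqrt(2.0) * l1_sensitivity(m, n) / epsilon


def release_noisy_frequencies(genotypes, epsilon, rng):
    """Return pool allele frequencies with Laplace noise added to each SNP."""
    n, m = genotypes.shape
    freq = genotypes.mean(axis=0)
    scale = l1_sensitivity(m, n) / epsilon
    return freq + rng.laplace(loc=0.0, scale=scale, size=m)


rng = np.random.default_rng(0)
pool = rng.binomial(1, 0.5, size=(1000, 20))
print(release_noisy_frequencies(pool, epsilon=1.0, rng=rng))

# The noise level grows linearly with the number of SNPs released.
for m in (10, 1000, 100000):
    print(m, noise_std(m, n=1000, epsilon=1.0))
```

Since allele frequencies live in [0, 1], a noise standard deviation near or above 1 makes the released values useless, which is what happens once the number of SNPs is comparable to the pool size.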
Conclusions
A statistical framework to analyze the limits of genotype detection in pools.
Provides guidelines on data sharing to researchers.
The analytical bound is valid for large pools and common SNPs.
Use in conjunction with the empirical test.
Future Directions
[Figure: diagram linking identity, phenotype, and genotype.]