Resolving membership in a study in shared aggregate genetics data

Page 1: Resolving membership in a study in shared aggregate genetics data

Resolving membership in a study in shared aggregate genetics data

David W. Craig, Ph.D.
Investigator & Associate Director
Neurogenomics Division
[email protected]

Page 2: Resolving membership in a study in shared aggregate genetics data

Genome-wide Association Studies

Nature Reviews Genetics

Genome-wide Association Studies (GWAS) genotype millions of Single Nucleotide Polymorphisms (SNPs) across thousands of individuals.

SNPs are typically biallelic and humans are diploid, so each person carries one of three genotypes: CC/CT/TT, coded 00/01/11.

Due to ancestral meiotic recombination, SNPs are not independent from neighboring variants. They are often in linkage disequilibrium.

Because of LD, a SNP may be associated with disease only through its correlation with a different, functional variant.

Summary stats for a SNP across hundreds/thousands of individuals:

Allele frequencies: 33% C / 77% T for cases and 45% C / 55% T for controls; P = 10^-8

Genotype counts: CC = 508 / CT = 250 / TT = 108; OR = 1.8
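As a minimal sketch of how such summary statistics are derived, the Python below computes allele frequencies from the genotype counts on this slide. The function names and the frequencies passed to the odds-ratio helper are illustrative assumptions, not values or methods from the study.

```python
def allele_frequencies(n_cc, n_ct, n_tt):
    """Return (freq_C, freq_T) from diploid genotype counts.

    Each CC individual carries two C alleles, each CT carries one.
    """
    total_alleles = 2 * (n_cc + n_ct + n_tt)
    freq_c = (2 * n_cc + n_ct) / total_alleles
    return freq_c, 1.0 - freq_c


def allelic_odds_ratio(freq_case, freq_control):
    """Odds of carrying the allele in cases divided by the odds in controls."""
    odds_case = freq_case / (1.0 - freq_case)
    odds_control = freq_control / (1.0 - freq_control)
    return odds_case / odds_control


# Genotype counts taken from the slide.
freq_c, freq_t = allele_frequencies(508, 250, 108)
print(f"C = {freq_c:.3f}, T = {freq_t:.3f}")

# Hypothetical case/control frequencies, chosen only to exercise the helper.
example_or = allelic_odds_ratio(0.40, 0.30)
print(f"allelic OR = {example_or:.2f}")
```

Only these per-SNP numbers, not the individual genotypes behind them, are what a study shares as aggregate data.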

Page 3: Resolving membership in a study in shared aggregate genetics data

Resolving Identity from aggregate genetics data

GWAS are expensive, requiring genotyping of thousands of individuals.

They often require consortiums of consortiums.

Sharing individual-level data was and is a challenge.

Sharing meta-data is a reasonable option.

In 2007, summary allele frequencies and genotype counts were routinely placed on the web for all SNPs.

In 2008, after broad deliberation with the scientific community, we published a forensics paper showing that even with only crude estimates of allele frequency, one could still resolve individuals.

"Resolve" is the term we purposely use. "Identify" has multiple meanings, particularly in GWAS studies.

Page 4: Resolving membership in a study in shared aggregate genetics data

Example Aggregate Data

SNP         % A allele (~500 cases)   % A allele (~500 controls)
rs903252    25%                       26%
rs232323    15%                       15%
rs323555    29%                       29%
rs232343    73%                       75%
rs233432    21%                       22%
rs234312    5.1%                      5.1%
rs163232    3.1%                      2.8%
rs8392731   15%                       16%
rs238764    7.3%                      7.1%
rs383745    45%                       54%

Other SNP Aggregate Data Types: Genotypes, odds ratios, p-values, etc.

Page 5: Resolving membership in a study in shared aggregate genetics data

Visual example (SNP data as visualized)

Pixel intensity by genotype: AA = 1.0 / AB = 0.5 / BB = 0

250,000 pixels

Page 6: Resolving membership in a study in shared aggregate genetics data

Merge 96 independent data images equally

Page 7: Resolving membership in a study in shared aggregate genetics data

After merging, individual images still resolvable

Panels: No Adjustment; Auto Contrast & Smooth Filter
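The image analogy can be sketched numerically. The Python below is an illustration under stated assumptions, not the procedure used for the slides: random noise images stand in for the real ones, the image size is reduced from 250,000 to 20,000 pixels so it runs quickly, and Pearson correlation is used as a simple stand-in for "still resolvable".

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)   # fixed seed so the run is repeatable
N_IMAGES = 96            # number of merged images, as on the slide
N_PIXELS = 20_000        # assumption: smaller than the slide's 250,000


def random_image():
    """One image whose pixels use the genotype coding AA=1.0, AB=0.5, BB=0."""
    return [rng.choice((0.0, 0.5, 1.0)) for _ in range(N_PIXELS)]


images = [random_image() for _ in range(N_IMAGES)]

# Merge equally: each pixel of the merged image is the mean across images.
merged = [sum(pixels) / N_IMAGES for pixels in zip(*images)]

member_r = pearson(images[0], merged)        # image that went into the merge
outsider_r = pearson(random_image(), merged)  # image that did not

print(f"member:   r = {member_r:.3f}")
print(f"outsider: r = {outsider_r:.3f}")
```

Each member contributes only 1/96 of the merged signal, yet with many pixels that small contribution is measurable, while an image outside the merge correlates near zero.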

Page 8: Resolving membership in a study in shared aggregate genetics data

Conceptual Approach

SNP        Reference Data Set   Data Set of Question   Person Of Interest   Directional score
rs903252   25%                  35%                    100%                 +10
rs232323   15%                  13%                    50%                  -2
rs323555   29%                  39%                    100%                 +10
rs232343   73%                  51%                    0%                   +22
rs233432   21%                  32%                    100%                 +11
rs234312   5%                   15%                    50%                  +10
rs163232   3%                   0%                     0%                   +3
…..        …..                  …..                    …..                  …..
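The directional scores in this table can be reproduced with a simple rule: the person's distance from the reference minus the person's distance from the data set in question. The Python sketch below applies that rule to the rows shown. The rule is inferred from the table values, and the function and variable names are illustrative assumptions.

```python
from statistics import mean

# (SNP, reference %, data set in question %, person of interest %)
rows = [
    ("rs903252", 25, 35, 100),
    ("rs232323", 15, 13, 50),
    ("rs323555", 29, 39, 100),
    ("rs232343", 73, 51, 0),
    ("rs233432", 21, 32, 100),
    ("rs234312", 5, 15, 50),
    ("rs163232", 3, 0, 0),
]


def directional_score(reference, dataset, person):
    """Positive when the person is closer to the data set than to the reference."""
    return abs(person - reference) - abs(person - dataset)


scores = [directional_score(ref, ds, person) for _, ref, ds, person in rows]
print(scores)        # [10, -2, 10, 22, 11, 10, 3]
print(mean(scores))  # a mean well above zero points toward membership
```

A single SNP says little, but the scores share a direction when the person is in the data set, so the signal accumulates across many SNPs.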

Page 9: Resolving membership in a study in shared aggregate genetics data

Equations (one approach of many!!)
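The equations themselves did not survive in this transcript. As a hedged reconstruction, one formulation consistent with the directional scores in the table on Page 8 is given below. The symbols are assumptions introduced here: $Y_j$ is the person of interest's allele frequency at SNP $j$, $Pop_j$ the reference data set, $M_j$ the data set in question, and $s$ the number of SNPs.

```latex
% Directional score for SNP j (matches the table: |100-25| - |100-35| = +10)
D(Y_j) = \lvert Y_j - Pop_j \rvert - \lvert Y_j - M_j \rvert

% One-sample t-statistic over all s SNPs, testing whether the mean score is zero
T(Y) = \frac{\operatorname{mean}\bigl(D(Y_j)\bigr) - 0}{\operatorname{sd}\bigl(D(Y_j)\bigr) / \sqrt{s}}
```

Under the null hypothesis that the person is in neither data set, the mean of $D$ is expected to be near zero. A large positive $T$ suggests the person is closer to the data set in question than to the reference.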

Page 10: Resolving membership in a study in shared aggregate genetics data

Resolving Individuals in Aggregate Data Sets

Page 11: Resolving membership in a study in shared aggregate genetics data

Results on pooled samples

Page 12: Resolving membership in a study in shared aggregate genetics data

Impact

NIH policy was changed.

Summary-level data is no longer freely available on the web in a distributed, unrestricted manner.

Additional papers refined the math and described limitations.

Page 13: Resolving membership in a study in shared aggregate genetics data

Managing Risk

Distributing results of studies on human subjects inherently increases the risk of a person being identifiable.

Context is important. The concept of Positive Predictive Value (PPV) can provide a measure.

PPV can also account for ‘at-risk’ populations.

Currently working with NIH on guidance for measuring risk with a given dataset.

The approaches leveraged a critical concept of directionality, specific to genotype data and frequency tables.

P-values represent a fundamentally different datatype with low information content.
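To make the role of context concrete, here is a minimal sketch of the standard PPV calculation from Bayes' rule. The sensitivity, specificity, and prior values are hypothetical illustrations and do not come from the study or from NIH guidance.

```python
def positive_predictive_value(sensitivity, specificity, prior):
    """Probability that a person flagged as a study member really is one.

    prior is the chance, before testing, that the person is in the study.
    """
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Hypothetical test characteristics, for illustration only.
SENSITIVITY = 0.99
SPECIFICITY = 0.999

# Person drawn from a large general population: membership is very unlikely a priori.
ppv_general = positive_predictive_value(SENSITIVITY, SPECIFICITY, prior=1e-5)

# Person from an 'at-risk' group, where membership is far more plausible a priori.
ppv_at_risk = positive_predictive_value(SENSITIVITY, SPECIFICITY, prior=0.01)

print(f"general population PPV: {ppv_general:.4f}")
print(f"at-risk population PPV: {ppv_at_risk:.4f}")
```

The same test gives very different PPVs depending on the prior, which is why the risk of a given dataset depends on who is plausibly in it.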

Page 14: Resolving membership in a study in shared aggregate genetics data

A new era

Page 15: Resolving membership in a study in shared aggregate genetics data

The era of whole-genome sequencing is approaching

SNPs are common variants, usually defined as having a minor allele frequency greater than 1%.

Whole-genome sequencing and exome sequencing inherently measure rare variants.

Rare variants can be highly informative, particularly in combination.

Approaches need to be explored for summarizing results without revealing identity.

Page 16: Resolving membership in a study in shared aggregate genetics data

Acknowledgements

Lab: Jennifer Dinh; Szabolcs Szelinger; Holly Benson; Meredith Sanchez-Castillo; Brooke Hjelm

Informatics: Nils Homer, Ph.D.; Tyler Izatt; Jessica Aldrich; Alexis Christoforides; Ahmet Kurdoglu; James Long; Shripad Sinari

Funding: NINDS U24NS051872; State of Arizona; NHGRI U01HG005210; This work: ENDGAME (NHLBI U01 HL086528)

Page 17: Resolving membership in a study in shared aggregate genetics data

Thank you