Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Integrative analysis in 1000 Genomes data. Fuli Yu. BioInfoSummer, Adelaide, Australia, 2012.

1000 Genomes - A deep catalog of Human Genetic Variation


Page 1: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


Integrative analysis in 1000 Genomes data

Fuli Yu

BioInfoSummer, Adelaide Australia

2012

Page 2: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Outline

• Background overview of 1000G

• 1000G Phase I results

• BCM NGS variation analysis software

• Further development and timeline


Page 3: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


The history before the 1000 Genomes Project

- Phase I and II: common SNPs in CEU, CHB, JPT, YRI
- HapMap3: 11 populations
- Patterns of linkage disequilibrium and haplotypes defined genome-wide

www.hapmap.org

Impacts

• Complex disease gene mapping: GWAS
• Characteristics of human genome variants: allele frequency spectrum, LD patterns, recombination rate variation…
• Population genetics: selection, migration, drift, admixture

1,449 published GWA at p ≤ 5×10⁻⁸ for 237 traits

Page 4: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


Disease mutations are likely rare and heterogeneous

McClellan J and King M-C, 2010

‘Clan Genomics’ Lupski JR et al. 2011

Page 5: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


The quest for rare genetic variation

Gibbs R 2005

HapMap

1000G

Page 6: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


Project goal

“…sequence a large number of people, to provide a comprehensive resource on human genetic variation…”

“…find most genetic variants that have frequencies of at least 1% in the populations studied…”

www.1000genomes.org

Page 7: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

1000 Genomes Project Design and Progress

• Pilot data collected in 2008; paper published October 2010 in Nature
  – Companions in Science and Genome Research
  – Other companions later

• Full project data collection and analysis underway
  – Phase 1 results published Nov 1st 2012
  – Phase 2 / Phase 3 being completed

• Sequencing completion: early 2013
  – Analysis completion in 2013-2014

Page 8: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


Nature, Oct 2010

-179 WGS, 700 exon seq

-15M new SNPs

-CNV group

-Exon group

Page 9: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

1000 Genomes Project Design and Progress


Page 10: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


1000G Phase I populations

Page 11: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Mark DePristo

Page 12: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

An integrative map of 40 million variants

[Figure: call sets from low-pass genomes and deep exomes merged into one integrated genotype set]

SNPs 38M

INDELs 1.4M

SVs 14k

Integrated Genotypes ~40M

Hyun Min Kang

Page 13: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

1000 Genomes Project Design and Progress


Page 14: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Discovery power

• 1% SNPs

– 99.3% genome / 99.8% exome

• 0.1% SNPs

– 70% genome / 90% exome

- Exome: high r² > 0.9
- With LD information, WGS genotypes:
  - improve by 30-40% at MAF ≥ 1%
  - are unchanged at MAF < 0.1%
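One reason 0.1% variants are harder to discover than 1% variants is simple sampling: a rarer allele is less likely to be carried by anyone in the panel at all. The sketch below computes only that presence probability, assuming independently sampled chromosomes. It is not the project's power analysis, and the function name and the sample size in the usage note are illustrative assumptions.

```python
def presence_probability(allele_freq, n_individuals):
    """Chance that at least one copy of an allele is present in the sample.

    Assumes 2 * n_individuals chromosomes drawn independently from a population
    in which the allele has frequency allele_freq (an idealized model).
    """
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


if __name__ == "__main__":
    # Hypothetical panel of 1000 individuals, chosen only for illustration.
    for freq in (0.01, 0.001):
        print(freq, round(presence_probability(freq, 1000), 4))
```

Being present in the sample is necessary but not sufficient for discovery; low-pass coverage and calling thresholds reduce the actual power further, which is what the percentages on the slide reflect.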

Page 15: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Phase 1 variants are of high quality

Page 16: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Overall genotype accuracy at ~99%

Hyun Min Kang

Page 17: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Hyun Min Kang

Sensitivity >96% in a given genome

Page 18: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Rare variation is population specific

• 17% of low-frequency variants (0.5-5%) are found in a single ancestry group

• 53% of variants below 0.5% frequency are found in a single population

• African populations have many more low-frequency variants, due to bottlenecks on other lineages

• All populations are enriched in rare variants
  – Explosive recent population growth

Slide Courtesy of Paul Flicek Adam Auton, Gil McVean

Page 19: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Rare variants identify recent historical links between populations

48% of IBS variants shared with American populations

ASW shows stronger sharing with YRI than LWK

Adam Auton, Gil McVean

Page 20: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

The proportion of rare variants by conservation

Tuuli Lappalainen


Page 23: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Implication for GWAS imputation

Bryan Howie, Hyun Min Kang

Page 24: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

BCM NGS PIPELINES: ATLAS2 & SNPTOOLS


Page 25: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


Overview of NGS variation analysis pipelines

Nielsen R 2011

SNPTools Atlas2

Page 26: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Atlas uses logistic regression: systematic errors

$$\log\frac{\Pr(\mathrm{SNP})_i}{1-\Pr(\mathrm{SNP})_i} = \alpha + b_1\,\mathrm{RawQuality} + b_2\,\mathrm{Swap} + b_3\,\mathrm{NQS} + b_4\,\mathrm{Dist}$$

Items                                  Value (from our training experiment)   Z score   Significance (p-value)
Intercept α                            -3.3                                   -39       <2e-16
Coefficient b1 for raw quality score   0.11                                   19        <2e-16
Coefficient b2 for swap                -3.5                                   28        <2e-16
Coefficient b3 for NQS                 0.26                                   3         0.001
Coefficient b4 for relative position   -0.37                                  -4        0.0005

[Figure: reads i = 1, 2, ..., n aligned to the reference sequence at sites j = 1 (0/1), 2 (0/0), ..., m (0/0), distinguishing reads harboring reference alleles from reads harboring substitutions]

Shen et al. 2010 Genome Research
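To make the per-read logistic model on this slide concrete, the following sketch evaluates Pr(SNP) for one read-level substitution using the coefficients listed in the table. How each covariate is encoded (for example, swap and NQS as 0/1 indicators, and relative position as a number) is an assumption made here for illustration; the slide does not specify it, and this is not the Atlas-SNP2 code.

```python
import math

# Coefficients copied from the slide's table (authors' training experiment).
ALPHA = -3.3          # intercept
B_RAW_QUALITY = 0.11  # b1: raw quality score
B_SWAP = -3.5         # b2: swap
B_NQS = 0.26          # b3: NQS
B_DIST = -0.37        # b4: relative position


def pr_snp_for_read(raw_quality, swap, nqs, dist):
    """Pr(SNP)_i for one substitution on one read via the logistic model.

    The covariate encodings are hypothetical: swap and nqs are treated as
    0/1 indicators and dist as a numeric relative position.
    """
    logit = (ALPHA + B_RAW_QUALITY * raw_quality + B_SWAP * swap
             + B_NQS * nqs + B_DIST * dist)
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    print(pr_snp_for_read(raw_quality=30, swap=0, nqs=1, dist=0.0))
    print(pr_snp_for_read(raw_quality=30, swap=1, nqs=1, dist=0.0))
```

With these coefficients a swap flag pulls the probability down sharply, while a higher raw quality score raises it.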

Page 27: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

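Posterior Pr(SNP) using Bayesian inference

[Figure: reads i = 1, 2, ..., n aligned to the reference sequence at sites j = 1 (0/1), 2 (0/0), ..., m (0/0), distinguishing reads harboring reference alleles from reads harboring substitutions]

$$\Pr(\mathrm{error})_i = 1 - \Pr(\mathrm{SNP})_i$$

$$\Pr(\mathrm{error})_j = \prod_i \Pr(\mathrm{error})_i$$

$$\Pr(\mathrm{SNP})_j = 1 - \Pr(\mathrm{error})_j = S_j$$

$$\Pr(\mathrm{SNP} \mid S_j, c) = \frac{\Pr(S_j \mid \mathrm{SNP}, c)\,\mathrm{prior}(\mathrm{SNP} \mid c)}{\Pr(S_j \mid \mathrm{SNP}, c)\,\mathrm{prior}(\mathrm{SNP} \mid c) + \Pr(S_j \mid \mathrm{error}, c)\,\mathrm{prior}(\mathrm{error} \mid c)}$$

Shen et al. 2010 Genome Research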
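The two steps on this slide, combining per-read probabilities into a site score S_j and then applying Bayes' rule, can be sketched as follows. The likelihood and prior values passed in are hypothetical placeholders; in the published method they come from training data, which this sketch does not reproduce.

```python
import math


def site_score(read_snp_probs):
    """S_j = 1 - prod_i (1 - Pr(SNP)_i) over the reads carrying the substitution."""
    pr_error_site = math.prod(1.0 - p for p in read_snp_probs)
    return 1.0 - pr_error_site


def posterior_snp(lik_given_snp, lik_given_error, prior_snp):
    """Pr(SNP | S_j, c) by Bayes' rule.

    lik_given_snp   stands for Pr(S_j | SNP, c)    (hypothetical input)
    lik_given_error stands for Pr(S_j | error, c)  (hypothetical input)
    prior_snp       stands for prior(SNP | c); prior(error | c) = 1 - prior_snp
    """
    prior_error = 1.0 - prior_snp
    numerator = lik_given_snp * prior_snp
    return numerator / (numerator + lik_given_error * prior_error)


if __name__ == "__main__":
    s_j = site_score([0.5, 0.5])
    print(s_j)
    print(posterior_snp(lik_given_snp=0.8, lik_given_error=0.1, prior_snp=0.001))
```

Even a score that is much more likely under the SNP hypothesis yields a modest posterior when the prior is small, which is why the prior matters at this step.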

Page 28: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


Exome data summary

• 1128 (822 Illumina/306 SOLiD) samples in 20110521.alignment.index

– 822 Illumina BAMs

• MOSAIK

– 306 SOLiD BAMs

• BFAST

• SNPs are called using Atlas-SNP2 at BCM

Page 29: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Exome SNP calls on consensus target regions

[Venn diagram: Baylor Exome versus VQSR v2b call sets]

Call set               #SNP      dbSNP    Ti/Tv
Intersection           238,356   48.5%    3.35
Baylor Exome Unique    218,739   8.2%     2.97
VQSR v2b Unique        23,096    15.3%    2.67

Platform           #Sample   #SNP      %dbSNP b132   Known Ti/Tv (merged / per-sample)   Novel Ti/Tv (merged / per-sample)
Illumina + SOLiD   1128      457,095   29.23%        3.47 / 3.41                         3.05 / 2.97
SOLiD              306       244,736   42.05%        3.54 / 3.51                         3.19 / 3.03
Illumina           822       348,599   35.94%        3.46 / 3.37                         2.99 / 2.95
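The Ti/Tv values in these tables are transition-to-transversion ratios, a common sanity check on SNP call sets. A minimal sketch of the computation, assuming each call is a biallelic SNP given as a (ref, alt) pair of distinct bases from A, C, G, T; the function name and input format are illustrative and not taken from any of the callers named on the slide.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Return transitions / transversions for a list of (ref, alt) base pairs.

    Assumes every pair is a valid biallelic SNP, so anything that is not a
    transition is counted as a transversion.
    """
    n_ti = sum(1 for ref, alt in snps if (ref.upper(), alt.upper()) in TRANSITIONS)
    n_tv = len(snps) - n_ti
    if n_tv == 0:
        raise ValueError("no transversions; Ti/Tv is undefined")
    return n_ti / n_tv


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
    print(ti_tv_ratio(calls))
```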

Page 30: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

SNPTools pipeline overview

Raw Sequence Reads (FASTQ)
Short Reads Mapping
Base Quality Recalibration
Binary sequence Alignment/Map Files (BAM)

Effective Base Depth
• Novel Effective Base Depth (EBD) summarization for each BAM
• High-performance IO, small disk footprint (1~2GB per BAM)

SNP Site Discovery
• Novel variance-ratio-based site discovery statistics
• High sensitivity and specificity

Sequence Genotype Likelihood
• Novel BAM-specific binomial mixture modeling (BBMM)
• Captures BAM heterogeneity

Existing Genotype Integration
• ‘Dynamic linking’ of multiple existing genotype datasets in a Bayesian style
• Significantly improves both existing genotypes and sequence calls

Genotype Imputation
• Novel imputation engine
• High genotyping and phasing accuracy

Haplotype with Confidence Score (VCF)
Downstream Analysis

Page 31: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


EBD file format

Page 32: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

New algorithm for Genotype Likelihood

• Challenges in Raw Genotype Likelihood
  1. Mapping/sequencing errors in site discovery
  2. BAM heterogeneity, potential contamination

• Solutions
  1. Novel concept of Effective Base Depth (EBD) to summarize sequence details
  2. BAM-specific binomial mixture model handles BAM heterogeneity

Page 33: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Rationale

• BAM-specific modeling
  – Using whole-genome VQSR sites
  – Perform 3-component BBMM on each BAM using Phase I VQSR (38M) SNP sites
  – High-precision modeling with 38M data points!
  – Enables SNP-array-free QC on individual BAMs

[Figure: 1094 BAMs by 39M VQSR SNPs, contrasting site-specific and BAM-specific modeling]

Site-specific modeling: small learning size; BAM heterogeneity; low accuracy for alt/alt
BAM-specific modeling: huge learning size; high accuracy for alt/alt; usable as one QC metric

$$P(r_i) = \sum_{g \in \{rr,\, ra,\, aa\}} w_g \, B(r_i + a_i,\, e_g)$$
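A 3-component binomial mixture of the kind named on this slide can be fitted to one BAM's read counts with a standard EM loop. The sketch below is a generic EM fit under stated assumptions, not the SNPTools implementation: each site contributes a (reference reads, alternate reads) pair, e_g is read as the expected reference-read fraction under genotype g, and the starting values, iteration count, and function names are all illustrative.

```python
import math

GENOTYPES = ("rr", "ra", "aa")


def binom_pmf(k, n, p):
    """Probability of k successes in n trials with success rate p."""
    return math.comb(n, k) * (p ** k) * ((1.0 - p) ** (n - k))


def fit_bbmm(counts, init_rates=(0.95, 0.5, 0.05), n_iter=50, eps=1e-6):
    """EM fit of a 3-component binomial mixture to (ref, alt) read counts.

    Returns (weights, rates): mixture weights w_g and reference-read rates e_g,
    ordered as in GENOTYPES. Initial values are assumptions for this sketch.
    """
    weights = [1.0 / 3.0] * 3
    rates = list(init_rates)
    for _ in range(n_iter):
        # E-step: responsibility of each genotype component for each site.
        resp = []
        for r, a in counts:
            joint = [w * binom_pmf(r, r + a, e) for w, e in zip(weights, rates)]
            total = sum(joint)
            resp.append([x / total for x in joint])
        # M-step: re-estimate weights and per-component reference-read rates.
        for g in range(3):
            mass = sum(row[g] for row in resp)
            weights[g] = mass / len(counts)
            ref = sum(row[g] * r for row, (r, a) in zip(resp, counts))
            depth = sum(row[g] * (r + a) for row, (r, a) in zip(resp, counts))
            if depth > 0:
                # Clamp away from 0 and 1 so no component assigns zero probability.
                rates[g] = min(max(ref / depth, eps), 1.0 - eps)
    return weights, rates


def genotype_likelihoods(r, a, rates):
    """Likelihood of the observed counts under each genotype's fitted rate."""
    return [binom_pmf(r, r + a, e) for e in rates]


if __name__ == "__main__":
    toy = [(10, 0)] * 60 + [(5, 5)] * 30 + [(0, 10)] * 10
    w, e = fit_bbmm(toy)
    print(dict(zip(GENOTYPES, w)), dict(zip(GENOTYPES, e)))
```

Because the rates are learned per BAM, a sample with unusual error rates or contamination shows up as shifted e_g values, which is consistent with the slide's suggestion to use the fit as a QC metric.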

Page 34: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


BBMM overcomes platform heterogeneity

Page 35: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


SOLiD GL: BBMM better than Samtools

HM3

OMNI

Hyun Min Kang Univ Mich

Page 36: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


Improvement of using BBMM GL also seen in Beagle

Hyun Min Kang Univ Mich

Page 37: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


SNPTools Imputation – ‘Constraint Li-Stephens’

Page 38: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Phase I Genotypes: Chr1, Chr20 (released 2011-05-08)

             OMNI                        HM3                         Axiom
call set     AA    RA    RR    non-ref   AA    RA    RR    non-ref   AA    RA    RR    non-ref
chr1         1.03  1.02  0.19  1.43      1.64  0.86  0.21  1.43      0.85  1.38  0.19  1.51
chr20        1.02  1.18  0.23  1.60      1.22  0.88  0.25  1.30      1.33  1.48  0.22  1.85
chr20 V4     1.33  1.21  0.37  2.02      1.20  0.83  0.25  1.26      1.36  1.45  0.21  1.83
chr20*       0.99  1.17  0.22  1.57      1.18  0.88  0.25  1.28      1.23  1.47  0.21  1.79
chr20 V4*    1.01  1.11  0.22  1.52      1.18  0.83  0.24  1.25      1.24  1.44  0.21  1.77

• chr1 and chr20 are based on new VQSR sites
• chr20 V4 is based on old VQSR sites
• chr20* and chr20 V4* are the overlapped sites between new VQSR and old VQSR

Chr20 genotype call set
• Better OMNI concordance than V4 due to site/allele selection improvement
• Similar accuracy on overlapped sites

Chr1 genotype call set
• Slightly better than chr20 call set

Page 39: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


Phasing accuracy evaluation

Page 40: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Integrating known array genotypes

[Figure: sample-by-site matrix of raw genotype probabilities overlaid with known genotypes]

Direct re-weighting of overall accuracy. Improvement is in proportion to the number of known genotypes integrated.

Imputation improvement of on-array accuracy. Known genotypes are treated as 99.98% confidence priors, which are still improvable.

Imputation improvement of off-array accuracy. Makes full use of the LD between on-array and off-array sites.
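Treating a known array genotype as a 99.98% confidence prior, rather than as fixed truth, can be illustrated with a direct re-weighting of one site's raw genotype probabilities. This is a minimal sketch: splitting the remaining 0.02% evenly across the other genotypes is an assumption made here, and the function name and input layout are hypothetical rather than taken from SNPTools.

```python
def integrate_known_genotype(raw_probs, known_index, confidence=0.9998):
    """Re-weight raw genotype probabilities by a prior centred on an array call.

    raw_probs   : sequence-based probabilities, e.g. for (rr, ra, aa)
    known_index : position of the array genotype within raw_probs
    confidence  : prior mass placed on the array genotype; the remainder is
                  split evenly over the other genotypes (an assumption)
    """
    n = len(raw_probs)
    other = (1.0 - confidence) / (n - 1)
    prior = [confidence if g == known_index else other for g in range(n)]
    unnormalized = [p * q for p, q in zip(raw_probs, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Ambiguous sequence evidence: the array call (ra) dominates.
    print(integrate_known_genotype((0.6, 0.3, 0.1), known_index=1))
    # Overwhelming sequence evidence for rr can still override the array call.
    print(integrate_known_genotype((0.999999, 0.000001, 0.0), known_index=1))
```

The second call shows why the slide describes such priors as still improvable: a soft prior lets very strong sequence evidence correct an occasional array error.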

Page 41: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Integrating LowPass + ExomeOffTarget


Page 42: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Exome off-target reads are evenly distributed


Page 43: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Exome off-target reads improve sensitivity

•~5% improved sensitivity in off targets

Page 44: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

1000G NEW DEVELOPMENT & TIMELINE TO COMPLETION


Page 45: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

1000 Genomes Project Design and Progress


Page 46: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

1000G Phase 2/3 populations

ACB, CDX, GIH, KHV, PEL, CHD, GWD, MSL, ESN, PJL, BEB, STU, ITU

Page 47: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

Overview of AFR Phase 2 Call Set Sizes (chr20)

[Figure: bar charts of SNP, Indel/Cplxsub, and MNP counts for ten call sets (BCM, BI1, LU, SI1, UM, BC, BI2, OX1, OX2, SI2), grouped into alignment-based and assembly-based call sets]

Bar labels (order as extracted; mapping to individual call sets not recoverable):
SNPs: 195K, 511K, 491K, 480K, 481K, 362K, 460K, 452K, 252K, 0
Indels/Cplxsubs: 17K, 0, 0, 48K, 42K, 42K, 44K, 49K, 46K, 28K
MNPs: 0, 0, 0, 0, 0, 4K, 8K, 19K, 3K, 206

Adrian Tan, Hyun Min Kang

Page 48: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

A timeline

• Data generation (incl. LC, exome, CG, SNP arrays) by end March.

• Final alignment index from DCC by start June.

• Contributing call sets (SNP, indel, MNP, complex, SV) by end July

• Consensus and resolved site list with GLs by end August

• Integrated haplotypes by ASHG 2013

Gil McVean

Page 49: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


Acknowledgements

BCM-HGSC

• Yi Wang: SNPTOOLS

• Jin Yu: Atlas-SNP

• Danny Challis: Atlas-INDEL

• Uday Evani: VCFPRINTER

• Matthew Bainbridge

• Donna Muzny

• Jeffrey Reid

• Richard Gibbs

Boston College

• Gabor Marth

• Amit Indap

• Wen Fung Leong

• Alistair Ward

Broad Institute

• Mark DePristo

• Ryan Poplin

• Eric Banks

Stanford University

• Simon Gravel

• Carlos Bustamante

Univ of Michigan

• Goncalo Abecasis

• Hyun Min Kang

BCM-BRL

• Andrew R. Jackson

• Sameer Paithankar

• Cristian Coarfa

• Aleksandar Milosavljevic

BlueBioU@Rice University

• Kim Andrews

• Roger Moye

• Chandler Wilkerson

Page 50: Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)


Postdoc positions available

Contact

Fuli Yu

[email protected]