Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)
Integrative analysis in 1000 Genomes data
Fuli Yu
BioInfoSummer, Adelaide Australia
2012
Outline
• Background overview of 1000G
• 1000G Phase I results
• BCM NGS variation analysis software
• Further development and timeline
The history before the 1000 Genomes Project
-Phase I and II: common SNPs in CEU, CHB, JPT, YRI
-HapMap3: 11 populations
-Patterns of linkage disequilibrium and haplotypes defined genome-wide
www.hapmap.org
Impacts
• Complex diseases gene mapping – GWAS.
• Characteristics of the human genome variants: allele frequency spectrum, LD patterns, recombination rate variation…
• Population genetics: selection, migration, drift, admixture
1,449 published GWA at p ≤ 5×10⁻⁸ for 237 traits
Disease mutations are likely rare and heterogeneous
McClellan J and King M-C, 2010
‘Clan Genomics’ Lupski JR et al. 2011
The quest for rare genetic variation
Gibbs R 2005
HapMap
1000G
Project goal
“…sequence a large number of people, to provide a comprehensive resource on human genetic variation…”
“…find most genetic variants that have frequencies of at least 1% in the populations studied…”
www.1000genomes.org
1000 Genomes Project Design and Progress
• Pilot data collected in 2008; paper published October 2010 in Nature
– Companions in Science and Genome Research
– Other companions later
• Full project data collection and analysis underway
– Phase 1 results published Nov 1st 2012
– Phase 2 / Phase 3 being completed
• Sequencing completion - early 2013
– Analysis completion in 2013-2014
Nature, Oct 2010
-179 WGS, 700 exon seq
-15M new SNPs
-CNV group
-Exon group
1000G Phase I populations
Mark DePristo
An integrative map of 40 million variants
[Figure: low-pass genomes and deep exomes combined into integrated genotypes]
SNPs 38M
INDELs 1.4M
SVs 14k
Integrated Genotypes ~40M
Hyun Min Kang
Discovery power
• 1% SNPs
– 99.3% genome / 99.8% exome
• 0.1% SNPs
– 70% genome / 90% exome
- Exome: high r2 > 0.9
- With LD information, WGS genotype:
- improves MAF >= 1% by 30-40%
- unchanged for MAF < 0.1%
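As a rough sanity check on the discovery-power figures, the sketch below computes only a sampling ceiling: the chance that an allele at population frequency f is carried by at least one of 2N sampled chromosomes. It assumes independent chromosomes and N = 1,092 samples (the Phase 1 sample count, which does not appear on this slide), and it ignores coverage and calling errors, so it is an upper bound and not the project's own power calculation.

```python
def sampling_ceiling(freq: float, n_samples: int) -> float:
    """Probability that at least one of 2 * n_samples chromosomes carries the allele."""
    n_chrom = 2 * n_samples  # diploid samples
    return 1.0 - (1.0 - freq) ** n_chrom


N_SAMPLES = 1092  # assumption: Phase 1 sample count, not stated on the slide

for freq in (0.01, 0.001):
    ceiling = sampling_ceiling(freq, N_SAMPLES)
    print(f"allele frequency {freq:.1%}: sampling ceiling {ceiling:.3f}")
```

Under these assumptions the ceiling at 0.1% frequency is a little under 0.9, which suggests that the gap down to the reported 70% genome-wide figure comes from factors other than sampling, such as low coverage.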
Phase 1 variants are of high quality
Overall genotype accuracy at ~99%
Hyun Min Kang
Hyun Min Kang
Sensitivity >96% in a given genome
Rare variation is population specific
• 17% of low-frequency variants (0.5-5%) are found in a single ancestry group
• 53% of variants below 0.5% frequency are found in a single population
• African populations have many more low-frequency variants due to bottlenecks on other lineages
• All populations are enriched in rare variants
– Explosive recent population growth
Slide courtesy of Paul Flicek
Adam Auton, Gil McVean
Rare variants identify recent historical links between populations
48% of IBS variants shared with American populations
ASW shows stronger sharing with YRI than LWK
Adam Auton, Gil McVean
The proportion of rare variants by conservation
Tuuli Lappalainen
Implication for GWAS imputation
Bryan Howie, Hyun Min Kang
BCM NGS PIPELINES: ATLAS2 & SNPTOOLS
Overview of NGS variation analysis pipelines
Nielsen R 2011
SNPTools Atlas2
Atlas uses logistic regression: systematic errors
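\[ \log\frac{\Pr(\mathrm{SNP})_i}{1 - \Pr(\mathrm{SNP})_i} = \alpha + b_1\,\mathrm{RawQuality} + b_2\,\mathrm{Swap} + b_3\,\mathrm{NQS} + b_4\,\mathrm{Dist} \]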
Values derived from our training experiment:
Item                                   Value   Z score   Significance (p-value)
Intercept α                            -3.3    -39       <2e-16
Coefficient b1 for raw quality score   0.11    19        <2e-16
Coefficient b2 for swap                -3.5    28        <2e-16
Coefficient b3 for NQS                 0.26    3         0.001
Coefficient b4 for relative position   -0.37   -4        0.0005
[Figure: reads i = 1, 2, ..., n aligned to the reference sequence at sites j = 1 (0/1), 2 (0/0), ..., m (0/0); reads harboring reference alleles are distinguished from reads harboring substitutions]
Shen et al. 2010 Genome Research
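The per-read logistic model on this slide can be sketched as follows, using the coefficient values listed on the slide. The predictor encodings are assumptions made for illustration only: `swap` and `nqs` are treated as 0/1 indicators and `dist` as a relative position on an arbitrary scale; the slide does not say how Atlas2 encodes them.

```python
import math

# Coefficients as listed on the slide (values from the training experiment).
INTERCEPT = -3.3
B_RAW_QUALITY = 0.11
B_SWAP = -3.5
B_NQS = 0.26
B_DIST = -0.37


def pr_snp_read(raw_quality: float, swap: int, nqs: int, dist: float) -> float:
    """Per-read Pr(SNP)_i: inverse logit of the linear predictor.

    The argument encodings (0/1 flags, relative position) are hypothetical.
    """
    z = (INTERCEPT
         + B_RAW_QUALITY * raw_quality
         + B_SWAP * swap
         + B_NQS * nqs
         + B_DIST * dist)
    return 1.0 / (1.0 + math.exp(-z))


# Same read with and without the swap flag: the negative b2 lowers Pr(SNP)_i.
print(pr_snp_read(raw_quality=30, swap=0, nqs=1, dist=0.5))
print(pr_snp_read(raw_quality=30, swap=1, nqs=1, dist=0.5))
```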
Posterior Pr(SNP) using Bayesian inference
[Figure: reads i = 1, 2, ..., n aligned to the reference sequence at sites j = 1 (0/1), 2 (0/0), ..., m (0/0); reads harboring reference alleles are distinguished from reads harboring substitutions]
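\[ \Pr(\mathrm{error})_i = 1 - \Pr(\mathrm{SNP})_i \]
\[ \Pr(\mathrm{error})_j = \prod_i \Pr(\mathrm{error})_i \]
\[ \Pr(\mathrm{SNP})_j = 1 - \Pr(\mathrm{error})_j = S_j \]
\[ \Pr(\mathrm{SNP} \mid S_j, c) = \frac{\Pr(S_j \mid \mathrm{SNP}, c)\,\mathrm{prior}(\mathrm{SNP} \mid c)}{\Pr(S_j \mid \mathrm{SNP}, c)\,\mathrm{prior}(\mathrm{SNP} \mid c) + \Pr(S_j \mid \mathrm{error}, c)\,\mathrm{prior}(\mathrm{error} \mid c)} \]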
Shen et al. 2010 Genome Research
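A minimal sketch of the two steps on this slide: combining per-read probabilities into the site score S_j, then applying Bayes' rule. The likelihood values Pr(S_j | SNP, c) and Pr(S_j | error, c) are passed in as plain numbers because the slide does not say how they are obtained, and prior(error | c) is assumed here to be 1 − prior(SNP | c). All numeric inputs in the usage lines are made up for illustration.

```python
from math import prod


def site_score(pr_snp_reads):
    """S_j = 1 - prod_i (1 - Pr(SNP)_i) over the reads carrying the substitution."""
    pr_error_site = prod(1.0 - p for p in pr_snp_reads)
    return 1.0 - pr_error_site


def posterior_snp(lik_snp, lik_error, prior_snp):
    """Pr(SNP | S_j, c) by Bayes' rule.

    lik_snp   = Pr(S_j | SNP, c)    (hypothetical input)
    lik_error = Pr(S_j | error, c)  (hypothetical input)
    Assumes prior(error | c) = 1 - prior(SNP | c).
    """
    prior_error = 1.0 - prior_snp
    numerator = lik_snp * prior_snp
    return numerator / (numerator + lik_error * prior_error)


s_j = site_score([0.5, 0.5])  # two reads, each with Pr(SNP)_i = 0.5
print(s_j)
print(posterior_snp(lik_snp=0.8, lik_error=0.1, prior_snp=0.1))
```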
Exome data summary
• 1128 (822 Illumina/306 SOLiD) samples in 20110521.alignment.index
– 822 Illumina BAMs, aligned with MOSAIK
– 306 SOLiD BAMs, aligned with BFAST
• SNPs are called using Atlas-SNP2 at BCM
Exome SNP calls on consensus target regions
Platform          # Samples  # SNPs   % dbSNP b132  Known Ti/Tv (merged / per-sample)  Novel Ti/Tv (merged / per-sample)
Illumina + SOLiD  1128       457,095  29.23%        3.47 / 3.41                        3.05 / 2.97
SOLiD             306        244,736  42.05%        3.54 / 3.51                        3.19 / 3.03
Illumina          822        348,599  35.94%        3.46 / 3.37                        2.99 / 2.95
[Venn diagram: Baylor Exome vs. VQSR v2b call sets]
Intersection: # SNP 238,356; dbSNP 48.5%; Ti/Tv 3.35
Baylor Exome unique: # SNP 218,739; dbSNP 8.2%; Ti/Tv 2.97
VQSR v2b unique: # SNP 23,096; dbSNP 15.3%; Ti/Tv 2.67
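The Ti/Tv ratio reported for these call sets is the number of transitions (A↔G, C↔T) divided by the number of transversions. The sketch below shows one way to compute it from (ref, alt) pairs; the function names and the toy input are illustrative and not taken from Atlas-SNP2. The first assertion only checks that the SNP counts on the slide add up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Counts from the slide: intersection + Baylor-unique equals the merged total.
assert 238_356 + 218_739 == 457_095


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base substitutions."""
    ti = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue  # not a substitution, skip
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


# Toy input: two transitions and one transversion.
print(ti_tv([("A", "G"), ("C", "T"), ("A", "C")]))
```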
SNPTools pipeline overview
Raw Sequence Reads (FASTQ) → Short Reads Mapping → Base Quality Recalibration → Binary sequence Alignment/Map Files (BAM)
Effective Base Depth
•Novel Effective Base Depth (EBD) summarization for each BAM
•High performance IO, small disk footprint (1~2GB per BAM)
SNP Site Discovery
•Novel variance ratio based site discovery statistics
•High sensitivity and specificity
Sequence Genotype Likelihood
•Novel BAM-specific binomial mixture modeling (BBMM)
•Captures BAM heterogeneity
Existing Genotype Integration
•‘Dynamic linking’ of multiple existing genotype datasets in a Bayesian style
•Significantly improves both existing genotypes and sequence calls
Genotype Imputation
•Novel imputation engine
•High genotyping and phasing accuracy
Haplotype with Confidence Score (VCF) → Downstream Analysis
EBD file format
New algorithm for Genotype Likelihood
• Challenges in Raw Genotype Likelihood
1. Mapping/sequencing errors in site discovery
2. BAM heterogeneity, potential contamination
• Solutions
1. Novel concept of Effective Base Depth (EBD) to summarize sequence details
2. BAM-specific binomial mixture model handles BAM heterogeneity
Rationale
• BAM-specific modeling
– Using whole-genome VQSR sites
– Perform 3-component BBMM on each BAM using Phase I VQSR (38M) SNP sites
– High precision modeling with 38M data points!
– Enables SNP-array-free QC on individual BAMs
[Figure: data matrix of 1094 BAMs × 39M VQSR SNPs, modeled either per site or per BAM]
• Site-specific modeling: small learning size, BAM heterogeneity, low accuracy for alt/alt
• BAM-specific modeling: huge learning size, high accuracy for alt/alt, usable as one QC metric
\[ P(r_i) = \sum_{g \in \{rr,\, ra,\, aa\}} w_g \, B(r_i + a_i,\, e_g) \]
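To make the three-component mixture concrete, here is a small sketch that fits P(r_i) = Σ_g w_g · B(r_i + a_i, e_g) to the read counts of a single BAM. Several things are assumptions and not taken from the slides: r_i and a_i are read as reference and alternate depths at site i, B is read as the binomial probability of r_i reference reads out of r_i + a_i with rate e_g, and the fitting uses a plain EM loop, which may differ from how SNPTools estimates the parameters. The starting values and the toy counts are hypothetical.

```python
from math import comb

GENOTYPES = ("rr", "ra", "aa")
EPS = 1e-6  # keeps fitted rates away from exactly 0 or 1


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def genotype_likelihoods(r, a, rates):
    """P(r reference reads out of r + a | genotype g) for each genotype g."""
    return {g: binom_pmf(r, r + a, rates[g]) for g in GENOTYPES}


def fit_bbmm(counts, weights, rates, n_iter=50):
    """EM fit of mixture weights w_g and reference-read rates e_g for one BAM.

    counts: list of (r_i, a_i); weights and rates: dicts keyed by genotype.
    """
    w, e = dict(weights), dict(rates)
    for _ in range(n_iter):
        # E-step: responsibility of each genotype component for each site.
        resp = []
        for r, a in counts:
            gl = genotype_likelihoods(r, a, e)
            joint = {g: w[g] * gl[g] for g in GENOTYPES}
            total = sum(joint.values())
            resp.append({g: joint[g] / total for g in GENOTYPES})
        # M-step: re-estimate weights and per-component rates.
        for g in GENOTYPES:
            mass = sum(z[g] for z in resp)
            ref_reads = sum(z[g] * r for z, (r, a) in zip(resp, counts))
            depth = sum(z[g] * (r + a) for z, (r, a) in zip(resp, counts))
            w[g] = mass / len(counts)
            if depth > 0:
                e[g] = min(max(ref_reads / depth, EPS), 1.0 - EPS)
    return w, e


# Toy counts for one BAM: mostly ref/ref sites, some hets, few alt/alt.
toy_counts = [(10, 0)] * 60 + [(5, 5)] * 30 + [(0, 10)] * 10
start_w = {"rr": 1 / 3, "ra": 1 / 3, "aa": 1 / 3}
start_e = {"rr": 0.99, "ra": 0.5, "aa": 0.01}
fitted_w, fitted_e = fit_bbmm(toy_counts, start_w, start_e)
print(fitted_w, fitted_e)
```

Because every site in the BAM contributes to the same three components, the fit uses far more data points than a per-site model would, which is the "huge learning size" argument made on the slide.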
BBMM overcomes platform heterogeneity
SOLiD GL: BBMM better than Samtools
HM3
OMNI
Hyun Min Kang Univ Mich
Improvement from using BBMM GLs is also seen in Beagle
Hyun Min Kang Univ Mich
SNPTools Imputation – ‘Constraint Li-Stephens’
Phase I Genotypes: Chr1, Chr20 (released 2011-05-08)
call set     OMNI                        HM3                         Axiom
             AA    RA    RR    non-ref   AA    RA    RR    non-ref   AA    RA    RR    non-ref
chr1         1.03  1.02  0.19  1.43      1.64  0.86  0.21  1.43      0.85  1.38  0.19  1.51
chr20        1.02  1.18  0.23  1.60      1.22  0.88  0.25  1.30      1.33  1.48  0.22  1.85
chr20 V4     1.33  1.21  0.37  2.02      1.20  0.83  0.25  1.26      1.36  1.45  0.21  1.83
chr20*       0.99  1.17  0.22  1.57      1.18  0.88  0.25  1.28      1.23  1.47  0.21  1.79
chr20 V4*    1.01  1.11  0.22  1.52      1.18  0.83  0.24  1.25      1.24  1.44  0.21  1.77
•chr1 and chr20 are based on new VQSR sites
•chr20 V4 is based on old VQSR sites
•chr20* and chr20 V4* are the overlapped sites between new VQSR and old VQSR
Chr20 genotype call set
– Better OMNI concordance than V4 due to site/allele selection improvement
– Similar accuracy on overlapped sites
Chr1 genotype call set
– Slightly better than chr20 call set
Phasing accuracy evaluation
Integrating known array genotypes
[Figure: raw genotype probabilities combined with known genotypes; axes: samples, sites]
• Direct re-weighting of overall accuracy. Improvement is in proportion to the number of known genotypes integrated.
• Imputation improvement of on-array accuracy. Known genotypes are treated as 99.98% confidence priors, which leaves them still improvable.
• Imputation improvement of off-array accuracy. Makes full use of the LD between on-array and off-array sites.
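The idea of treating a known array genotype as a 99.98% confidence prior can be illustrated with a simple Bayesian re-weighting of the raw genotype probabilities at one site. This is only a sketch of the principle: splitting the remaining 0.02% evenly across the other two genotypes is an assumption made here, and the function name and inputs are hypothetical, not SNPTools code.

```python
GENOTYPES = ("RR", "RA", "AA")
KNOWN_CONFIDENCE = 0.9998  # from the slide: known genotypes act as 99.98% priors


def integrate_known(raw_probs, known_genotype):
    """Re-weight raw genotype probabilities by a prior centred on the array call."""
    # Assumption: the remaining prior mass is split evenly over the other genotypes.
    other = (1.0 - KNOWN_CONFIDENCE) / (len(GENOTYPES) - 1)
    weighted = {
        g: raw_probs[g] * (KNOWN_CONFIDENCE if g == known_genotype else other)
        for g in GENOTYPES
    }
    total = sum(weighted.values())
    return {g: p / total for g, p in weighted.items()}


# Ambiguous sequence evidence: the array genotype dominates the result.
print(integrate_known({"RR": 0.6, "RA": 0.39, "AA": 0.01}, "RA"))

# Very strong sequence evidence can still outweigh the array call.
print(integrate_known({"RR": 0.999999, "RA": 0.000001, "AA": 0.0}, "RA"))
```

The second call shows why a prior short of 100% matters: the array genotype remains open to correction when the sequence data strongly disagree.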
Integrating LowPass + ExomeOffTarget
Exome off-target reads are evenly distributed
Exome off-target reads improve sensitivity
• ~5% improved sensitivity in off-target regions
1000G NEW DEVELOPMENT & TIMELINE TO COMPLETION
1000G Phase 2/3 populations
ACB, CDX, GHI, KHV, PEL, CHD, GWD, MSL, ESN, PJL, BEB, STU, ITU
Overview of AFR Phase 2 Call Set Sizes (chr20)
[Figure: bar charts of call set sizes in three panels (SNPs; Indels/Cplxsubs; MNPs) for call sets BCM, BI1, LU, SI1, UM, BC, BI2, OX1, OX2, SI2, grouped into alignment-based and assembly-based call sets]
Bar labels, SNPs: 195K, 511K, 491K, 480K, 481K, 362K, 460K, 452K, 252K, 0
Bar labels, Indels/Cplxsubs: 17K, 0, 0, 48K, 42K, 42K, 44K, 49K, 46K, 28K
Bar labels, MNPs: 0, 0, 0, 0, 0, 4K, 8K, 19K, 3K, 206
Adrian Tan, Hyun Min Kang
A time-line
• Data generation (incl. LC, exome, CG, SNP arrays) by end March.
• Final alignment index from DCC by start June.
• Contributing call sets (SNP, indel, MNP, complex, SV) by end July
• Consensus and resolved site list with GLs by end August
• Integrated haplotypes by ASHG 2013
Gil McVean
Acknowledgements
BCM-HGSC
• Yi Wang: SNPTOOLS
• Jin Yu: Atlas-SNP
• Danny Challis: Atlas-INDEL
• Uday Evani: VCFPRINTER
• Matthew Bainbridge
• Donna Muzny
• Jeffrey Reid
• Richard Gibbs
Boston College
• Gabor Marth
• Amit Indap
• Wen Fung Leong
• Alistair Ward
Broad Institute
• Mark DePristo
• Ryan Poplin
• Eric Banks
Stanford University
• Simon Gravel
• Carlos Bustamante
Univ of Michigan
• Goncalo Abecasis
• Hyun Min Kang
BCM-BRL
• Andrew R. Jackson
• Sameer Paithankar
• Cristian Coarfa
• Aleksandar Milosavljevic
BlueBioU@Rice University
• Kim Andrews
• Roger Moye
• Chandler Wilkerson