Mar2013 RM Characterization Working Group


Page 1: Mar2013 RM Characterization Working Group

© 2010 Illumina, Inc. All rights reserved.

Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

Platinum Genomes: Towards a comprehensive truth data set

Michael A. Eberle

Morten Kallberg, Han-Yu Chuang

Page 2: Mar2013 RM Characterization Working Group


Platinum Genome project: Goals

Problem: No comprehensive truth set of variant calls for validation

Solution: Sequence and analyze a large family pedigree

Use Mendelian inheritance to identify good / bad variant calls

– Including SNPs, indels & SVs

Aggressively incorporate variant calls

– Incorporate multiple algorithms and sequencing technologies

– Do not limit this just to what is currently easy to call

Make the data available publicly

– Both raw data and processed calls with accuracy assessment

Re-assess algorithms against better truth data

– Better and more comprehensive truth data will allow for rapid advances in software

Page 3: Mar2013 RM Characterization Working Group

Using inheritance to detect conflicts: trio analysis

[Figure: trio pedigree (MOM, DAD, CHILD) showing the father's and mother's chromosomes. The child receives the blue chromosome from the mother and the green chromosome from the father, e.g. a typical trio analysis.]

In a trio analysis like this, only 50% of each parent's DNA is passed on to the child, so many of the variants will be called in only one parent

– Have no power to detect false positives in the parents

A trio analysis is also not very sensitive for detecting errors

– For example, if the father is AC and the mother is AC then the child can be AA, AC or CC and still be consistent with Mendelian inheritance

– Many errors occur at sites that are systematically het, but trio analysis assumes that these are correct
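The weakness of the trio check can be illustrated with a minimal sketch. The function names below are hypothetical and genotypes are treated as unordered two-letter strings, which is an assumption of this example rather than something stated on the slide.

```python
from itertools import product


def child_genotypes(father: str, mother: str) -> set[str]:
    """All unordered child genotypes obtainable from one allele of each parent."""
    return {"".join(sorted(f + m)) for f, m in product(father, mother)}


def is_mendelian_consistent(father: str, mother: str, child: str) -> bool:
    """True if the child's genotype could have come from these parents."""
    return "".join(sorted(child)) in child_genotypes(father, mother)


# The slide's example: both parents are het AC, so every biallelic child
# genotype passes and the trio check cannot flag an error at this site.
possible = sorted(child_genotypes("AC", "AC"))
print(possible)  # ['AA', 'AC', 'CC']
```

With two het parents the check accepts any of the three genotypes, so a miscalled child at such a site goes undetected.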

Page 4: Mar2013 RM Characterization Working Group

Using inheritance to determine accuracy: larger pedigree

[Figure: pedigree of MOM, DAD and seven CHILDREN (labelled 1 3 2 6 5 4 7). Each row below lists two alleles for each of the nine individuals.]

Possible GT patterns:

AT AA TA AA TA AA TA AA AA

TA AA AA TA AA TA AA TA TA

AA AT AT AT AA AA AA AT AA

AA TA AA AA AT AT AT AA AT

OBSERVED GENOTYPES:

AA AT AT AT AA AA AA AT AA

Page 5: Mar2013 RM Characterization Working Group

Using inheritance to determine accuracy: larger pedigree

[Figure: pedigree labelled MOM DAD 1 3 2 6 5 4 7]

Possible GT patterns (# Errors / Hamming Distance to the observed genotypes):

AT AA TA AA TA AA TA AA AA    6

TA AA AA TA AA TA AA TA TA

AA AT AT AT AA AA AA AT AA

AA TA AA AA AT AT AT AA AT

OBSERVED GENOTYPES:

AA AT AT AT AA AA AA AT AA

Page 6: Mar2013 RM Characterization Working Group

Using inheritance to determine accuracy: larger pedigree

[Figure: pedigree labelled MOM DAD 1 3 2 6 5 4 7]

Possible GT patterns (# Errors / Hamming Distance to the observed genotypes):

AT AA TA AA TA AA TA AA AA    6

TA AA AA TA AA TA AA TA TA    5

AA AT AT AT AA AA AA AT AA

AA TA AA AA AT AT AT AA AT

OBSERVED GENOTYPES:

AA AT AT AT AA AA AA AT AA

Page 7: Mar2013 RM Characterization Working Group

Using inheritance to determine accuracy: larger pedigree

[Figure: pedigree labelled MOM DAD 1 3 2 6 5 4 7]

Possible GT patterns (# Errors / Hamming Distance to the observed genotypes):

AT AA TA AA TA AA TA AA AA    6

TA AA AA TA AA TA AA TA TA    5

AA AT AT AT AA AA AA AT AA    0

AA TA AA AA AT AT AT AA AT

OBSERVED GENOTYPES:

AA AT AT AT AA AA AA AT AA

Page 8: Mar2013 RM Characterization Working Group

Using inheritance to determine accuracy: larger pedigree

[Figure: pedigree labelled MOM DAD 1 3 2 6 5 4 7]

Possible GT patterns (# Errors / Hamming Distance to the observed genotypes):

AT AA TA AA TA AA TA AA AA    6

TA AA AA TA AA TA AA TA TA    5

AA AT AT AT AA AA AA AT AA    0

AA TA AA AA AT AT AT AA AT    7

OBSERVED GENOTYPES:

AA AT AT AT AA AA AA AT AA

Page 9: Mar2013 RM Characterization Working Group

Using inheritance to determine accuracy: larger pedigree

[Figure: pedigree labelled MOM DAD 1 3 2 6 5 4 7]

Possible GT patterns (# Errors / Hamming Distance to the observed genotypes):

AT AA TA AA TA AA TA AA AA    6

TA AA AA TA AA TA AA TA TA    5

AA AT AT AT AA AA AA AT AA    0

AA TA AA AA AT AT AT AA AT    7

OBSERVED GENOTYPES:

AA AT AT AT AA AA AA AT AA

100% consistent therefore we predict that all genotypes are correct
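The comparison built up over these slides can be reproduced with a short sketch. It assumes that each row holds two alleles per individual and that genotypes are compared without regard to allele order; the helper names are hypothetical.

```python
def genotypes(row: str) -> list[frozenset[str]]:
    """Split an 18-allele row into nine unordered genotypes."""
    alleles = row.split()
    return [frozenset(alleles[i:i + 2]) for i in range(0, len(alleles), 2)]


def hamming(expected: str, observed: str) -> int:
    """Number of individuals whose genotype differs between two rows."""
    return sum(e != o for e, o in zip(genotypes(expected), genotypes(observed)))


# Rows copied from the slide: four possible patterns and the observed calls.
PATTERNS = [
    "A T A A T A A A T A A A T A A A A A",
    "T A A A A A T A A A T A A A T A T A",
    "A A A T A T A T A A A A A A A T A A",
    "A A T A A A A A A T A T A T A A A T",
]
OBSERVED = "A A A T A T A T A A A A A A A T A A"

distances = [hamming(p, OBSERVED) for p in PATTERNS]
print(distances)  # [6, 5, 0, 7]
```

A distance of 0 to one pattern means the observed genotypes are fully consistent with that transmission of the parental haplotypes.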

Page 10: Mar2013 RM Characterization Working Group

Platinum Genomes - CEPH/Utah Pedigree 1463

All 17 members sequenced to at least 50x depth (PCR-Free protocol)

– SNPs & indels called using BWA + GATK + VQSR

Each member of the trio highlighted in bold is sequenced to 200x

An additional 200x technical replicate was done for NA12882

[Figure: pedigree. Grandparents: 12889, 12890, 12891, 12892. Parents: 12877, 12878. Children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12887, 12886, 12888, 12893. Highlighted trio: 12877, 12878, 12882. Callout: analysis of SNPs in the parents and 11 children.]

Page 11: Mar2013 RM Characterization Working Group

Analysis of the data

50x raw data was aligned and variants called using BWA + GATK + VQSR

– Accurate calls were supplemented with accurate variant calls made by Cortex using the same sequence data and accurate CGI calls made across the same pedigree

The first step is to define the inheritance of the parental chromosomes to the eleven children everywhere in the genome

– Identified 709 crossover events between the parents and eleven children

Define accurate variants as those where the genotypes are 100% consistent with the transmission of the parental haplotypes

– At any position of the genome there are only 16 possible combinations of genotypes (biallelic & diploid) across the pedigree that are consistent with the inheritance pattern

– 3^13 (~1.6M) possible genotype combinations

Subsequent analysis mostly excludes all variants that are homozygous alternative across the last two generations of this pedigree (~750k)

– These will mostly be accurate, but for these “trivially consistent” sites we cannot differentiate accurate calls from systematic errors or validate ploidy
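The 16-versus-3^13 count can be checked with a small sketch. The inheritance vector below is hypothetical (it is not taken from the pedigree data), and genotypes are encoded as alt-allele counts 0, 1 or 2, which is a choice made for this example.

```python
from itertools import product

# Hypothetical inheritance at one locus: for each of the 11 children, which
# maternal (0/1) and which paternal (0/1) haplotype was transmitted.
INHERITANCE = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1),
               (0, 1), (1, 0), (0, 0), (1, 1), (0, 1)]


def consistent_patterns(inheritance):
    """Genotype vectors (mom, dad, children...) for every ref/alt assignment
    to the four parental haplotypes."""
    patterns = set()
    for m0, m1, d0, d1 in product((0, 1), repeat=4):
        mom, dad = (m0, m1), (d0, d1)
        children = tuple(mom[m] + dad[d] for m, d in inheritance)
        patterns.add((sum(mom), sum(dad)) + children)
    return patterns


patterns = consistent_patterns(INHERITANCE)
# 2^4 = 16 consistent vectors out of 3^13 possible genotype combinations.
print(len(patterns), 3 ** 13)  # 16 1594323
```

Because only 16 of roughly 1.6M combinations fit the inheritance, a set of calls is unlikely to match it by chance.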

Page 12: Mar2013 RM Characterization Working Group

[Figure: workflow diagram. Inputs: Set A, Set B, Set C. Boxes: Compare Against Inheritance; Score (plat./gold); db w/score; Assess Problem; Score (gold/silver); db w/comments; Comment; db w/comments. Branch labels: NO CONFLICTS, CONFLICTS, BIOLOGY, BAD.]

Input all possible data and use the inheritance to separate good from bad: variants are unlikely to accidentally match inheritance

Page 13: Mar2013 RM Characterization Working Group


Cataloging the accurate SNPs

Page 14: Mar2013 RM Characterization Working Group

Accurate SNP positions based on the pedigree analysis

[Figure: bar chart. x-axis: GATK Site Description* (All Pass, Filtered); y-axis: Counts (Millions), 0.0 to 3.5; bars split by Pedigree Analysis into Correct and Problematic. Labelled counts: 3,217,748 and 408,915.]

Additional 754,014 SNPs are “trivially consistent” – i.e. all 13 samples are hom alt.

Normally might exclude these from our analysis because the variant caller filtered some of the calls

*Filtered means that at least one variant call was called but quality filtered

Page 15: Mar2013 RM Characterization Working Group

Hamming distance for the “accurate” SNPs to the 2nd best solution

[Figure: histogram. x-axis: Hamming Distance, 0 to 13; y-axis: Percent, 0 to 60.]

At these sites >85% of the positions would require at least four (very specific) genotype errors to have erroneously ended up with the observed predicted-accurate calls
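One way to picture the distance to the 2nd best solution is the sketch below. It assumes a hypothetical inheritance vector for the 11 children and encodes genotypes as alt-allele counts; neither is taken from the slides, and the resulting number depends on the inheritance chosen.

```python
from itertools import product

# Hypothetical transmitted haplotypes (maternal, paternal) for 11 children.
INHERITANCE = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1),
               (0, 1), (1, 0), (0, 0), (1, 1), (0, 1)]


def pattern(alleles, inheritance):
    """Genotype vector implied by ref/alt alleles on the parental haplotypes."""
    m0, m1, d0, d1 = alleles
    mom, dad = (m0, m1), (d0, d1)
    return (m0 + m1, d0 + d1) + tuple(mom[m] + dad[d] for m, d in inheritance)


def second_best_distance(alleles, inheritance):
    """Fewest genotype changes needed to reach another consistent pattern."""
    best = pattern(alleles, inheritance)
    others = {pattern(a, inheritance) for a in product((0, 1), repeat=4)} - {best}
    return min(sum(x != y for x, y in zip(best, o)) for o in others)


# A variant carried on only one maternal haplotype.
print(second_best_distance((1, 0, 0, 0), INHERITANCE))  # 6
```

In this toy case six genotypes would all have to be wrong in a specific way for the calls to land on a different consistent pattern.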

Page 16: Mar2013 RM Characterization Working Group

Using other call sets for a more comprehensive catalogue

[Figure: bar chart. x-axis: Cortex, CGI; y-axis: Counts (x1000), 0 to 60; bars split by Pedigree Analysis into Unique and Common. Labelled counts: 22,922 (0.6%) and 57,270 (1.6%).]

Page 17: Mar2013 RM Characterization Working Group

Concordance between “pedigree-accurate” GTs

Comparison*      # Sites      # Diff GTs   # Same GTs    GT Concordance
GATK & Cortex    2,053,136    5            26,690,763    99.99998%
GATK & CGI       3,146,399    19           40,903,168    99.99995%
Cortex & CGI     1,890,718    7            24,579,327    99.99997%

*Excluding sites where alleles did not match or all samples homozygous alternative

Includes 763,085 GT calls and 264,771 positions quality filtered by GATK

Attempting to validate a sample of the sites that are unique to a single call set

– Targeting ~300 per call set
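The concordance column can be recomputed from the genotype counts. This sketch assumes concordance is defined as same / (same + diff); the slide does not state the formula, but this definition reproduces the reported SNP percentages.

```python
# (diff GTs, same GTs) per comparison, copied from the SNP concordance table.
SNP_COMPARISONS = {
    "GATK & Cortex": (5, 26_690_763),
    "GATK & CGI": (19, 40_903_168),
    "Cortex & CGI": (7, 24_579_327),
}


def concordance_percent(diff: int, same: int) -> float:
    """Percentage of compared genotypes that agree between two call sets."""
    return 100.0 * same / (same + diff)


for name, (diff, same) in SNP_COMPARISONS.items():
    print(f"{name}: {concordance_percent(diff, same):.5f}%")
```

In each row, same + diff equals 13 times the number of sites, i.e. one genotype per sample in the parents and 11 children.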

Page 18: Mar2013 RM Characterization Working Group


Indel analysis

Page 19: Mar2013 RM Characterization Working Group

Accurate GATK indel positions based on pedigree

[Figure: bar chart. x-axis: Site Description (All Pass, Filtered); y-axis: Counts (thousands), 0 to 250; bars split by Pedigree Analysis into Correct and Problematic. Labelled counts: 240,490 and 141,508.]

Additional 115,587 indels are “trivially consistent” – i.e. all 13 samples are hom alt.

Page 20: Mar2013 RM Characterization Working Group

Using other call sets for a more comprehensive catalogue

[Figure: bar chart. x-axis: Cortex, CGI; y-axis: Counts (x1000), 0 to 60; bars split by Pedigree Analysis into Unique and Common. Labelled counts: 39,335 (10%) and 9,637 (2.4%).]

Page 21: Mar2013 RM Characterization Working Group

Concordance between overlapping “accurate” indels

Comparison*      # Sites    # Diff GTs   # Same GTs   GT Concordance
GATK & Cortex    96,228     43           1,250,921    99.997%
GATK & CGI       219,445    2,817        2,514,785    99.901%
Cortex & CGI     78,050     198          1,014,650    99.981%

*Excluding sites where alleles did not match or all samples homozygous alternative

Attempting to validate a sample of the sites that are unique to a single call set

– Targeting ~300 per call set

Page 22: Mar2013 RM Characterization Working Group


CNVs

Page 23: Mar2013 RM Characterization Working Group

Conflict mode: Hemizygous deletions

[Figure: pedigree labelled MOM DAD 1 3 2 6 5 4 7]

Possible GT patterns (# Errors / Hamming Distance to the observed genotypes):

AT AA TA AA TA AA TA AA AA    6

TA AA AA TA AA TA AA TA TA    7

AA AT AT AT AA AA AA AT AA    2

AA TA AA AA AT AT AT AA AT    7

OBSERVED GENOTYPES:

AA AT AT TT AA AA AA TT AA

“Best” solution still indicates multiple errors

Page 24: Mar2013 RM Characterization Working Group

Conflict mode: Hemizygous deletions

[Figure: pedigree labelled MOM DAD 1 3 2 6 5 4 7; Hamming distances shown: 6, 5, 0, 7]

Possible GT patterns including a deletion (-):

A- AT -T AT -A AA -A AT AA

-A TA AA -A AT -T AT -A -T

-A AT AT -T AA -A AA -T -A

A- TA -A AA -T AT -T AA AT

OBSERVED GENOTYPES:

AA AT AT TT AA AA AA TT AA

100% consistent therefore we predict that there is a deletion

Hamming distance will be less when including deletions so need to be careful
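The effect of a hemizygous deletion on the consistency check can be sketched as follows. It assumes a diploid caller reports a hemizygous genotype such as "- T" as homozygous "T T", so the deleted allele is simply ignored when comparing; the helper names are hypothetical.

```python
def apparent(genotype: str) -> frozenset[str]:
    """Alleles a diploid caller can see: a deleted allele ('-') is invisible."""
    return frozenset(genotype.split()) - {"-"}


def pairs(row: str) -> list[str]:
    """Split a row of 18 alleles into nine two-allele genotypes."""
    tokens = row.split()
    return [" ".join(tokens[i:i + 2]) for i in range(0, len(tokens), 2)]


def hamming(expected: str, observed: str) -> int:
    """Individuals whose visible alleles differ between two rows."""
    return sum(apparent(e) != apparent(o)
               for e, o in zip(pairs(expected), pairs(observed)))


# Rows copied from the slides: observed calls, the best pattern without a
# deletion, and the matching pattern once a deletion haplotype is allowed.
OBSERVED = "A A A T A T T T A A A A A A T T A A"
NO_DELETION = "A A A T A T A T A A A A A A A T A A"
WITH_DELETION = "- A A T A T - T A A - A A A - T - A"

print(hamming(NO_DELETION, OBSERVED), hamming(WITH_DELETION, OBSERVED))  # 2 0
```

The two apparent errors disappear once the pattern includes the deletion, which is why these sites are predicted to overlap a hemizygous deletion.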

Page 25: Mar2013 RM Characterization Working Group

Read depth of 5,180 SNPs predicted to overlap deletions

[Figure: histogram. x-axis: Depth, 0 to 100; y-axis: Counts, 0 to 5000; series: Haploid, Diploid, Hom Del. Genotype labels: AA, AB, BB; A-, AB, -B.]

Depth shown for positions where the genotypes indicate that the SNP overlaps a deletion. Large number of children allows us to more reliably separate errors from deletions.

Page 26: Mar2013 RM Characterization Working Group


Have many potential large deletions to validate…

5,180 SNPs are predicted to overlap a hemizygous deletion

These SNPs cluster into ~902 unique events

– Clusters show evidence for ~279 deletions >1kb segregating in this pedigree

– Largest event is >152kb with 274 SNPs supporting the call

Have begun validating these events beyond just visual inspection

– 132 overlap with previously reported events (1kGP)

– Working to define the breakpoints for wet lab validation

Incorporating other calling methods (Cortex, breakdancer…)

Some SNPs also support the presence of duplications in a single parent

Page 27: Mar2013 RM Characterization Working Group

Summary

We have sequenced a large pedigree and used the inheritance information to create a catalogue of ~4.45M accurate SNP calls

– Over 3.7M biallelic SNPs agree with transmission of parental chromosomes

– Over 750k homozygous alternative SNPs are trivially accurate across the pedigree

Have also called indels using four different methods to produce over 550k “accurate” indel calls across the pedigree

– Over 428k bi-allelic indels agree with transmission of parental chromosomes

– Over 110k homozygous alternative indels are trivially accurate across the pedigree

Concordance for the bi-allelic, pedigree-accurate calls is >99.9999% for SNPs and 99.9% for indels between call sets

SVs are in progress (just deletions right now)

The SNP and indel results presented here can be used for comparison

– Incorporating homozygous reference calls across the pedigree for completeness

– May see immediate gains by testing new algorithms against a better truth set

Page 28: Mar2013 RM Characterization Working Group


Acknowledgements

Morten Kallberg – alignment & variant calling

Han-Yu Chuang – analysis of SNP calls

Phil Tedder – validation of de novo SNPs

Sean Humphray

Epameinondas Fritzilas

Wendy Wong

David Bentley

Elliott Margulies