agbt 2016 workshop lindsay

MGI Reference Genomes Workshop Tina Graves-Lindsay Feb 10, 2016


Page 1: agbt 2016 workshop lindsay

MGI Reference Genomes Workshop

Tina Graves-Lindsay

Feb 10, 2016

Page 2: agbt 2016 workshop lindsay

The Human Reference is a Work in Progress!

• The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.

• GRCh38 is composed of DNA from several individual humans.

• Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.

• New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.

• Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.

Page 3: agbt 2016 workshop lindsay

Samples to be Sequenced

Page 4: agbt 2016 workshop lindsay

Sequencing Plan

Page 5: agbt 2016 workshop lindsay

Definitions of Genome Level

• Platinum Genome
  • Haploid genome source
  • Contiguous, haplotype-resolved representation of entire genome
  • BAC library available

• Gold Genome
  • Diploid genome source
  • Part of a trio
    • Parents will be sequenced to help haplotype resolve some regions
  • BAC libraries available
    • Targeted regions sequenced using these BAC libraries
  • Will contain some haplotype resolved regions

Page 6: agbt 2016 workshop lindsay

CHM1: A Key Resource for Improving the Reference

• CHM1 cell line established from a haploid hydatidiform mole (complete, paternal; 46XX) (U. Surti)

• CHORI-17 BAC library (P. deJong)

• CHORI-17 BAC end sequences (n=325,659)

• CHORI-17 multiple enzyme fingerprint map (1,560 fpc contigs)

• CHORI-17 BACs
  • >750 have been sequenced
  • 664 of them in Genbank as phase 3 sequence

• CHM1 WGS assembly
  • Initial assembly produced from >100X coverage of Illumina data
  • Initial PacBio assembly produced using ~54X of P5 PacBio data
  • Latest PacBio assembly produced using ~60X of P6 PacBio data

Page 7: agbt 2016 workshop lindsay

CHM1 P5 vs P6 read length distributions

[Figure: plot with axes "Mapped Concordance (%)" and "Fraction of Mapped Bases"]

% of Bases in Reads > 30,000 bases: 17.8 % / 0.05 %

Page 8: agbt 2016 workshop lindsay

CHM1 Assembly Comparisons

                      CHM1_2014            CHM1_2015            CHM1_2015
                      P5 chemistry (54X)   P6 chemistry (61X)   P6 chemistry (61X)
                                           Jason Chin           Adam Phillippy

# Contigs             26,312               3,641                4,849
Max Contig Size       44,873,077 bp        109,312,888 bp       99,566,047 bp
Total Assembly Size   3,239,081,299 bp     2,996,426,293 bp     2,939,630,703 bp
N50                   4,498,608 bp         26,899,841 bp        20,609,304 bp
N90                   30,687 bp            1,686,030 bp         1,188,604 bp
N95                   17,815 bp            149,494 bp           95,419 bp
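The N50, N90, and N95 rows use the usual contiguity definition: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches 50, 90, or 95 percent of the total assembly size. A minimal Python sketch of that calculation follows. The function name `nx` and the toy contig lengths are illustrative assumptions, not values or code from the workshop.

```python
def nx(lengths, x):
    """Return the Nx value for a list of contig lengths.

    Nx is the length of the contig at which the running sum of
    lengths (sorted longest first) first reaches x percent of the
    total assembly size.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


# Hypothetical contig lengths (total = 300), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(nx(toy_contigs, 50), nx(toy_contigs, 90), nx(toy_contigs, 95))
# prints: 70 30 20
```

Because N90 and N95 depend on the short tail of the contig length distribution, they can differ between assemblies far more than N50 does, as they do in the table above.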

Page 9: agbt 2016 workshop lindsay

Hybrid Scaffolds – PacBio and BioNano

                                                          Seq Assem      Seq Assem         Seq Assem         BN Hybrid        BN Hybrid        BN Hybrid
                                                          # of Contigs   Contig N50 (Mb)   Total Size (Gb)   # of Scaffolds   Scaff N50 (Mb)   Total Size (Gb)

CHM1 (P6) GCA_001297185, MGI CHM1 map (Jason’s version)   3641           26.9              2.99              161              47.6             2.84
CHM1 (P6) GCA_001307025, MGI CHM1 map (Adam’s version)    4850           20.6              2.94              221              40.04            2.82

Page 10: agbt 2016 workshop lindsay

Hybrid Scaffold

PacBio Contigs

BioNano Contigs

Page 11: agbt 2016 workshop lindsay

Using BioNano to Compare CHM1 Assemblies

                              CHM1 GCA_001297185   CHM1 GCA_001307025
                              (Jason’s version)    (Adam’s version)

Hybrid WGS Conflicts          45                   52
Hybrid BN Conflicts           51                   63
SV - Deletions                35                   25
SV - Insertions               32                   31
SV - Inversions               7                    12
SV - End                      126                  190
SV - Translocation_Interchr   332                  529

Page 12: agbt 2016 workshop lindsay

Assembly Assessment Methods

• Assemblies will run through NCBI QA pipeline
  • Assessed for contiguity, annotation, and concordance with the finished BACs
  • Assembly-to-assembly alignments will be generated between each PB assembly and GRCh38

• BioNano Genome Map
  • SV calls generated from comparing the BioNano data to each of the assemblies
  • Hybrid scaffolding conflicts will also point out potential assembly errors

• Alignment of the Illumina reads back to each of the assemblies
  • Heterozygous calls are likely indicative of a collapse in the assembly (for the haploid genomes)
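The last point can be illustrated with a small sketch. In a haploid sample such as CHM1, reads aligned back to a correct assembly should produce essentially no heterozygous calls, so a cluster of them suggests that two or more similar copies were collapsed into one sequence. The Python below counts heterozygous calls in fixed windows and reports windows above a threshold. The function name, the input format, the window size, and the threshold are all assumptions chosen for illustration; none of them describe the pipeline used in the workshop.

```python
from collections import Counter


def flag_collapse_windows(het_calls, window=10_000, min_calls=5):
    """Return windows holding at least min_calls heterozygous calls.

    het_calls: iterable of (contig, position) pairs, positions 1-based.
    Returns a list of (contig, window_start, window_end, n_calls).
    window and min_calls are illustrative defaults, not tuned values.
    """
    # Bucket each call by contig and zero-based window index.
    counts = Counter((contig, (pos - 1) // window) for contig, pos in het_calls)
    flagged = []
    for (contig, idx), n in sorted(counts.items()):
        if n >= min_calls:
            start = idx * window + 1
            flagged.append((contig, start, start + window - 1, n))
    return flagged


# Hypothetical heterozygous call positions, for illustration only.
calls = [("ctg1", p) for p in (120, 450, 900, 2300, 8800)]
calls += [("ctg1", 55_000), ("ctg2", 10)]
print(flag_collapse_windows(calls))
# prints: [('ctg1', 1, 10000, 5)]
```

Isolated calls, like the single sites on ctg1 at 55,000 and on ctg2, stay below the threshold and are more plausibly sequencing or alignment noise than a collapsed region.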

Page 13: agbt 2016 workshop lindsay

1q21 Region – GRCh38 vs GCA_001297185

1 Megabase

GRCh38

GCA_001297185

Seg Dup Track

Page 14: agbt 2016 workshop lindsay

1q21 Region - GRCh38 vs GCA_001297185

GRCh38

GCA_001297185

Seg Dup Track

99.9+% identity

99.1% identity

Page 15: agbt 2016 workshop lindsay

First Gold Genome - NA19240

Initial Assembly Stats

# Seq Contigs         3569
Max Contig Length     20,393,869 bp
Total Assembly Size   2,745,634,789 bp
N50                   6,003,115 bp
N90                   848,151 bp
N95                   345,457 bp

• NA19240 – Yoruban sample

• Generated >70X raw PacBio data

• Assembled on DNAnexus platform using Falcon pipeline

Page 16: agbt 2016 workshop lindsay

NA19240 BioNano Hybrid and SV Stats

                   Seq Assem      Seq Assem         Seq Assem         BN Hybrid        BN Hybrid           BN Hybrid         BN Hybrid       BN Hybrid
                   # of Contigs   Contig N50 (Mb)   Total Size (Gb)   # of Scaffolds   Scaffold N50 (Mb)   Total Size (Gb)   Conflicts WGS   Conflicts BN

NA19240 DNAnexus   3569           6.01              2.75              421              14.78               2.74              49              60

                 Potential mis-assemblies   Breaks made

Conflicts        28                         22
Ends             13                         5
Insertions       5                          2
Translocations   74                         14

Page 17: agbt 2016 workshop lindsay

Alignment of NA19240 to BioNano map

Conflict identified by BioNano data

Page 18: agbt 2016 workshop lindsay

Alignment to GRCh38

GRCh38

NA19240

Page 19: agbt 2016 workshop lindsay

CCL Region of NA19240 Assembly

GRCh38

Genes

Seg Dup

PB Assembly

1 Megabase

Page 20: agbt 2016 workshop lindsay

CCL Region with BAC alignments

GRCh38

BAC Alignments

Seg Dup

PB Assembly

100 Kb

Page 21: agbt 2016 workshop lindsay

BACs Will Resolve These Regions!

NA19240 BAC

NA19240 WGS

Page 22: agbt 2016 workshop lindsay

Which Assembly is Best?

[Figure: HG00733 Puerto Rican Assembly Stats. Contig Length N50 (Mb), ticks from 5.80 to 7.80, plotted against Total Assembly Size (Gb), ticks from 2.815 to 2.850]

• Use other sources to assess multiple assemblies
  • BioNano
  • Linked long reads

Page 23: agbt 2016 workshop lindsay

Genome Status

Data Source   Origin         Level of Coverage   Status

CHM1          NA             Platinum            Assembly Assessment
CHM13         NA             Platinum            Assembly Assessment
NA19240       Yoruban        Gold                Analysis Underway
HG00733       Puerto Rican   Gold                Assembly QC
HG00514       Han Chinese    Gold                Assembly QC
NA12878       European       Gold                Data Generation Underway
HG01352       Colombian      Gold                Not Started Yet

Page 24: agbt 2016 workshop lindsay

Next Steps

• Platinum Genomes
  • Select the best CHM1 and CHM13 assembly and then improve those further using BioNano and other tools
  • Incorporate the BACs into the assemblies
  • Create Chromosomal AGPs

• Gold Genomes
  • Finish analysis of the first Gold Genome
  • Data production is now complete on two other Gold genomes and assemblies for those are underway
  • Data production is underway on the 4th Gold genome
  • BACs are being sequenced for many of these genomes

Page 25: agbt 2016 workshop lindsay

Acknowledgements

The McDonnell Genome Institute at Washington University in St. Louis
Rick Wilson, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Chad Tomlinson

University of Washington
Evan Eichler

NCBI
Valerie Schneider

University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line)
Urvashi Surti

10X Genomics
Deanna Church

BioNano Genomics
Palak Sheth, Alex Hastie

Pacific Biosciences
Jason Chin, Nick Sisneros

UCSF
Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu

NHGRI
Adam Phillippy, Sergey Koren

Dovetail
Todd Dickinson