agbt 2016 workshop lindsay
MGI Reference Genomes Workshop
Tina Graves-Lindsay, Feb 10, 2016
The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.
• GRCh38 is composed of DNA from several individual humans.
• Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.
Samples to be Sequenced
Sequencing Plan
Definitions of Genome Level
• Platinum Genome
  • Haploid genome source
  • Contiguous, haplotype-resolved representation of entire genome
  • BAC library available
• Gold Genome
  • Diploid genome source
  • Part of a trio
    • Parents will be sequenced to help haplotype resolve some regions
  • BAC libraries available
    • Targeted regions sequenced using these BAC libraries
  • Will contain some haplotype resolved regions
CHM1: A Key Resource for Improving the Reference
• CHM1 cell line established from a haploid hydatidiform mole (complete, paternal; 46XX) (U.Surti)
• CHORI-17 BAC library (P. deJong)
  • CHORI-17 BAC end sequences (n=325,659)
  • CHORI-17 multiple enzyme fingerprint map (1,560 fpc contigs)
  • CHORI-17 BACs
    • >750 have been sequenced
    • 664 of them in GenBank as phase 3 sequence
• CHM1 WGS assembly
  • Initial assembly produced from >100X coverage of Illumina data
  • Initial PacBio assembly produced using ~54X of P5 PacBio data
  • Latest PacBio assembly produced using ~60X of P6 PacBio data
CHM1 P5 vs P6 read length distributions
[Figure: axis labels Mapped Concordance (%) and Fraction of Mapped Bases]
% of Bases in Reads > 30,000 bases: 17.8 % and 0.05 %
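The "% of bases in reads > 30,000 bases" statistic can be sketched as below. This is a minimal illustration, assuming a plain list of read lengths as input; the function name and the toy lengths are hypothetical and are not CHM1 data.

```python
def fraction_of_bases_in_long_reads(read_lengths, min_length=30_000):
    """Fraction of all sequenced bases that sit in reads longer than min_length."""
    total = sum(read_lengths)
    if total == 0:
        return 0.0
    long_bases = sum(n for n in read_lengths if n > min_length)
    return long_bases / total


# Toy read lengths in bases (hypothetical values, for illustration only).
toy_reads = [5_000, 12_000, 31_000, 40_000, 12_000]
print(f"{100 * fraction_of_bases_in_long_reads(toy_reads):.1f} %")
```

Weighting by bases rather than by read count matters here: a few very long reads can hold a large share of the sequenced bases even when most reads are short.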
CHM1 Assembly Comparisons
                      CHM1_2014             CHM1_2015             CHM1_2015
                      P5 chemistry (54X)    P6 chemistry (61X)    P6 chemistry (61X)
                                            Jason Chin            Adam Phillippy
# Contigs             26,312                3,641                 4,849
Max Contig Size       44,873,077 bp         109,312,888 bp        99,566,047 bp
Total Assembly Size   3,239,081,299 bp      2,996,426,293 bp      2,939,630,703 bp
N50                   4,498,608 bp          26,899,841 bp         20,609,304 bp
N90                   30,687 bp             1,686,030 bp          1,188,604 bp
N95                   17,815 bp             149,494 bp            95,419 bp
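For readers unfamiliar with the N50, N90, and N95 rows, the sketch below shows the usual definition: the contig length at which the longest contigs, taken in descending order, first cover the given percentage of the total assembly size. The function name `nx` and the toy contig lengths are hypothetical; this is not the tool used to produce the numbers on the slide.

```python
def nx(contig_lengths, x):
    """Return Nx: the length L such that contigs of length >= L
    together cover at least x percent of the total assembly size."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= total * x:
            return length


# Toy contig lengths (hypothetical), total size 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
for x in (50, 90, 95):
    print(f"N{x} = {nx(toy_contigs, x)}")
```

Because N90 and N95 reach further into the tail of short contigs, they drop much faster than N50 when an assembly is fragmented, which is the pattern visible in the P5 column.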
Hybrid Scaffolds – PacBio and BioNano
                                                          Seq Assem                                        BN Hybrid
Assembly                                                  # of Contigs  Contig N50 (Mb)  Total Size (Gb)  # of Scaffolds  Scaff N50 (Mb)  Total Size (Gb)
CHM1 (P6) GCA_001297185, MGI CHM1 map (Jason’s version)   3641          26.9             2.99             161             47.6            2.84
CHM1 (P6) GCA_001307025, MGI CHM1 map (Adam’s version)    4850          20.6             2.94             221             40.04           2.82

[Figure labels: Hybrid Scaffold, PacBio Contigs, BioNano Contigs]
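As a small worked example, the contiguity gain from hybrid scaffolding can be expressed as the ratio of scaffold N50 to contig N50. The sketch below uses only the N50 values reported on this slide; the dictionary layout and the function name are assumptions made for illustration.

```python
# N50 values in Mb, copied from the hybrid scaffold table on this slide.
assemblies = {
    "GCA_001297185": {"contig_n50_mb": 26.9, "scaffold_n50_mb": 47.6},
    "GCA_001307025": {"contig_n50_mb": 20.6, "scaffold_n50_mb": 40.04},
}


def n50_fold_gain(stats):
    """Ratio of BioNano hybrid scaffold N50 to sequence contig N50."""
    return stats["scaffold_n50_mb"] / stats["contig_n50_mb"]


for accession, stats in assemblies.items():
    print(f"{accession}: {n50_fold_gain(stats):.2f}x")
```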
Using BioNano to Compare CHM1 Assemblies
                             CHM1 GCA_001297185    CHM1 GCA_001307025
                             (Jason’s version)     (Adam’s version)
Hybrid WGS Conflicts         45                    52
Hybrid BN Conflicts          51                    63
SV - Deletions               35                    25
SV - Insertions              32                    31
SV - Inversions              7                     12
SV - End                     126                   190
SV - Translocation_Interchr  332                   529
Assembly Assessment Methods
• Assemblies will be run through the NCBI QA pipeline
  • Assessed for contiguity, annotation, and concordance with the finished BACs
  • Assembly-to-assembly alignments will be generated between each PB assembly and GRCh38
• BioNano Genome Map
  • SV calls generated from comparing the BioNano data to each of the assemblies
  • Hybrid scaffolding conflicts will also point out potential assembly errors
• Alignment of the Illumina reads back to each of the assemblies
  • Heterozygous calls are likely indicative of a collapse in the assembly (for the haploid genomes)
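The last point can be illustrated with a rough sketch: in a haploid sample, windows where heterozygous calls cluster are candidates for collapsed sequence. The code below assumes the heterozygous calls have already been extracted as (contig, 1-based position) pairs; the function name, the 10 kb window, and the threshold of 5 calls are hypothetical choices, not parameters from the workshop.

```python
from collections import Counter


def flag_candidate_collapses(het_calls, window=10_000, min_calls=5):
    """Return (contig, start, end, n_calls) for fixed-size windows holding
    at least min_calls heterozygous calls. Coordinates are 1-based, inclusive."""
    counts = Counter((contig, (pos - 1) // window) for contig, pos in het_calls)
    flagged = []
    for (contig, index), n_calls in sorted(counts.items()):
        if n_calls >= min_calls:
            start = index * window + 1
            end = (index + 1) * window
            flagged.append((contig, start, end, n_calls))
    return flagged


# Toy calls (hypothetical): five clustered calls on ctg1, two isolated calls.
toy_calls = [("ctg1", p) for p in (10_100, 10_500, 12_000, 15_000, 19_999)]
toy_calls += [("ctg1", 55_000), ("ctg2", 300)]
print(flag_candidate_collapses(toy_calls))
```

Isolated heterozygous calls are expected from sequencing or alignment error, so only clusters are flagged; in practice the window size and threshold would need tuning against the finished BACs.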
1q21 Region – GRCh38 vs GCA_001297185
[Figure: tracks GRCh38, GCA_001297185, Seg Dup Track; scale bar 1 Megabase]
1q21 Region - GRCh38 vs GCA_001297185
[Figure: tracks GRCh38, GCA_001297185, Seg Dup Track; labels 99.9+% identity and 99.1% identity]
First Gold Genome - NA19240
Initial Assembly Stats
# Seq Contigs         3569
Max Contig Length     20,393,869 bp
Total Assembly Size   2,745,634,789 bp
N50                   6,003,115 bp
N90                   848,151 bp
N95                   345,457 bp
• NA19240 – Yoruban sample
• Generated >70X raw PacBio data
• Assembled on DNAnexus platform using Falcon pipeline
NA19240 BioNano Hybrid and SV Stats
                   Seq Assem                                        BN Hybrid
Assembly           # of Contigs  Contig N50 (Mb)  Total Size (Gb)  # of Scaffolds  Scaffold N50 (Mb)  Total Size (Gb)  Conflicts WGS  Conflicts BN
NA19240 DNAnexus   3569          6.01             2.75             421             14.78              2.74             49             60

                 Potential mis-assemblies  Breaks made
Conflicts        28                        22
Ends             13                        5
Insertions       5                         2
Translocations   74                        14
Alignment of NA19240 to BioNano map
[Figure: conflict identified by BioNano data]
Alignment to GRCh38
[Figure: tracks GRCh38, NA19240]
CCL Region of NA19240 Assembly
[Figure: tracks GRCh38, Genes, Seg Dup, PB Assembly; scale bar 1 Megabase]
CCL Region with BAC alignments
[Figure: tracks GRCh38, BAC Alignments, Seg Dup, PB Assembly; scale bar 100 Kb]
BACs Will Resolve These Regions!
[Figure: NA19240 BAC, NA19240 WGS]
Which Assembly is Best?
[Figure: HG00733 Puerto Rican Assembly Stats. Contig Length N50 (MB), axis range 5.80 to 7.80, against Total Assembly Size (GB), axis range 2.815 to 2.850]
• Use other sources to assess multiple assemblies
  • BioNano
  • Linked long reads
Genome Status
Data Source  Origin        Level of Coverage  Status
CHM1         NA            Platinum           Assembly Assessment
CHM13        NA            Platinum           Assembly Assessment
NA19240      Yoruban       Gold               Analysis Underway
HG00733      Puerto Rican  Gold               Assembly QC
HG00514      Han Chinese   Gold               Assembly QC
NA12878      European      Gold               Data Generation Underway
HG01352      Colombian     Gold               Not Started Yet
Next Steps
• Platinum Genomes
  • Select the best CHM1 and CHM13 assembly and then improve those further using BioNano and other tools
  • Incorporate the BACs into the assemblies
  • Create Chromosomal AGPs
• Gold Genomes
  • Finish analysis of the first Gold Genome
  • Data production is now complete on two other Gold genomes and assemblies for those are underway
  • Data production is underway on the 4th Gold genome
  • BACs are being sequenced for many of these genomes
Acknowledgements
The McDonnell Genome Institute at Washington University in St. Louis
• Rick Wilson
• Bob Fulton
• Wes Warren
• Karyn Meltz Steinberg
• Vince Magrini
• Sean McGrath
• Derek Albracht
• Milinn Kremitzki
• Susan Rock
• Debbie Scheer
• Chad Tomlinson
University of Washington
• Evan Eichler
NCBI
• Valerie Schneider
University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line)
• Urvashi Surti
10X Genomics
• Deanna Church
BioNano Genomics
• Palak Sheth
• Alex Hastie
Pacific Biosciences
• Jason Chin
• Nick Sisneros
UCSF
• Pui-Yan Kwok
• Yvonne Lai
• Chin Lin
• Catherine Chu
NHGRI
• Adam Phillippy
• Sergey Koren
Dovetail
• Todd Dickinson