The importance of high quality reference genome assemblies to personal and medical genomics
Karyn Meltz Steinberg Genome Informatics 2015
@KMS_Meltzy
Last year…
[Figure 1 (Steinberg et al, 2014): Contig Number and Contig N50 for the CHM1_1.1, HuRef, ALLPATHS, and YH_2.0 assemblies; axis from 0 to 400,000]
This year…
[Figure: Contig Number and Contig N50 for the CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS, and YH_2.0 assemblies; linear axis from 0 to 30,000,000]
This year… (log scale)
[Figure: Contig Number and Contig N50 for the CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS, and YH_2.0 assemblies; log-scale axis from 1 to 100,000,000]
How do we define platinum and gold standards?
Metric                                          GRCh38     Platinum (CHM1)   Gold (NA19240)
% Reference genome covered                      100        98.40             90.80
% Assigned chromosomes                          99.60      98.40             90.80
% gene models covered (>95% id, >90% length)    99.96      98.78             94.26
Contig N50                                      67.8 Mb    26.9 Mb           6.0 Mb
Number of gaps                                  875        3,640             3,568
Total assembled size                            3.067 Gb   2.996 Gb          2.745 Gb
% haplotype blocks (>1kb) resolved              NA         >95               >80
http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
CHM13 Draft Assembly (GCA_000983455.1)
• 60X PacBio (P5 and P6 chemistry)
• Average read length ~11kb
• Daligner/Falcon v 0.2
Total sequence length 2,851,367,788
Number of contigs 2,873
Contig N50 12,981,785
Contig L50 68
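For readers less familiar with the two contiguity statistics quoted above, here is a minimal sketch of how contig N50 and L50 are conventionally computed from a list of contig lengths. The function name `n50_l50` and the toy lengths are illustrative assumptions, not part of the talk; the CHM13 values on the slide come from the assembly record itself, not from this code.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total of
    lengths (sorted longest first) reaches half of the assembly size;
    L50 is the number of contigs needed to get there.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


# Toy example (hypothetical lengths, not real contigs).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50, toy_l50 = n50_l50(toy_lengths)
print(toy_n50, toy_l50)
```

Read this way, the slide's figures say that the 68 longest CHM13 contigs, each at least 12,981,785 bases long, together account for at least half of the 2,851,367,788 assembled bases.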
CHM13 Hybrid Scaffolds Improve Contiguity
                        BioNano Map   PacBio Assembly   Hybrid Scaffold
# of Contigs            3593          1590*             254
Min Contig Length       0.08 Mb       0                 0.27 Mb
Median Contig Length    0.61 Mb       0.06 Mb           4.35 Mb
Mean Contig Length      0.78 Mb       1.78 Mb           9.68 Mb
Contig N50              1.02 Mb       12.98 Mb          20.79 Mb
Max Contig Length       5.27 Mb       63.15 Mb          82.83 Mb
Total Contig Length     2812 Mb       2824 Mb           2457.75 Mb
*Number of contigs used in hybrid scaffolding
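As a rough sanity check on the table above, the sketch below recomputes two things from the reported values: whether contig count times mean length roughly reproduces each total length, and the fold change in contig N50 from the PacBio assembly to the hybrid scaffold. The dictionary layout and the name `relative_gap` are assumptions made for illustration; the numbers are copied from the slide, and small mismatches are expected because the means are rounded.

```python
# Values copied from the slide:
# (contig count, mean length in Mb, total length in Mb, N50 in Mb)
table = {
    "BioNano Map": (3593, 0.78, 2812.0, 1.02),
    "PacBio Assembly": (1590, 1.78, 2824.0, 12.98),
    "Hybrid Scaffold": (254, 9.68, 2457.75, 20.79),
}


def relative_gap(count, mean_mb, total_mb):
    """Relative difference between count * mean and the reported total."""
    return abs(count * mean_mb - total_mb) / total_mb


gaps = {name: relative_gap(c, m, t) for name, (c, m, t, _) in table.items()}
for name, gap in gaps.items():
    print(f"{name}: count x mean differs from total by {gap:.2%}")

# Fold change in contig N50, hybrid scaffold versus PacBio assembly.
n50_fold = table["Hybrid Scaffold"][3] / table["PacBio Assembly"][3]
print(f"N50 fold change: {n50_fold:.2f}")
```

On these numbers the hybrid scaffold N50 is about 1.6 times the PacBio contig N50, and each column's count and mean agree with its total to within one percent.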
BioNano can be used to size gaps and identify structural variants
[Figure: PacBio Assembly aligned to the BioNano Map, with regions labelled "Collapse", "Expansion in Assembly", and "Gap in Sequence"]
SV types (BioNano alignment to CHM13):
• Deletions: 41
• Inversions: 10
• Insertions: 15
• Total: 66
BioNano reveals collapse in PacBio assembly due to highly homologous segmental duplications
SD (segmental duplication) = 96%
CHR1 46746040 46857004 40 W LBHZ01000938.1 110965
CHR1 46857005 47034202 41 N 177198 gap
CHR1 47034203 52157695 42 W LBHZ01000245.1 5123493
[Figure: PacBio Assembly aligned to the BioNano Map]
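The three rows shown above look like abbreviated AGP-style records (object, start, end, part number, type, then either a component ID or a gap length). The sketch below, under that assumption, parses them and checks that each row's stated length matches its coordinates and that the rows tile the chromosome without overlap. The field interpretation and the helper names `parse_row`, `span_matches`, and `is_contiguous` are assumptions for illustration; a full AGP file has more columns than the slide shows.

```python
# Rows copied from the slide; assumed layout:
#   W row: object start end part W component_id length
#   N row: object start end part N gap_length gap
ROWS = """\
CHR1 46746040 46857004 40 W LBHZ01000938.1 110965
CHR1 46857005 47034202 41 N 177198 gap
CHR1 47034203 52157695 42 W LBHZ01000245.1 5123493
"""


def parse_row(line):
    """Parse one abbreviated row into a dict (assumed field order)."""
    f = line.split()
    is_gap = f[4] == "N"
    return {
        "object": f[0],
        "start": int(f[1]),
        "end": int(f[2]),
        "part": int(f[3]),
        "type": f[4],
        "component": None if is_gap else f[5],
        "length": int(f[5]) if is_gap else int(f[6]),
    }


def span_matches(row):
    """True if end - start + 1 equals the stated length (1-based, inclusive)."""
    return row["end"] - row["start"] + 1 == row["length"]


def is_contiguous(rows):
    """True if each row starts one base after the previous row ends."""
    return all(b["start"] == a["end"] + 1 for a, b in zip(rows, rows[1:]))


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
gap_rows = [r for r in rows if r["type"] == "N"]
print([span_matches(r) for r in rows], is_contiguous(rows))
print("gap length:", gap_rows[0]["length"])
```

Under this reading, the unresolved region between contigs LBHZ01000938.1 and LBHZ01000245.1 is represented as a 177,198-base gap, which is the interval the BioNano map is used to size.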
This region is rich in medically relevant genes
[Figure: chr1 (p33) ideogram with SegDups, Genes, and CHM13 PacBio contig tracks]
Genes shown: CYP4Z2P, CYP4A11, CYP4X1, CYP4Z1, CYP4A22
CHM13 PacBio contigs shown: LBHZ01000938.1, LBHZ01000245.1
This locus has an assigned GRC issue due to unresolved variation and may be a candidate locus for alternative representation in the reference
Reference based Analyses
• 100X Illumina sequence from CHM13
• Align to GRCh37 and GRCh38 with BWA-MEM
• Variant calling via SpeedSeq (Chiang et al, 2015)
  • SNVs, indels: FreeBayes
  • SVs: LUMPY, SVTyper
  • CNV: CNVnator
tl;dpa*
• The reference genome assembly is constantly being improved
• New PacBio-based assemblies are orders of magnitude more contiguous than previous WGS assemblies
• Integration of other data (e.g. BioNano, Dovetail) can improve contiguity even further and be used to identify structurally variant haplotypes that can be added to reference as alternative loci
• Platinum genome sequences integrated into GRCh38 have greatly improved read mapping and variant calling
*too long; didn’t pay attention
Acknowledgements
The McDonnell Genome Institute at Washington University in St. Louis: Rick Wilson, Bob Fulton, Wes Warren, Tina Graves-Lindsay, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Aye Wollam
The Finishing and Bioinformatics Teams at The Genome Institute
University of Washington: Evan Eichler, John Huddleston, Archana Raja
NCBI: Valerie Schneider
University of Pittsburgh School of Medicine (CHM13 cell line): Urvashi Surti
Personalis: Deanna Church
BioNano Genomics: Palak Sheth
Pacific Biosciences: Jason Chin, Nick Sisneros