Posted on 17-Jan-2017
Relating New Assemblies to the Human Genome Reference
Valerie Schneider, Ph.D.
NCBI
10 February 2016
http://genomereference.org
Twitter: @GenomeRef
grc-announce@ncbi.nlm.nih.gov
Overview
• Changes in reference assembly sequence sources
• Diversity
• Properties
• Evaluating new sequences for use (Assemblathon)
• Future of assembly curation and the reference assembly
GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb
GRCh38
Assembly Composition
WGS Assemblies Contributing to GRCh38
Assembly Name | Assembly Accession | Seq Method | Usage | Length
RP11_1.0_unmatched_regions | GCA_000442295.1 | 454 | Gaps, Correction | 754717 (0.02%)
CHM1_1.1 | GCF_000306695.2 | Illumina | Gaps, Correction | 133662 (0.004%)
HsapALLPATHS1 | GCA_000185165.1 | Illumina | Gaps, Correction | 364303 (0.01%)
HuRef | GCF_000002125.1 | Sanger | Gaps, Correction, Alt Loci, CEN | 4800690 (0.16%)
LinearCen1.1 (normalized) | GCA_000442335.2 | Sanger | CEN | 59546786 (2.02%)
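The per-assembly contributions in the table can be totaled with a short script. This is only an arithmetic sketch: the lengths and percentages are copied from the table, while the variable names are made up for illustration, and summing the rounded percentages gives an approximate figure.

```python
# Lengths (bp) and reported percentages, copied from the table above.
# Variable names are illustrative, not part of any NCBI or GRC tool.
wgs_contributions = {
    "RP11_1.0_unmatched_regions": (754717, 0.02),
    "CHM1_1.1": (133662, 0.004),
    "HsapALLPATHS1": (364303, 0.01),
    "HuRef": (4800690, 0.16),
    "LinearCen1.1 (normalized)": (59546786, 2.02),
}

# Total WGS-derived sequence in GRCh38, in bases and as a summed percentage.
total_bases = sum(length for length, _ in wgs_contributions.values())
total_pct = sum(pct for _, pct in wgs_contributions.values())

# The centromere models (LinearCen1.1) dominate; report the remainder separately.
non_cen_bases = total_bases - wgs_contributions["LinearCen1.1 (normalized)"][0]

print(f"Total WGS contribution: {total_bases} bp (~{total_pct:.3f}%)")
print(f"Excluding LinearCen1.1: {non_cen_bases} bp")
```

Most of the WGS-derived sequence comes from the centromere models; the remaining assemblies together contribute about 6 Mb.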
Assembly Composition
WGS Gap Closure
Human assemblies available in the NCBI assembly database
http://www.ncbi.nlm.nih.gov/assembly
Assemblies in GenBank
Oct. 2014: 13 assemblies
Nov. 2015: 28 assemblies
Feb. 2016: 39 assemblies
[Figure labels: YRI, CEU, CEU, CHB]
Reference Assembly Basics
Reads: Sanger | Sanger | Illumina | Illumina | PacBio (older) | PacBio (newer)
Method: clone | WGS | WGS | WGS | WGS | WGS
N50: Measure of continuity. Contigs of this length or greater contain at least half of the total assembly sequence.
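As an illustration of the N50 measure, here is a minimal sketch using the standard definition: the contig length at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (standard definition)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # The first contig that brings the running sum to half of the total is the N50.
        if 2 * running >= total:
            return length


# Toy example with made-up contig lengths (not real assembly data).
print(n50([10, 20, 30, 40]))
```

Note that N50 is weighted by sequence length, so it differs from the median contig length: in the toy example the median is 25, while the N50 is 30.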
Assemblathon Analysis Overview
CHM1/CHM13 Assemblathon Goals
• Assess aspects of data generation (coverage, length)
• Assess assembler algorithms & parameters
• Platinum genome selection (MGI)
• More robust reference curation (GRC)
• Set expectations for these new resources
• Understand quality and limitations
• Plan for regions needing other resources
• Develop new pipelines and SOPs
Assemblathon RefSeq Alignment Stats: CHM13
Metric | GRCh38 | CHM13_FC | CHM13_CA1 | CHM13_CA2 | CHM13_CA3 | CHM13_CA4
Accession | GCF_000001405.26 | GCA_000983455.2 | GCA_000983465.1 | GCA_001015355.1 | GCA_000983475.1 | GCA_001015385.1
Total sequences | 50,304 | 50,304 | 50,304 | 50,304 | 50,304 | 50,304
No Alignment | 21 (0.04%) | 88 (0.17%) | 50 (0.10%) | 49 (0.10%) | 46 (0.09%) | 50 (0.10%)
Multiple best alignments (split transcripts) | 10 (0.02%) | 40 (0.08%) | 340 (0.68%) | 316 (0.63%) | 611 (1.22%) | 395 (0.79%)
CDS coverage < 95% | 17 (0.04%) | 256 (0.66%) | 378 (0.97%) | 326 (0.84%) | 622 (1.60%) | 392 (1.01%)
Dropped at consolidation (coding) | 0 | 167 | 259 | 278 | 240 | 250
Dropped at consolidation (non-coding) | 0 | 138 | 212 | 209 | 185 | 191
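The percentages in the "No Alignment" row can be reproduced from the raw counts, assuming the denominator is the 50,304 total sequences listed for every assembly. The sketch below covers only that row; the helper name `percent` is illustrative, and the other rows may use different denominators (for example, coding transcripts only).

```python
# Counts copied from the "No Alignment" row of the table above.
TOTAL_SEQUENCES = 50304
no_alignment = {
    "GRCh38": 21,
    "CHM13_FC": 88,
    "CHM13_CA1": 50,
    "CHM13_CA2": 49,
    "CHM13_CA3": 46,
    "CHM13_CA4": 50,
}


def percent(count, total=TOTAL_SEQUENCES):
    """Express a count as a percentage of the total number of RefSeq sequences."""
    return 100.0 * count / total


for assembly, count in no_alignment.items():
    print(f"{assembly}: {count} ({percent(count):.2f}%)")
```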
Assemblathon RefSeq Alignment Stats: CHM13
Frameshifts | GRCh38 | CHM13_FC | CHM13_CA1 | CHM13_CA2 | CHM13_CA3 | CHM13_CA4
Accession | GCF_000001405.26 | GCA_000983455.2 | GCA_000983465.1 | GCA_001015355.1 | GCA_000983475.1 | GCA_001015385.1
proteins | 19 | 218 | 346 | 503 | 627 | 439
genes | 12 | 161 | 232 | 317 | 366 | 281

Number proteins | Assemblies in which frameshifted
953 | 1
106 | 2
50 | 3
113 | 4
115 | 5
41 | 6
2 (PKD1L2) | 7
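The distribution of frameshifted proteins across assemblies can be summarized with a little arithmetic. The counts below are copied from the slide; the variable names are illustrative. Reading single-assembly frameshifts as more likely to be assembly-specific is an assumption made here, not a claim from the slides.

```python
# Number of proteins frameshifted in exactly k assemblies (k -> protein count),
# copied from the table above.
proteins_by_assembly_count = {1: 953, 2: 106, 3: 50, 4: 113, 5: 115, 6: 41, 7: 2}

# Each protein appears in exactly one row, so the sum is the number of distinct proteins.
distinct_proteins = sum(proteins_by_assembly_count.values())

# Share of frameshifted proteins that are affected in only one assembly.
single_assembly_share = 100.0 * proteins_by_assembly_count[1] / distinct_proteins

print(f"Distinct frameshifted proteins: {distinct_proteins}")
print(f"Frameshifted in a single assembly: {single_assembly_share:.1f}%")
```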
Assemblathon: Assembly-Assembly Alignments
[Figure (graphic: Deanna Church): a sequence in assembly 1 aligned to a sequence in assembly 2, with regions labeled A, B, and B’.]
Figure labels:
• First Pass (FP) alignments
• Second Pass (SP) alignments
• Unique well aligned region in both assemblies.
• SP only: Expansion, Assembly 1
• SP + FP: Collapse, Assembly 2
Assembly Average
CHM13_FC 2.36%
CHM13_CA1 2.38%
CHM13_CA2 2.41%
CHM13_CA3 2.03%
CHM13_CA4 2.13%
GRCh37 1.06%
ungapped
• Platinum and gold genomes expected to contribute to reference corrections and alternate loci
• Set standards for use of other WGS assemblies
• Gold and platinum assembly curation
• Tools for local re-assembly
• Assessing and communicating local assembly quality
[Figure labels: GRCh38, CHM1, CHM13, NA19240, NA12878, NA19434, HG00733, HG00514]
Future Curation
• Multiple human references
• Reference graphs
• Long-term curation
Future Curation
[Figure labels: CHM1, CHM13, NA19240, NA12878, HG00733, HG00514, NA19434, GRCh38]
GRCh38 Credits
GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
Assemblathon Collaborators
• Jason Chin
• Adam Phillippy
• Sergey Koren
• Heng Li
GRC
• Tina Graves-Lindsay
• Karyn Meltz Steinberg
• Kerstin Howe
• Richard Durbin
• Paul Flicek
• Laura Clarke
• Deanna Church
• Curators!
• Developers!