ABGT 2016 Workshop Schneider

Post on 17-Jan-2017


Relating New Assemblies to the Human Genome Reference

Valerie Schneider, Ph.D.

NCBI

10 February 2016

http://genomereference.org

Twitter: @GenomeRef

grc-announce@ncbi.nlm.nih.gov

Overview

• Changes in reference assembly sequence sources
• Diversity
• Properties
• Evaluating new sequences for use (Assemblathon)
• Future of assembly curation and the reference assembly

GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb

GRCh38

Assembly Composition

WGS Assemblies Contributing to GRCh38

Assembly Name               Assembly Accession  Seq Method  Usage                             Length
RP11_1.0_unmatched_regions  GCA_000442295.1     454         Gaps, Correction                  754717 (0.02%)
CHM1_1.1                    GCF_000306695.2     Illumina    Gaps, Correction                  133662 (0.004%)
HsapALLPATHS1               GCA_000185165.1     Illumina    Gaps, Correction                  364303 (0.01%)
HuRef                       GCF_000002125.1     Sanger      Gaps, Correction, Alt Loci, CEN   4800690 (0.16%)
LinearCen1.1 (normalized)   GCA_000442335.2     Sanger      CEN                               59546786 (2.02%)
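As a rough cross-check of the table above, the WGS-derived lengths can be totalled. The sketch below copies the lengths and percentages from the slide; the variable names are illustrative, and the denominator behind the slide's percentages is not stated in the source, so the summed percentage is only approximate.

```python
# Lengths (bp) and percentages copied from the slide table.
# The names below are illustrative, not part of any NCBI or GRC tool.
wgs_contributions = {
    "RP11_1.0_unmatched_regions": (754717, 0.02),
    "CHM1_1.1": (133662, 0.004),
    "HsapALLPATHS1": (364303, 0.01),
    "HuRef": (4800690, 0.16),
    "LinearCen1.1 (normalized)": (59546786, 2.02),
}

# Total WGS-derived bases and the sum of the reported percentages.
total_bp = sum(length for length, _ in wgs_contributions.values())
total_pct = sum(pct for _, pct in wgs_contributions.values())

print(f"WGS-derived sequence: {total_bp:,} bp (~{total_pct:.2f}% as reported)")

# Share of the WGS-derived bases contributed by each assembly, largest first.
for name, (length, _) in sorted(
    wgs_contributions.items(), key=lambda item: item[1][0], reverse=True
):
    print(f"{name:28s} {length / total_bp:6.1%} of WGS-derived bases")
```

Most of the WGS-derived sequence comes from the centromere models (usage "CEN"), with gap closure and correction contributing comparatively little.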

Assembly Composition

WGS Gap Closure

Human assemblies available in the NCBI assembly database

http://www.ncbi.nlm.nih.gov/assembly

Assemblies in GenBank

Oct. 2014: 13 assemblies

Nov. 2015: 28 assemblies

Feb. 2016: 39 assemblies

Figure labels: YRI, CEU, CEU, CHB

Reference Assembly Basics

Reads:   Sanger  Sanger  Illumina  Illumina  PacBio (older)  PacBio (newer)
Method:  clone   WGS     WGS       WGS       WGS             WGS

N50: Measure of continuity. Contigs of this length or greater contain half of the bases in the assembly.
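N50 is conventionally computed over bases, not over the number of contigs: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly. A minimal sketch, where the function name `n50` and the toy contig lengths are illustrative only:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: one long contig dominates the assembly.
toy_contigs = [1, 1, 1, 1, 10]
print(n50(toy_contigs))
```

In the toy example most contigs have length 1, yet the single long contig holds more than half of the bases, so N50 and the median contig length differ.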

Assemblathon Analysis Overview

CHM1/CHM13 Assemblathon Goals
• Assess aspects of data generation (coverage, length)
• Assess assembler algorithms & parameters
• Platinum genome selection (MGI)
• More robust reference curation (GRC)
• Set expectations for these new resources
• Understand quality and limitations
• Plan for regions needing other resources
• Develop new pipelines and SOPs

                                              GRCh38            CHM13_FC         CHM13_CA1        CHM13_CA2        CHM13_CA3        CHM13_CA4
Accession                                     GCF_000001405.26  GCA_000983455.2  GCA_000983465.1  GCA_001015355.1  GCA_000983475.1  GCA_001015385.1
Total sequences                               50,304            50,304           50,304           50,304           50,304           50,304
No alignment                                  21 (0.04%)        88 (0.17%)       50 (0.10%)       49 (0.10%)       46 (0.09%)       50 (0.10%)
Multiple best alignments (split transcripts)  10 (0.02%)        40 (0.08%)       340 (0.68%)      316 (0.63%)      611 (1.22%)      395 (0.79%)
CDS coverage < 95%                            17 (0.04%)        256 (0.66%)      378 (0.97%)      326 (0.84%)      622 (1.60%)      392 (1.01%)
Dropped at consolidation (coding)             0                 167              259              278              240              250
Dropped at consolidation (non-coding)         0                 138              212              209              185              191
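The "No alignment" percentages above follow from dividing each count by the 50,304 total sequences. A small sketch of that arithmetic, with counts copied from the slide and illustrative names; the "CDS coverage < 95%" row appears to use a different denominator that the slide does not state, so it is left out here.

```python
TOTAL_SEQUENCES = 50_304  # "Total sequences" row, identical for every assembly

# "No alignment" counts copied from the slide.
no_alignment = {
    "GRCh38": 21,
    "CHM13_FC": 88,
    "CHM13_CA1": 50,
    "CHM13_CA2": 49,
    "CHM13_CA3": 46,
    "CHM13_CA4": 50,
}


def percent_of_total(count, total=TOTAL_SEQUENCES):
    """Express a transcript count as a percentage of all aligned queries."""
    return 100.0 * count / total


no_alignment_pct = {name: percent_of_total(n) for name, n in no_alignment.items()}

# Fold increase in unaligned transcripts relative to the reference assembly.
fold_vs_reference = {
    name: n / no_alignment["GRCh38"] for name, n in no_alignment.items()
}

for name in no_alignment:
    print(f"{name:10s} {no_alignment_pct[name]:.2f}%  "
          f"({fold_vs_reference[name]:.1f}x GRCh38)")
```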

Assemblathon RefSeq Alignment Stats: CHM13

Frameshifts  GRCh38            CHM13_FC         CHM13_CA1        CHM13_CA2        CHM13_CA3        CHM13_CA4
Accession    GCF_000001405.26  GCA_000983455.2  GCA_000983465.1  GCA_001015355.1  GCA_000983475.1  GCA_001015385.1
Proteins     19                218              346              503              627              439
Genes        12                161              232              317              366              281

Number of proteins  Assemblies in which frameshifted
953                 1
106                 2
50                  3
113                 4
115                 5
41                  6
2 (PKD1L2)          7
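The second table is a histogram: for each protein, how many assemblies show it as frameshifted. A short sketch that summarises it, using the counts from the slide; the variable names and the threshold of four assemblies are arbitrary choices made for illustration.

```python
# Number of proteins frameshifted in exactly k assemblies (k -> count),
# copied from the slide.
frameshift_histogram = {1: 953, 2: 106, 3: 50, 4: 113, 5: 115, 6: 41, 7: 2}

total_proteins = sum(frameshift_histogram.values())

# Proteins frameshifted in a single assembly only.
single_assembly = frameshift_histogram[1]

# Proteins frameshifted in four or more assemblies (illustrative threshold).
shared_four_plus = sum(
    count for k, count in frameshift_histogram.items() if k >= 4
)

print(f"Proteins frameshifted in at least one assembly: {total_proteins}")
print(f"Single-assembly frameshifts: {single_assembly} "
      f"({single_assembly / total_proteins:.1%})")
print(f"Frameshifted in >= 4 assemblies: {shared_four_plus}")
```

One possible reading is that frameshifts seen in a single assembly are more likely to be assembly-specific, while those shared across many assemblies deserve a closer look at the underlying sequence; the slide itself does not state this.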

Assemblathon RefSeq Alignment Stats: CHM13

Figure (graphic: Deanna Church): schematic of sequence in assembly 1 and sequence in assembly 2, with regions labelled A, B and B'.
• First Pass (FP) alignments: unique well aligned region in both assemblies
• Second Pass (SP) alignments
• SP only: expansion (Assembly 1)
• SP + FP: collapse (Assembly 2)
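The figure's labels amount to a small decision rule on which alignment passes cover a region. The sketch below encodes only those labels; it is not the NCBI alignment pipeline, and the function name and return strings are hypothetical.

```python
def classify_region(has_first_pass: bool, has_second_pass: bool) -> str:
    """Label a region by the alignment passes that cover it.

    Follows the figure's labels only:
      FP only  -> unique, well aligned in both assemblies
      SP only  -> possible expansion in this assembly
      SP + FP  -> possible collapse in this assembly
    """
    if has_first_pass and has_second_pass:
        return "possible collapse"
    if has_second_pass:
        return "possible expansion"
    if has_first_pass:
        return "unique, well aligned"
    return "unaligned"


# Toy regions: (name, covered by FP, covered by SP).
regions = [
    ("A", True, False),
    ("B", True, True),
    ("B'", False, True),
]
for name, fp, sp in regions:
    print(name, "->", classify_region(fp, sp))
```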

Assemblathon: Assembly-Assembly Alignments

Assembly   Average
CHM13_FC   2.36%
CHM13_CA1  2.38%
CHM13_CA2  2.41%
CHM13_CA3  2.03%
CHM13_CA4  2.13%
GRCh37     1.06%

ungapped

• Platinum and gold genomes expected to contribute to reference corrections and alternate loci
• Set standards for use of other WGS assemblies
• Gold and platinum assembly curation
• Tools for local re-assembly
• Assessing and communicating local assembly quality

Figure labels: GRCh38, CHM1, CHM13, NA19240, NA12878, NA19434, HG00733, HG00514

Future Curation

• Multiple human references
• Reference graphs
• Long-term curation

Future Curation

Figure labels: CHM1, CHM13, NA19240, NA12878, HG00733, HG00514, NA19434, GRCh38

GRCh38 Credits

GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes

GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs

Assemblathon Collaborators
• Jason Chin
• Adam Phillippy
• Sergey Koren
• Heng Li

GRC
• Tina Graves-Lindsay
• Karyn Meltz Steinberg
• Kerstin Howe
• Richard Durbin
• Paul Flicek
• Laura Clarke
• Deanna Church
• Curators!
• Developers!