Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly
Valerie Schneider, NCBI
26 October 2016
http://www.biorxiv.org/content/early/2016/08/30/072116
Dilthey et al.; Paten et al.
Scientific Models
Outline
• Distinguishing features of the human reference assembly
• Implications for genomic analyses and tools
• Where do you get assembly-relevant data?
Assembly Basics
Sanger-sequenced, clone-based assembly
[Figure: BAC insert in BAC vector; shotgun sequence clone; assemble (gaps); finish]
Minimal Tiling Path
• Define switch points for adjacent components (haploid mosaic)
• Most contiguous
• Highest sequence quality
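The switch-point idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Component` class, the clone names and the toy sequences are hypothetical, and real tiling paths are defined by the assembly's own component records rather than by this structure.

```python
from dataclasses import dataclass


@dataclass
class Component:
    accession: str  # hypothetical clone name, not a real accession
    sequence: str   # finished sequence of the clone
    start: int      # 0-based first base this clone contributes to the path
    end: int        # exclusive end; the switch point to the next clone


def build_mosaic(tiling_path):
    """Concatenate the contributed slice of each component, in path order."""
    return "".join(c.sequence[c.start:c.end] for c in tiling_path)


# Two toy clones that overlap by four bases ("GTAA").
# The switch point is placed inside the overlap, so each base is used once.
path = [
    Component("CLONE_A", "ACGTACGTAA", 0, 8),
    Component("CLONE_B", "GTAACCGGTT", 2, 10),
]
mosaic = build_mosaic(path)
print(mosaic)
```

Because each clone may come from a different haplotype, the joined sequence is a mosaic: it need not match any single individual's haplotype across a switch point.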
Assembly Basics
Today’s reference assembly does not represent:
1. The most common allele
2. The longest allele
3. The ancestral allele
It represents the sequence available from the Human Genome Project (HGP).
GRC Assembly Model
[Figure: sequences from haplotype 1 and sequences from haplotype 2]
Old assembly model: compress into a consensus
Current assembly model: represent both/many haplotypes
[Figure: structure of an assembly (e.g. GRCh38). Church et al., PLoS Biol. 2011 Jul;9(7):e1001091]
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• PAR
• Genomic regions (MHC, UGT2B17, MAPT)
• ALT 1 to ALT 7
GRC Assembly Model
The alignments of the alternate loci scaffolds to the chromosomes are an integral part of the assembly and can be downloaded from GenBank with the assembly sequences
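As a rough illustration of working with such alignments, the sketch below parses one GFF3 alignment line. It assumes the alignments are in GFF3 with a `Target` attribute, as the GFF3 specification defines for alignment features. The function name, the example line and its identifiers (`CHR_EXAMPLE`, `ALT_EXAMPLE`) are made up for this example and are not taken from any real file.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_alignment(line):
    """Parse one GFF3 feature line into a dict (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    record = dict(zip(GFF3_COLUMNS, cols))
    record["start"] = int(record["start"])  # GFF3 coordinates are 1-based
    record["end"] = int(record["end"])      # and inclusive
    attributes = {}
    for field in record["attributes"].split(";"):
        if field:
            key, _, value = field.partition("=")
            attributes[key] = unquote(value)  # GFF3 uses URL-style escapes
    record["attributes"] = attributes
    # Target has the form "target_id start end [strand]" in the GFF3 spec.
    target = attributes.get("Target")
    if target is not None:
        parts = target.split()
        record["target"] = {
            "id": parts[0],
            "start": int(parts[1]),
            "end": int(parts[2]),
            "strand": parts[3] if len(parts) > 3 else None,
        }
    return record


# Invented example line: a chromosome span aligned to an alt scaffold.
example = ("CHR_EXAMPLE\tsketch\tmatch\t1000\t2000\t.\t+\t.\t"
           "ID=aln0;Target=ALT_EXAMPLE 1 1001 +")
aln = parse_gff3_alignment(example)
print(aln["seqid"], aln["start"], aln["end"], aln["target"])
```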
[Figure: structure of a patched assembly (e.g. GRCh38.p1). Church et al., PLoS Biol. 2011 Jul;9(7):e1001091]
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• ALT 1 to ALT 7
• PAR
• Genomic regions (MHC, UGT2B17, MAPT)
• Patches
• Genomic regions (ABO, FOXO6, FCGBP)
GRC Assembly Model
Patches
• FIX: scaffold status at next major assembly release: none (integrated into the chromosome). Treat as: preferred.
• NOVEL: scaffold status at next major assembly release: alt locus. Treat as: allelic.
[Figure: 1q32, 1q21, 1p21 (Dennis et al., 2012)]
GRC: Assembly Model
GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 alt loci: 3.6 Mb novel sequence relative to chromosomes
GRCh38.p9
• 96 patches: >1 Mb novel sequence
• 48 FIX
• 48 NOVEL
GRCh38: Alt Loci
[Figure: reads aligned to a chromosome and to an alt/patch scaffold, showing the on-target alignment and off-target alignments (n=122,922). Alignment legend: no alignment, mismatch, deletion]
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
Anatomy of an alt
[Figure: alt locus components AC012314.8 and CU151838.1; chr. 19 components AC012314.8 and AC245052.3]
Due to anchor components, alternate loci contain some sequence that is redundant to the primary assembly unit
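The redundancy can be made concrete with a small sketch that separates anchor components from alt-specific ones. The assignment of accessions to the alt scaffold and to chromosome 19 is an assumption read off the figure labels above, and `split_components` is a hypothetical helper, not part of any GRC tool.

```python
# Assumed component lists, taken from the figure labels on this slide.
alt_components = ["AC012314.8", "CU151838.1"]
chr19_components = ["AC012314.8", "AC245052.3"]


def split_components(alt, primary):
    """Return (anchors, unique): alt components shared with the primary
    assembly unit, and alt components found only on the alt scaffold."""
    primary_set = set(primary)
    anchors = [c for c in alt if c in primary_set]
    unique = [c for c in alt if c not in primary_set]
    return anchors, unique


anchors, unique = split_components(alt_components, chr19_components)
print("anchor (redundant with primary):", anchors)
print("alt-specific:", unique)
```

Sequence from anchor components is what makes reads align equally well to the chromosome and to the alt scaffold, which is the source of the ambiguous alignments discussed later in the talk.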
GRCh38 Model Centromeres
Karen Miga (Kent Lab, UCSC)
[Figure: GRCh38 centromeres; labels: WGS, WGS, WGS]
Miga et al., Genome Res. 2014 Apr;24(4):697-707
GRCh38: Where’s the data?
• GRCh38 sequences for alignment pipelines
• Assembly sequence and statistics reports
• Assembly regions report: alts, patches and centromeres
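A regions report of this kind can be summarised with a few lines of Python. The sketch below assumes a tab-delimited file whose last `#` comment line holds the column names. The column names, role values (`alt-scaffold`, `fix-patch`, `novel-patch`) and every row in `SAMPLE_ROWS` are assumptions or invented placeholders, so check them against the actual report before relying on them.

```python
from collections import Counter

# Assumed column names; verify against the real report header.
HEADER = ["Region-Name", "Chromosome", "Chromosome-Start", "Chromosome-Stop",
          "Scaffold-Role", "Scaffold-GenBank-Accn", "Scaffold-RefSeq-Accn",
          "Assembly-Unit"]

# Invented rows for illustration only.
SAMPLE_ROWS = [
    ["REGION_A", "1", "1000", "5000", "alt-scaffold", "XX000001.1", "na", "ALT_UNIT_1"],
    ["REGION_B", "2", "200", "900", "fix-patch", "XX000002.1", "na", "PATCHES"],
    ["REGION_C", "3", "50", "700", "novel-patch", "XX000003.1", "na", "PATCHES"],
    ["REGION_A", "1", "1000", "5000", "alt-scaffold", "XX000004.1", "na", "ALT_UNIT_2"],
]
SAMPLE = ("# " + "\t".join(HEADER) + "\n"
          + "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n")


def read_regions(text):
    """Parse report text into a list of dicts keyed by column name.

    The most recent '#' line seen before the data is taken as the header."""
    header, rows = None, []
    for line in text.splitlines():
        if line.startswith("#"):
            header = line.lstrip("#").strip().split("\t")
        elif line.strip():
            if header is None:
                raise ValueError("no header line found before data rows")
            rows.append(dict(zip(header, line.split("\t"))))
    return rows


rows = read_regions(SAMPLE)
role_counts = Counter(row["Scaffold-Role"] for row in rows)
scaffolds_per_region = Counter(row["Region-Name"] for row in rows)
print(role_counts)
print(scaffolds_per_region)
```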
Accessing the Data
https://genomereference.org
• Dumped daily
• Frozen mappings to prior assembly versions in GFF3
• Mapped to latest GRCh38 and GRCh37.p13
GRC Credits
https://genomereference.org
GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
Alt Loci: Informatics Challenges
Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly.
• Mask1: mask the chromosome for fix patches, and the scaffold for novel patches and alts
• Mask2: mask only on scaffolds
[Figure: simulated reads]
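The two masking schemes can be sketched as a choice of which sequence to mask for each scaffold placement. This is a minimal illustration, not the pipeline used for the slide: the `Placement` class, the sequence names and the coordinates are hypothetical, and masking exactly the aligned span is an assumption, since the slide states only which sequence is masked.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    scaffold: str         # hypothetical scaffold name
    kind: str             # "fix", "novel" or "alt"
    chrom: str            # chromosome the scaffold is placed on
    chrom_span: tuple     # (start, end), 0-based half-open, on the chromosome
    scaffold_span: tuple  # (start, end), 0-based half-open, on the scaffold


def mask_intervals(placements, scheme):
    """Return (sequence_name, start, end) intervals to mask.

    mask1: chromosome for fix patches, scaffold for novel patches and alts.
    mask2: scaffold only, for every placement."""
    if scheme not in ("mask1", "mask2"):
        raise ValueError(f"unknown scheme: {scheme}")
    masks = []
    for p in placements:
        if scheme == "mask1" and p.kind == "fix":
            masks.append((p.chrom, *p.chrom_span))
        else:
            masks.append((p.scaffold, *p.scaffold_span))
    return masks


def apply_mask(sequence, start, end):
    """Replace sequence[start:end] with N so aligners skip that span."""
    return sequence[:start] + "N" * (end - start) + sequence[end:]


placements = [
    Placement("FIX_1", "fix", "chrA", (100, 200), (0, 110)),
    Placement("NOVEL_1", "novel", "chrA", (300, 380), (0, 80)),
    Placement("ALT_1", "alt", "chrB", (10, 60), (0, 50)),
]
mask1 = mask_intervals(placements, "mask1")
mask2 = mask_intervals(placements, "mask2")
print(mask1)
print(mask2)
print(apply_mask("ACGTACGT", 2, 5))
```

Masking the chromosome under a fix patch reflects the "treat as preferred" rule for fix patches, while masking the redundant part of novel patches and alts keeps the chromosome as the single alignment target for shared sequence.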
The Changing Reference
Dilthey et al.; Paten et al.
Outline
• Distinguishing features of the human reference assembly
• Implications for genomic analyses and tools
• Where do you get assembly-relevant data?