150224 grc kms
![Page 1: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/1.jpg)
Characterizing extreme diversity in the human genome using a single haplotype genomic resource
Karyn Meltz Steinberg, Ph.D.
AGBT 2015 GRC Workshop
@KMS_Meltzy
![Page 2: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/2.jpg)
Human Genetic Variation
Types of genetic variants
[Figure: frequency versus size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr); labeled classes: SNP, copy number variants, trisomy/monosomy]
Slide courtesy of S. Girirajan
![Page 3: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/3.jpg)
Human Genetic Variation
Types of genetic variants
[Figure: frequency versus size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr); labeled classes: SNP, copy number variants, trisomy/monosomy]
How do we assay them?
[Figure: throughput versus size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr)]
Slide courtesy of S. Girirajan
![Page 4: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/4.jpg)
Human Genetic Variation
Types of genetic variants
[Figure: frequency versus size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr); labeled classes: SNP, copy number variants, trisomy/monosomy]
How do we assay them?
[Figure: throughput versus size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr); labeled assay: SNP genotyping]
Slide courtesy of S. Girirajan
![Page 5: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/5.jpg)
Human Genetic Variation
Types of genetic variants
[Figure: frequency versus size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr); labeled classes: SNP, copy number variants, trisomy/monosomy]
How do we assay them?
[Figure: throughput versus size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr); labeled assays: SNP genotyping, Array-CGH, Karyotyping]
Slide courtesy of S. Girirajan
![Page 6: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/6.jpg)
Human Genetic Variation
Types of genetic variants
[Figure: frequency versus size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr); labeled classes: SNP, copy number variants, trisomy/monosomy]
How do we assay them?
[Figure: throughput versus size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr); labeled assays: SNP genotyping, Array-CGH, Karyotyping, Sequencing]
Slide courtesy of S. Girirajan
![Page 7: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/7.jpg)
Extreme diversity in the human genome
• <99.5% identity to the reference
• Refractory to traditional sequencing efforts
• Loci often contain gene families associated with immune response and xenobiotic metabolism
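To make the 99.5% threshold concrete, here is a minimal sketch of how percent identity could be computed from a pairwise alignment. The function names, the gap character, and the rule for counting gap columns are illustrative assumptions and are not taken from the talk; alignment tools differ in how they treat gaps.

```python
def percent_identity(ref_aln, alt_aln, gap="-"):
    """Percent of aligned columns in which both rows carry the same base.

    Assumes the two strings are rows of one pairwise alignment (equal length).
    Columns where both rows are gaps are ignored; a gap opposite a base
    counts as a mismatch (an assumed convention).
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    columns = [(r, a) for r, a in zip(ref_aln, alt_aln)
               if not (r == gap and a == gap)]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for r, a in columns if r == a)
    return 100.0 * matches / len(columns)


def is_extremely_diverse(ref_aln, alt_aln, threshold=99.5):
    """True when identity to the reference falls below the slide's 99.5% cutoff."""
    return percent_identity(ref_aln, alt_aln) < threshold
```

For example, two mismatches in a 200-column alignment give 99.0% identity, which would fall under the cutoff.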
![Page 8: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/8.jpg)
HLA is a classic example of an extremely diverse locus
• Critical to immune response
• Characterized by overdominant selection
• Alleles are linked and segregate as distinct haplotypes
• Shaped by gene duplication and diversification
![Page 9: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/9.jpg)
![Page 10: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/10.jpg)
![Page 11: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/11.jpg)
![Page 12: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/12.jpg)
![Page 13: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/13.jpg)
Segmental duplications can predispose loci to further rearrangement via non-allelic homologous recombination (NAHR)
![Page 14: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/14.jpg)
Segmental duplications can predispose loci to further rearrangement via NAHR
![Page 15: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/15.jpg)
Diploid Genome
[Figure: allelic copies and repeat copies (noted by color difference) carrying different bases at the same position]
With a diploid genome, there is significant ambiguity sorting allelic copies from repeat copies
Haploid Genome
[Figure: repeat copies only (noted by color differences); bases shown: A C C C]
With a haploid genome, allelic differences are eliminated, and base differences are likely indicative of repeat copies
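The ambiguity can be illustrated with a toy example. This is a sketch on invented data: the sequences, the copy and haplotype labels, and the helper `variable_sites` are hypothetical. In a diploid sample, sequence collapsed onto one locus could come from two alleles of each of two repeat copies; in a haploid sample only the repeat copies remain, so any variable column points to a difference between copies.

```python
def variable_sites(sequences):
    """Return 0-based columns at which equal-length sequences disagree."""
    sequences = list(sequences)
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("sequences must have equal length")
    return [i for i, column in enumerate(zip(*sequences))
            if len(set(column)) > 1]


# Hypothetical 4-base window from two repeat copies on two haplotypes.
diploid = {
    ("copy1", "hapA"): "ACGT",
    ("copy1", "hapB"): "ACGA",  # allelic difference at column 3
    ("copy2", "hapA"): "CCGT",  # difference between repeat copies at column 0
    ("copy2", "hapB"): "CCGT",
}

# A haploid genome carries only one haplotype of each repeat copy.
haploid = {key: seq for key, seq in diploid.items() if key[1] == "hapA"}

diploid_sites = variable_sites(diploid.values())  # allelic and repeat-copy sites mixed
haploid_sites = variable_sites(haploid.values())  # repeat-copy sites only
```

In the diploid case the two variable columns cannot be told apart without extra information; in the haploid case the single remaining column can be attributed to the repeat copies.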
![Page 16: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/16.jpg)
Hydatidiform mole
![Page 17: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/17.jpg)
SRGAP2: homology between genes
Shows nearly identical segments between SRGAP2A and SRGAP2 paralogs
Shows homology between SRGAP2B and SRGAP2C
Dennis et al. 2012
[Figure tracks: SRGAP2A, SRGAP2B, SRGAP2C]
![Page 18: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/18.jpg)
1q21 patch alignment to chromosome 1
[Figure: 1q21 patch with alignment positions labeled 1q32, 1q21, 1p21]
![Page 19: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/19.jpg)
Hydatidiform mole
Let’s sequence and assemble the whole genome!
![Page 20: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/20.jpg)
CHM1_1.1 Assembly
• Reference-guided assembly – SRPRISM v2.3, R. Agarwala
• Alignment of Illumina reads to GRCh37 primary assembly
• CHORI-17 BAC clone tilepaths were then incorporated
• 428 total clones
• 324 clones in 45 tilepaths
• 104 clones as singletons
http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2
Total Sequence Length: 3,037,866,619 bp
Total Assembly Gap Length: 210,229,812 bp
Number of Scaffolds: 163
Scaffold N50: 50,362,920 bp
CHM1 Assembly Paper: Steinberg et al. 2014, Genome Research
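For readers unfamiliar with the assembly statistics above, the following sketch shows how a scaffold N50 is conventionally computed and derives the ungapped length from the two totals on the slide. The individual scaffold lengths of CHM1_1.1 are not given in the talk, so the list in the usage note is a made-up toy example; the constant names are also assumptions.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Totals reported on the slide for CHM1_1.1.
TOTAL_SEQUENCE_LENGTH = 3_037_866_619
TOTAL_GAP_LENGTH = 210_229_812

# Derived values (not stated on the slide).
ungapped_length = TOTAL_SEQUENCE_LENGTH - TOTAL_GAP_LENGTH
gap_fraction = TOTAL_GAP_LENGTH / TOTAL_SEQUENCE_LENGTH
```

With the toy lengths `[2, 2, 2, 3, 3, 4, 8, 8]`, the two longest sequences already cover half of the 32 total bases, so `n50` returns 8. The derived ungapped length is 2,827,636,807 bp, meaning gaps make up roughly 7% of the reported total.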
![Page 21: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/21.jpg)
CHM1_1.1 assembly is highly contiguous compared to other WGS based assemblies
![Page 22: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/22.jpg)
Integrating BAC tiling paths improved assembly
![Page 23: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/23.jpg)
Integrating BAC tiling paths improved assembly
![Page 24: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/24.jpg)
Alignment of CHM1 Illumina data to assembly revealed regions of extreme heterogeneity
| | Heterozygous | Homozygous | Total |
|---|---|---|---|
| Variants | 64,033 | 22,513 | 86,546 |
| In RepeatMasked (RM) sequence | 37,060 | 14,833 | 51,893 |
| In Segmental duplication (SD) | 30,670 | 4,843 | 35,513 |
| In RM and SD | 51,466 | 17,174 | 68,640 |
| Ts:Tv | 1.5 | 0.7 | 1.2 |
| Mean SNV density/kb | 0.02 | 0.008 | 0.03 |
There are significantly more heterozygous variants in repetitive sequence than expected (p < 1×10⁻¹⁶). BAC ends mapping discordantly and in multiple loci are significantly enriched for segmental duplications (p < 1×10⁻⁵).
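The Ts:Tv row in the table is the ratio of transitions (purine to purine or pyrimidine to pyrimidine) to transversions. A minimal sketch of that calculation follows; the function names and the input format (a list of reference/alternate base pairs) are assumptions for illustration, not the pipeline used in the talk. Genome-wide human SNV call sets typically show a ratio of roughly 2, so low values are often read as a hint of calls arising from mismapped paralogous sequence; that reading is an inference, not a statement from the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions, False for transversions."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions
```

For example, three transitions and two transversions give a ratio of 1.5.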
![Page 25: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/25.jpg)
Identified 549 novel protein coding genes not annotated in GRCh37
![Page 26: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/26.jpg)
CHM1 BioNano Genome Map Aligned to GRCh38
[Figure tracks: GRCh38 and CHM1 BioNano Map; annotation: ~15 kb additional data]
![Page 27: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/27.jpg)
BioNano SV Calls Identified Assembly Problems
[Figure: CHM1 BioNano Map aligned to the CHM1_1.1 Assembly; annotated examples: Collapse, Expansion in Assembly, Gap in Sequence]
![Page 28: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/28.jpg)
Conclusion
• Extremely diverse regions of the genome are difficult to characterize due to issues distinguishing allelic from paralogous duplications
• CHM1_1.1 is a highly contiguous single haplotype representation of the genome
• Identified regions of misassembly or reference-ized regions
• Utilize long read technology and nanopore technology to attempt to fix these regions
![Page 29: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/29.jpg)
![Page 30: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/30.jpg)
![Page 31: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/31.jpg)
Need to add more diversity to reference
• Finish another hydatidiform mole to platinum status
• Finish 5 genomes to gold status
• NA19240 (Yoruban)
• NA12878 (European)
• HG00513 (Han Chinese)
• 2 “wildcards”
• Looking for underrepresented minority population
• Add high quality alternative sequences to reference to create a population reference graph or “pan genome”
![Page 32: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/32.jpg)
Use colored de Bruijn graph structure to represent population reference graph
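As a rough illustration of the idea, a colored de Bruijn graph stores each k-mer once and tags it with the set of samples ("colors") that contain it. The sketch below is a toy in-memory version; the function names, the sample names, and the sequences are invented for illustration, and the real tools that implement this structure are not named on the slide.

```python
from collections import defaultdict


def build_colored_dbg(samples, k):
    """Map each k-mer (an edge between two (k-1)-mers) to the samples containing it.

    `samples` is a dict of sample name -> sequence.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    colors = defaultdict(set)
    for name, sequence in samples.items():
        for i in range(len(sequence) - k + 1):
            colors[sequence[i:i + k]].add(name)
    return dict(colors)


def successors(graph, node):
    """Edges (k-mers) leaving a (k-1)-mer node, with their colors."""
    return {kmer: cols for kmer, cols in graph.items() if kmer[:-1] == node}


# Hypothetical sequences differing by one base.
graph = build_colored_dbg({"reference": "ACGTAC", "CHM1": "ACGGAC"}, k=3)
branch = successors(graph, "CG")
```

Here the shared k-mer `ACG` carries both colors, while the node `CG` branches into `CGT` (reference only) and `CGG` (CHM1 only), which is how a variant appears as a bubble with sample-specific paths.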
![Page 33: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/33.jpg)
Bioinformatic tool development in the future
• Alignment of short reads to population reference graph
• Variant calling
• Variant reporting/Haplotype resolution
![Page 34: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/34.jpg)
Adapted from Weinstein et al., 2009
![Page 35: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/35.jpg)
The GRCh37 reference sequence was assembled from three lymphoblastoid cell lines
Not a true haplotype
Incomplete
![Page 36: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/36.jpg)
The CH17 haplotype is quite different from the reference
![Page 37: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/37.jpg)
Novel insertion
The CH17 haplotype is quite different from the reference
![Page 38: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/38.jpg)
Complex Indel
The CH17 haplotype is quite different from the reference
![Page 39: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/39.jpg)
Hotspot/Recurrent Mutation
The CH17 haplotype is quite different from the reference
![Page 40: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/40.jpg)
60 kbp Insertion (Hotspot)
[Figure panels: African, Asian, European]
![Page 41: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/41.jpg)
Duplication (influenza)
The CH17 haplotype is quite different from the reference
![Page 42: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/42.jpg)
44 kbp Duplication (influenza)
[Figure panels: African, Asian, European]
![Page 43: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/43.jpg)
Summary of hydatidiform mole sequence
• 47 functional V genes
• 24 total variants (SNV and CNV) involving 29 IGHV genes
• 5 structural variants
• 19 single nucleotide variants
• 15 non-synonymous mutations
• 20 out of 24 variants represent differences in amino acid sequence or gene copy number
![Page 44: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/44.jpg)
Summary of hydatidiform mole sequence
• 47 functional V genes
• 24 total variants (SNV and CNV) involving 29 IGHV genes
• 5 structural variants
• 19 single nucleotide variants
• 15 non-synonymous mutations
• 20 out of 24 variants represent differences in amino acid sequence or gene copy number
100 kbp of novel sequence
![Page 45: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/45.jpg)
Current status of CHM1 resources
• CHORI-17 BAC Library (created from CHM1 cell line)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs (>750 have been sequenced, with 592 of them in GenBank as phase 3)
• Active cell line
• >100X coverage Illumina 100 bp reads
• 300 bp, 500 bp, and 3 kb inserts
• Reference assisted assembly CHM1_1.1
• BioNano genome map
• >50X coverage of PacBio long read data
![Page 46: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/46.jpg)
CHM1_1.1 Assembly
• Reference-guided assembly – SRPRISM v2.3, R. Agarwala
• Alignment of Illumina reads to GRCh37 primary assembly
• CHORI-17 BAC clone tilepaths were then incorporated
• 428 total clones
• 324 clones in 45 tilepaths
• 104 clones as singletons
• Comparison back to GRCh37 reference to provide appropriate gap sizes
• Assembly submitted to GenBank
• http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2
• Steinberg et al., 2014, Genome Research (Dec;24(12):2066-76)
![Page 47: 150224 grc kms](https://reader033.fdocuments.us/reader033/viewer/2022051617/55a515e51a28abed7f8b45d4/html5/thumbnails/47.jpg)
LILR (leukocyte immunoglobulin-like receptor)/KIR (killer immunoglobulin receptor)
Immunoglobulin Kappa chain
Immunoglobulin Lambda chain
TCRA/B
17q21.31 inversion polymorphism
Immunoglobulin heavy chain locus
CYP2D6
SRGAP2
15q13.3 inversion polymorphism