Understanding Human Variation - EMBL-EBI · 2013-04-16 · Understanding Human Variation Fiona...
Understanding Human Variation
Fiona Cunningham European Bioinformatics Institute
November 2012
Talk outline
• Genetic variation
  – Different types
  – Origins
• Why are all those variants important?
  – Importance and practical applications
• How is variation data discovered?
  – Investigating genetic variation and progress over time
• Ensembl and modern Bioinformatics
  – Building infrastructure for research
  – Interpreting variants
The Reference Human Genome
• Published 2001
• Finished in 2004
• Still incomplete
Every individual has a unique genome
[Figure: a long block of genomic DNA sequence with individual bases picked out]
Single nucleotide variants
A single nucleotide variant, or single nucleotide polymorphism (SNP), is a change that happens at one position in the DNA sequence. (In double-stranded DNA, this changes a base pair.)
Person 1. TTCCCTA
Person 2. TTCCTTA
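The comparison between the two people can be sketched in a few lines of Python. The helper name `single_nucleotide_differences` is invented for this illustration, and it assumes the two sequences are already aligned and of equal length.

```python
def single_nucleotide_differences(seq_a, seq_b):
    """Return (position, base_a, base_b) for each mismatch; positions are 1-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


person_1 = "TTCCCTA"
person_2 = "TTCCTTA"
print(single_nucleotide_differences(person_1, person_2))
```

For the two sequences on the slide, the only difference is at position 5, where Person 1 has C and Person 2 has T.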
Other short variants
[Figure: example sequences illustrating the following]
• Insertion
• Dinucleotide substitution
• Deletion
• Structural variants: large
  – deletions
  – duplications
  – insertions
  – translocations
• Copy number variants (CNVs): sequence repeated ‘n’ times in an individual
Large scale: >50 base pairs to megabases
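As a rough sketch, the variant classes on these slides can be told apart from the reference and alternate allele strings. The function name `classify_variant` and the use of the longer allele as the "size" are assumptions made for this example; only the 50 base pair boundary comes from the slide, and real tools use richer definitions.

```python
def classify_variant(ref, alt, sv_threshold=50):
    """Classify a variant from its reference and alternate allele strings."""
    size = max(len(ref), len(alt))
    if size > sv_threshold:
        # Assumption: anything longer than the slide's 50 bp boundary is "structural".
        return "structural variant"
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "substitution"
    return "insertion" if len(alt) > len(ref) else "deletion"


for ref, alt in [("C", "T"), ("CT", "TG"), ("A", "ATG"), ("ATG", "A")]:
    print(ref, ">", alt, "->", classify_variant(ref, alt))
```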
Origin of Variants
• Appearance of new variants by mutation
• Survival of alleles through early generations against the odds
• Increase of the allele to a substantial population frequency
• Fixation of the allele in populations
E.g. more copies of CCL3L1: HIV resistance
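Why survival is "against the odds" can be illustrated with a small simulation of chance alone. This sketch uses a Wright-Fisher style resampling model, which is a standard textbook model and not something taken from the talk; the function name, population size and seeds are arbitrary choices for illustration, and selection is ignored.

```python
import random


def follow_new_allele(copies_in_population=40, seed=0, max_generations=100_000):
    """Follow one new allele copy until it is lost or fixed by chance alone."""
    rng = random.Random(seed)
    count = 1  # a new mutation starts as a single copy
    for generation in range(1, max_generations + 1):
        freq = count / copies_in_population
        # Each copy in the next generation is drawn at random from the current one.
        count = sum(rng.random() < freq for _ in range(copies_in_population))
        if count == 0:
            return "lost", generation
        if count == copies_in_population:
            return "fixed", generation
    return "segregating", max_generations


outcomes = [follow_new_allele(seed=s)[0] for s in range(200)]
print(outcomes.count("fixed"), "of", len(outcomes), "new alleles reached fixation")
```

In this model most new alleles disappear within a few generations, and only a small minority drift all the way to fixation.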
Germline variation: passed to descendants. Somatic Mutation: not passed to descendants.
Disease and differences
• Variation: interesting for evolution, population migration and adaptation
  – Differences in phenotype: height, intelligence, body mass
  – Single variant disorders: sickle cell anaemia, cystic fibrosis
  – Complex disease: bipolar disorder, schizophrenia, Alzheimer’s
  – Norovirus protection (homozygous for alt allele of rs601338)
• SV, copy number variation: gene dosage, too few or too many copies
  – Lupus, autoimmune disease: too few copies of FCGR3B
  – HIV infection resistance: more copies of CCL3L1
  – Intellectual disorders
Practical applications of variation
Risk assessment
• Of radiation exposure, mutagenic chemicals and cancer-causing toxins
Anthropology, evolution, and human migration
• Mutation lineages, mitochondrial inheritance and Y chromosomes
• Comparative genomics: for understanding diseases and traits
Molecular and clinical medicine
• Diagnosis, detection and treatment:
  – e.g. myotonic dystrophy, fragile X syndrome, inherited colon cancer, Alzheimer's disease, and familial breast cancer
• Pharmacogenomics: "custom drugs"
Practical applications of variation
DNA forensics
• Identification of
  – suspects
  – exonerate innocents
  – catastrophe victims
  – endangered species (against poachers)
Agriculture, livestock breeding
• Disease-, insect-, and drought-resistant crops
• Healthier, more productive, disease-resistant farm animals
• More nutritious produce
• Reducing the costs of agriculture
Mendel (1822 – 1884)
• "Father of genetics" for his study of the inheritance of traits in pea plants
• 1866: published the results of the inheritance of "factors" in pea plants
• Patterns in pea traits explained by inherited factors
The SNP Consortium (TSC)
• 1999: private/public collaboration
• Share costs to produce a public resource of single nucleotide polymorphisms (SNPs)
• Goal: discover 300 000 SNPs in two years
• Result: 1.4 million SNPs by 2001
• 24 people representing several races
Location by mapping flanking sequence
Genome Sequencing
• 13-year project
• 2001: Human genome working drafts
• Data unit of approximately 10x coverage of human
• 10 years and cost about $3 billion
• Olympics 2012: $19 billion
Human Genome Project
Finding all human SNPs
• 3 major populations
• Alleles and frequencies
• Tag variants
HapMap Project, 2002
Goal: find all SNPs present across different populations (“all” means present at a frequency of at least 5%)
http://hapmap.ncbi.nlm.nih.gov/
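The 5% cut-off refers to allele frequency, which can be estimated by counting alleles across genotyped individuals. The sketch below uses invented genotypes and made-up helper names (`allele_frequencies`, `is_common`); it is an illustration of the arithmetic, not the HapMap pipeline.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Count alleles across diploid genotypes such as "CT" and return frequencies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_common(genotypes, threshold=0.05):
    """True if the rarest observed allele reaches the threshold (5% on the slide)."""
    freqs = allele_frequencies(genotypes)
    return len(freqs) > 1 and min(freqs.values()) >= threshold


# Invented genotypes for ten individuals at one site.
sample = ["CC", "CT", "CC", "CC", "TT", "CC", "CT", "CC", "CC", "CC"]
print(allele_frequencies(sample))
print(is_common(sample))
```

Here the ten individuals carry 20 alleles in total, 16 C and 4 T, so the T allele has a frequency of 0.2 and the site counts as common.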
Haplotypes and LD
• A haplotype can be thought of as a collection of alleles.
• ‘LD’ (Linkage Disequilibrium): a measure of how likely two alleles will be inherited together
Important project. Still very highly regarded today.
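One common way to put a number on LD is the coefficient D and its normalised form r². These are standard population-genetics measures rather than something defined on the slide, and the function below is a minimal sketch that assumes phased two-site haplotypes written as two-character strings.

```python
def linkage_disequilibrium(haplotypes, allele_a, allele_b):
    """Return (D, r_squared) for allele_a at site 1 and allele_b at site 2.

    haplotypes: list of two-character strings, one character per site.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h == allele_a + allele_b for h in haplotypes) / n
    d = p_ab - p_a * p_b  # excess of the A-B haplotype over independence
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denominator if denominator > 0 else float("nan")
    return d, r_squared


# Invented haplotypes: A always travels with G, C always with T.
print(linkage_disequilibrium(["AG", "AG", "CT", "CT"], "A", "G"))
# Invented haplotypes: every combination is equally frequent.
print(linkage_disequilibrium(["AG", "AT", "CG", "CT"], "A", "G"))
```

When two alleles are always inherited together r² is 1, and when they are independent it is 0. This is what makes "tag variants" useful: one genotyped SNP can stand in for its neighbours.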
Association studies
• Genome Wide Association Studies (GWAS)
• E.g. WTCCC 2005
• Gather phenotypes
Use common SNPs to understand common disease
Diseases: diabetes, Crohn’s disease, breast cancer, coronary artery disease, bipolar disorder, hypertension, multiple sclerosis, …
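At its simplest, an association test asks whether an allele is more frequent in cases than in controls. The sketch below computes a Pearson chi-square statistic on a 2x2 table of allele counts; the function name and the counts are invented for illustration, and this is not a description of the WTCCC analysis.

```python
def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / denominator


# Invented counts: the alternate allele is more frequent in cases than in controls.
print(allelic_chi_square(case_alt=120, case_ref=80, control_alt=90, control_ref=110))
```

A genome-wide study repeats a test of this kind for every SNP, so the results must be corrected for the very large number of tests performed.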
[Figure: disk storage (TB) by year, 1996 to 2009; vertical axis from 0 to 6000]
1000 Genomes Project: Genome Sequencing
Finding all human SNPs
• 2008: World-wide capacity dramatically increasing
• Goal: Find genetic variants with frequencies of >1%
• In 3 weeks data double that of past 13 years
Lactose tolerance
1000 Genomes Populations
YRI Yoruba
MKK Maasai
LWK Luhya
ASW African
TSI Toscani
CHS Han (South)
CHD Chinese
JPT Japanese
MEX Mexican
GIH Gujarati
CEU Northern and Western European
GBR British
IBS Spain
FIN Finnish
CHB Han Chinese
CDX Chinese Dai
KHV Vietnam
GWD The Gambia
ACB Barbados
AJM African
PUR Puerto Rican
CLM Colombian
PEL Peruvian
PJL Pakistani
Today…
• 2012: Every 14 minutes (£4000)
  – £600 exome
• Rare disease: 1 in 17 people in the UK
  – There are over 6,000 recognised rare diseases.
  – DDD: Deciphering Developmental Disorders
• Ongoing projects:
  – UK10K: 6000 cases, 4000 controls
Challenges: for EBI and our users
Sequencing machine
Scientist
Timothy K. Stanton
EBI
• EBI’s mission: To provide freely available data and bioinformatics services to all facets of the scientific community to promote scientific progress
• The world’s most comprehensive collection of molecular databases: from DNA and protein sequence to complex pathways and networks
  – Integration and community engagement is at the heart of these efforts
• European node for globally coordinated data collection and dissemination projects
Genome-wide data from Ensembl
Across species Within species
Synteny
Pick a genome
Orthology
Genomic alignments
SNPs
Genes Chromosomes
Gene regulation
• Ensembl’s mission: to enable genomic science
Species with variation data in Ensembl
Data access: variants on the genome
Data access: variants per protein
Ensembl Variation
Variation annotation: phenotype data
Consequence Types
[Figure: a transcript (ENST) with labelled regions: regulatory, 5’ upstream, 5’ UTR, ATG, coding (synonymous and non-synonymous), splice sites, intronic, 3’ UTR, poly-A tail (AAAAAAA), 3’ downstream]
• A SNP can be in an exon in some transcripts, and in an intron in another.
Consequences of variants in the protein-coding sequence
• GAG > GAA, Glu > Glu: synonymous (silent), no change in amino acid
• GAG > GGG, Glu > Gly: non-synonymous (missense), change in amino acid
• GAG > TAG, Glu > STOP: stop gain (nonsense), introduces a stop codon
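The three codon changes shown for GAG can be reproduced with the standard genetic code. The sketch below builds the codon table from the usual compact one-letter string (with `*` for stop); the function name `coding_consequence` and its simplified labels are choices made for this illustration and are much coarser than the consequence terms used by Ensembl.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order, one letter per codon, "*" for stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coding_consequence(ref_codon, alt_codon):
    """Label a codon change; stop-to-stop changes are lumped in with synonymous."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    if ref_aa == "*":
        return "stop lost"
    return "missense"


for alt in ("GAA", "GGG", "TAG"):
    print("GAG >", alt, coding_consequence("GAG", alt))
```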
Added more detailed terms (listed 5’ to 3’)
• regulatory region
• TF binding site
• intergenic
• upstream
• 5 prime UTR
• initiator codon
• synonymous variant
• missense variant
• inframe insertion
• inframe deletion
• stop gained
• frameshift variant
• coding sequence variant
• splice donor
• splice acceptor
• splice region
• intron variant
• stop lost
• stop retained variant
• incomplete terminal codon
• 3 prime UTR
• downstream
[Figure: sequencing reads compared with the reference (Ref); a position where the reads disagree is a possible SNP]
Data access: variation annotation
In-dels
Structural variants
GWAS
Interpretation of variants
• Interpretation of variants is key
• Ensembl is well placed for doing this with contributions from all:
  – High-quality evidence-based gene build
  – Multiple alignments
  – Regulatory information
  – Variation and phenotype information
  – VEP for all types of variation
    • Good support
    • Fast script version
    • REST API
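As a hedged illustration of the REST route, the sketch below queries the VEP endpoint of the public Ensembl REST server for a known variant identifier. The URL pattern and the `most_severe_consequence` field reflect the REST service as the editor understands it and should be checked against the current Ensembl REST documentation; the function names are invented for this example.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def vep_url(variant_id, species="human"):
    """Build the VEP lookup URL for a known variant identifier such as an rsID."""
    return f"{ENSEMBL_REST}/vep/{species}/id/{variant_id}"


def fetch_vep(variant_id, species="human"):
    """Query VEP over REST and return the decoded JSON (requires network access)."""
    request = urllib.request.Request(
        vep_url(variant_id, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Example call (needs network), using the variant named earlier in the talk:
#     for result in fetch_vep("rs601338"):
#         print(result.get("most_severe_consequence"))
print(vep_url("rs601338"))
```

For large numbers of variants the script version of VEP is likely to be the better fit, since each REST call is a separate network request.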
Variant Effect Predictor
Summary
• Importance of variants: their roles in disease and phenotypic differences
• Classes of variants
  – Short (single nucleotide) variants: SNPs, indels
  – Structural variants
• Effects of variants: non-synonymous, stop lost etc.
• Sources of variants: dbSNP, mutation databases
  – Big projects: 1000 Genomes, HapMap
• Bioinformatics infrastructure projects: Ensembl