publications - European Bioinformatics Institute · 6th Sept Variation data in Ensembl and the...

Post on 29-May-2020


• Ensembl training materials are protected by a CC BY license

http://creativecommons.org/licenses/by/4.0/

• If you wish to re-use these materials, please credit Ensembl for their creation

• If you use Ensembl for your work, please cite our papers

http://www.ensembl.org/info/about/publications.html

Training materials

Variation data in Ensembl

Erin Haskell

helpdesk@ensembl.org

@ensembl /@ensemblgenomes

Questions?

○ We’ve muted all of your microphones

○ Join our Slack workspace and ask questions (link in your registration confirmation email)

○ My Ensembl colleagues will respond during the talk

○ Please use @username to reply to a specific person

Emily Perry Astrid Gall

Course exercises

All materials and exercises are located here:

http://www.ebi.ac.uk/training/online/course/ensembl-browser-webinar-series-2016

A link to exercises and their solutions will appear in the page hierarchy

This text will be replaced by a YouTube (link to YouKu too) video of the webinar and a pdf of the slides.

The “next page” will be the exercises

Get help with the exercises

• Use the exercise solutions in the online course

• Join our Slack workspace and discuss the exercises with everybody in dedicated channels (register to get sent a link)

• Email us helpdesk@ensembl.org

This webinar course

Date | Webinar topic | Instructor
4th Sept | Introduction to Ensembl ✔ | Astrid Gall
4th Sept | Ensembl genes ✔ | Emily Perry
6th Sept | Variation data in Ensembl and the Ensembl VEP | Erin Haskell
6th Sept | Comparing genes and genomes with Ensembl Compara | Astrid Gall
11th Sept | Finding features that regulate genes – the Ensembl Regulatory Build | Emily Perry
11th Sept | Data export with BioMart | Erin Haskell
13th Sept | Uploading your data to Ensembl | Astrid Gall
13th Sept | Introduction to the Ensembl REST APIs | Emily Perry

Variation data in Ensembl

Erin Haskell

helpdesk@ensembl.org

@ensembl /@mycoacia

Session structure

Presentation:
Part 1: Ensembl variation data
Part 2: The Ensembl Variant Effect Predictor (VEP)

Demo:
Part 1: Viewing variation data in the browser
Part 2: Using the VEP

Exercises:
Available on the Train online site

Ensembl variation data

- What types of variants are in Ensembl?

- Where does the data come from?

- What are the biological consequences of variants?

- Things to watch out for

The Ensembl Variant Effect Predictor (VEP) tool

- What data can I use with the VEP?

- Identifying known variants

- Predicting consequences for novel variants

Session Overview

What types of variant are in Ensembl?

ensembl.org/info/genome/variation/index.html

Two broad categories:

1. Sequence variants (small alterations, ≤50 bp)

2. Structural variants (larger alterations, >50 bp)

Variant type 1: Sequence variants

● Single nucleotide polymorphisms (SNP/SNV)

ref...TTGACGTA...

alt...TTGGCGTA...

● Small insertions & deletions

ref...TTGACGTA...

ins...TTGAGCGTA...

del...TTG-CGTA...

indel...TTGGCTCGTA...

http://www.ensembl.org/info/genome/variation/prediction/classification.html
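The ref/alt examples above can be turned into a small classifier. This is only a sketch, not Ensembl's own code: the function name is made up for illustration, and the "substitution" class for same-length multi-base changes is an assumption based on the classification page linked above.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a sequence variant from its reference and alternate alleles.

    Alleles are plain base strings; "-" (or an empty string) means no bases.
    """
    ref = ref.replace("-", "").upper()
    alt = alt.replace("-", "").upper()
    if ref == alt:
        raise ValueError("reference and alternate alleles are identical")
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"            # one base replaced by one base
    if not ref:
        return "insertion"      # bases added, nothing removed
    if not alt:
        return "deletion"       # bases removed, nothing added
    if len(ref) == len(alt):
        return "substitution"   # assumed class for same-length multi-base changes
    return "indel"              # bases removed and a different number added
```

With the slide's examples, `classify_variant("A", "G")` gives an SNV, `("-", "G")` an insertion, `("A", "-")` a deletion and `("A", "GCT")` an indel.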

● Copy number variation (CNV)

● Inversion - nucleotide sequence inverted at same position

● Translocation - nucleotide sequence moved to a new position

Variant type 2: Structural variants

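[Figure: three diagrams. CNV: reference vs. gain and loss. Inversion: reference vs. inverted sequence. Translocation: reference vs. sequence translocated on the same chromosome or to a different chromosome.]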

http://www.ensembl.org/info/genome/variation/prediction/classification.html

Where does the data come from?

Variant import → Quality control → Linked data → Ensembl analysis

The Ensembl variation process

Ensembl variation process: Import

Import variant data from publicly available archives and data repositories.

http://www.ensembl.org/info/genome/variation/species/sources_documentation.html

EVA

...and many many more

Data import: 23 species with variation data (Ensembl)

http://www.ensembl.org/info/genome/variation/species/species_data_types.html

http://ensemblgenomes.org/info/genomes?variation=1

Division Number of species with variation data

Bacteria 0

Fungi 8

Metazoa 4

Plants 12

Protists 3

Data import: 27 species with variation data (Ensembl Genomes)

Ensembl variation process: QC

● Mapping to reference assembly

○ GRCh37 GRCh38

● Checks on alleles

● Checks for IUPAC ambiguity codes

● Excluding ‘suspect’ variants

http://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#quality_control
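To make the allele checks more concrete, here is a minimal sketch of the kind of test that could flag IUPAC ambiguity codes in an allele string such as `A/G`. It is not Ensembl's QC pipeline: the function name, the messages and the exact set of checks are assumptions for illustration.

```python
IUPAC_AMBIGUITY_CODES = set("RYSWKMBDHVN")   # codes that stand for more than one base
UNAMBIGUOUS_BASES = set("ACGT")

def check_alleles(allele_string: str) -> list[str]:
    """Return a list of problems found in an allele string such as 'A/G'."""
    problems = []
    alleles = allele_string.upper().split("/")
    if len(alleles) < 2:
        problems.append("fewer than two alleles")
    if len(set(alleles)) != len(alleles):
        problems.append("duplicated allele")
    for allele in alleles:
        bases = set(allele) - {"-"}          # "-" marks an empty allele
        if bases & IUPAC_AMBIGUITY_CODES:
            problems.append(f"ambiguity code in allele {allele}")
        elif bases - UNAMBIGUOUS_BASES:
            problems.append(f"unexpected character in allele {allele}")
    return problems
```

An empty list means the allele string passed these particular checks; a real pipeline would also look at mapping and at the other criteria described on the page linked above.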

http://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html

Ensembl variation process: Linked data

Import ‘accessory’ data

● Phenotype/disease

● Allele frequencies

● Publication data

[Figure: world map of the 1000 Genomes Project populations: CEU, CHB, JPT, LWK, MSL, ASW, YRI, TSI, MXL, GIH, PUR, CLM, PEL, ACB, GWD, IBS, GBR, FIN, CHS, KHV, CDX, PJL, BEB, ITU, STU, ESN]

Sequencing 2,500 individuals at 4X coverage

Linked data: 1000 genomes project

Regions: America, Africa, Europe, East Asia, Central-South Asia

http://www.internationalgenome.org

macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/

The Genome Aggregation Database provides allele frequency data from 7 different populations

Linked data: gnomAD allele frequencies

[Figure: axis label: Sample number]

Ensembl variation process: Analysis

Ensembl predicts:

● Variant consequences

● Protein function prediction

● Linkage disequilibrium data

● Variant conservation across species

http://www.ensembl.org/info/genome/variation/prediction/index.html

http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html

Analysis: Variant consequence terms

Standardised variant consequence terms as defined by

http://www.sequenceontology.org

http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html

- For missense variants only

- Two prediction algorithms:

- SIFT (Sorting Intolerant From Tolerant)

- PolyPhen (Polymorphism Phenotyping)

Score changes in amino acid sequence based on:

- How conserved the amino acid is

- The chemical change in the amino acid

Analysis: Pathogenicity scores

ensembl.org/info/genome/variation/predicted_data.html#sift

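[Figure: SIFT scores run from 0 to 1, with a cut-off at 0.05: Deleterious below the cut-off, Tolerated above it. PolyPhen scores run from 0 to 1 across three classes: Benign, Possibly damaging, Probably damaging; the slide also shows tick labels 0.1 and 0.2.]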

Analysis: Pathogenicity scores
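As a rough illustration of how a score becomes a prediction label, the sketch below applies the 0.05 SIFT cut-off shown on the slide. The PolyPhen cut-offs are not legible on the slide, so they are left as explicit parameters here; the function names and the treatment of scores exactly on a boundary are assumptions.

```python
def sift_prediction(score: float, cutoff: float = 0.05) -> str:
    """Label a SIFT score: low scores are deleterious, the rest tolerated."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores range from 0 to 1")
    # Assumption: a score exactly at the cut-off counts as tolerated.
    return "deleterious" if score < cutoff else "tolerated"

def polyphen_prediction(score: float, benign_max: float, possibly_max: float) -> str:
    """Label a PolyPhen score using caller-supplied class boundaries."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("PolyPhen scores range from 0 to 1")
    if not benign_max < possibly_max:
        raise ValueError("benign_max must be below possibly_max")
    if score > possibly_max:
        return "probably damaging"
    if score > benign_max:
        return "possibly damaging"
    return "benign"
```

For example, `polyphen_prediction(0.95, benign_max=0.446, possibly_max=0.908)` uses boundaries assumed from the Ensembl documentation linked above; check them there before relying on them. Note that the two scores point in opposite directions: a low SIFT score and a high PolyPhen score both suggest a damaging change.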

Analysis: Linkage disequilibrium

Linkage Disequilibrium (LD)

“the non-random association of alleles at 2 or more loci within a given population”

or

“how often two variants or specific sequences are inherited together”
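The definition above is usually quantified with D, D' and r². The sketch below computes them for two biallelic loci from haplotype and allele frequencies using the standard textbook formulas; it is not the Ensembl LD calculator, and the function and parameter names are chosen for illustration.

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float) -> tuple[float, float, float]:
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    for p in (p_ab, p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("frequencies must lie between 0 and 1")
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("both loci must be polymorphic")

    # D: observed haplotype frequency minus the frequency expected
    # if the two alleles were inherited independently.
    d = p_ab - p_a * p_b

    # D' rescales |D| by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max

    # r^2: squared correlation between the alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2
```

If A and B always travel together (`p_ab = p_a = p_b = 0.5`), both D' and r² are 1; if they are independent (`p_ab = p_a * p_b`), both are 0.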

Analysis: Linkage disequilibrium

The Linkage Disequilibrium (LD) calculator

Within a genomic region...

For a list of variants...

For a defined area surrounding your variant...

Where can I find this data?

● Website www.ensembl.org

● Variant Effect Predictor (VEP)

● BioMart

● Programmatically:

○ Perl API (including VEP)

○ REST API
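For programmatic access, a variant can be looked up over the REST API with a plain HTTP request. The sketch below assumes the public server at `https://rest.ensembl.org` and its `variation/:species/:id` endpoint; the helper names are made up, and the response layout is not covered in these slides, so inspect the returned dictionary rather than relying on particular fields.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"   # assumed public REST server

def variation_url(variant_id: str, species: str = "human") -> str:
    """Build the lookup URL for one variant, asking for a JSON response."""
    return f"{REST_SERVER}/variation/{species}/{variant_id}?content-type=application/json"

def fetch_variant(variant_id: str, species: str = "human", timeout: float = 30.0) -> dict:
    """Download and decode the record for one variant (needs network access)."""
    with urllib.request.urlopen(variation_url(variant_id, species), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_variant("rs4988235")` would retrieve the variant used in the demonstration; nothing is downloaded until the function is called.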

Ensembl variation process

[Figure: reference assembly built from contigs labelled BL102, AL476, CM553 and IM768; the sequence below comes from contig AL476]

AGTCGTAGCTAGCAAGGCCATAGGCGA

Frequency A = 0.01, frequency G = 0.99
G is the ancestral allele
A causes disease susceptibility

A is allele in the contig used
∴ A is the reference allele
∴ G is the alternate allele
∴ Alleles are A/G

Note: Reference & alternate alleles
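The point of this note is that "reference" only means "the allele in the assembly", not "the common allele". A small sketch using the slide's frequencies makes the distinction explicit; the function and field names are hypothetical.

```python
def summarise_alleles(reference: str, frequencies: dict[str, float]) -> dict:
    """Report the allele string and whether the reference allele is the rarer one.

    reference:   the allele present in the reference assembly
    frequencies: allele -> population frequency
    """
    if reference not in frequencies:
        raise ValueError("reference allele has no frequency")
    alternates = sorted(a for a in frequencies if a != reference)
    minor = min(frequencies, key=frequencies.get)   # least frequent allele
    return {
        "allele_string": "/".join([reference, *alternates]),  # reference allele first
        "minor_allele": minor,
        "reference_is_minor": reference == minor,
    }
```

With `summarise_alleles("A", {"A": 0.01, "G": 0.99})` the allele string is `A/G` even though A is the rare, disease-associated allele.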

Note: Reference & alternate alleles

http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=12:120999079-121000079;v=rs1169305;vdb=variation;vf=829489

AGTCGTAGCTAGCT/GAGGCCATAGGCGA

TCGCCTATGGCCTA/CGCTAGCTACGACT

Exon sequence: TATGGCCTA/CGCTAGC

Alleles in database = T/G
Alleles in gene = A/C

Alleles = A/C -ve strand or T/G +ve strand

Alleles = A/C or T/G
Often lack further info

Note: Allele strand
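Converting alleles between strands is a reverse complement. The sketch below reproduces the slide's example, where T/G on the forward strand reads as A/C on the reverse strand; the function names are illustrative.

```python
# Translation table mapping each base to its complement (case preserved).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence: str) -> str:
    """Return the sequence as read on the opposite strand."""
    return sequence.translate(_COMPLEMENT)[::-1]

def flip_allele_string(allele_string: str) -> str:
    """Convert an allele string such as 'T/G' to the opposite strand."""
    return "/".join(reverse_complement(allele) for allele in allele_string.split("/"))
```

`flip_allele_string("T/G")` returns `"A/C"`. A "-" allele is left unchanged because it has no complement in the table.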

Demonstration

- Finding variants in a gene of interest, MCM6

- Finding variants at a genomic location of interest

- Finding out more information about a specific variant, rs4988235

The Variant Effect Predictor

McLaren et al 2016 europepmc.org/abstract/MED/27268795

Your variant data

What does the VEP do?

• Affected gene, transcript and protein sequence

• Splicing consequences

• Regulatory consequences

• Known variants:

+ Pathogenicity

+ Frequency data

+ Literature citations

A tool to predict and annotate the functional consequences of variants

Variant data input formats

Variant coordinates (Ensembl default)
1 881907 881906 -/C +
5 140532 140532 T/C +
12 1017956 1017956 T/A +
2 946507 946507 G/C +
14 19584687 19584687 C/T -

HGVS notation
ENST00000285667.3:c.1047_1048insC
5:g.140532T>C
NM_153681.2:c.7C>T
ENSP00000439902.1:p.Ala2233Asp
NP_000050.2:p.Ile2285Val

VCF
#CHROM POS ID REF ALT
20 14370 rs6054257 G A
20 17330 . T A
20 1110696 rs6040355 A G,T
20 1230237 . T .

Variant IDs
rs41293501
COSM327779
rs146120136
FANCD1:c.475G>A
rs373400041

http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#input
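The Ensembl default format is whitespace-separated: chromosome, start, end, alleles and strand, with an optional identifier. The sketch below parses one such line. The type and function names are invented for illustration, and accepting `1`/`-1` as well as `+`/`-` for the strand is an assumption. In the first example line the start is one greater than the end, which is how this format marks an insertion.

```python
from typing import NamedTuple, Optional

class VepInputLine(NamedTuple):
    chromosome: str
    start: int
    end: int
    allele_string: str
    strand: int                  # 1 for forward, -1 for reverse
    identifier: Optional[str]

def parse_default_format(line: str) -> VepInputLine:
    """Parse one line of Ensembl default format variant input."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least: chromosome start end alleles strand")
    chromosome, start, end, allele_string, strand = fields[:5]
    if strand not in ("+", "-", "1", "-1"):
        raise ValueError(f"unrecognised strand: {strand}")
    identifier = fields[5] if len(fields) > 5 else None
    return VepInputLine(chromosome, int(start), int(end), allele_string,
                        1 if strand in ("+", "1") else -1, identifier)

def is_insertion(variant: VepInputLine) -> bool:
    """Insertions are written with start = end + 1."""
    return variant.start == variant.end + 1
```

For example, `is_insertion(parse_default_format("1 881907 881906 -/C +"))` is true, while the `T/C` line on chromosome 5 is an ordinary single-base change.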

Are your variants already known?
○ dbSNP
○ COSMIC
○ ClinVar
○ ESP
○ HGMD-Public
○ Phencode

How common are your variant alleles in different populations?
○ 1000 Genomes
○ ESP
○ ExAC projects
○ gnomAD

Phenotype/disease, clinical significance
○ OMIM
○ Orphanet
○ GWAS catalog
○ ClinVar

VEP features: finding known variants

Consequence predictions (choose multiple databases)
○ Ensembl
○ RefSeq
○ Merged
○ GENCODE basic

Does your variant overlap regulatory regions?
○ ENCODE
○ BLUEPRINT
○ NIH Epigenomics Roadmap
○ Can be limited to regulatory regions observed in specific cell types.

Pathogenicity predictions
○ SIFT
○ PolyPhen
○ via plugins: CADD, FATHMM, LRT, MutationTaster, and many more!

VEP features: consequence prediction

Plugin info: http://www.ensembl.info/ecode/category/vep-plugins/

VEP features: plugins

● Plugins add extra functionality to the VEP

● They may extend, filter or manipulate the output of the VEP.

● Plugins may make use of external data or code.

● Available on the web tool and with the script.

Use VEP with any species

http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html

● Access through the web browser, REST API or Perl API

● Use prebuilt caches for Ensembl species.

...and for all species in

● Speed up your VEP script with an offline cache.

● Or make your own from GTF and FASTA files - even for genomes not in Ensembl.
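The two ways of running the script, with a prebuilt offline cache or with your own GTF and FASTA files, might be assembled like this. It is a sketch only: the `./vep` path, the file names and the helper function are assumptions, and you should check the option names against the VEP documentation for your installed version.

```python
from typing import Optional

def vep_command(input_file: str, output_file: str, species: str = "homo_sapiens",
                gtf: Optional[str] = None, fasta: Optional[str] = None,
                executable: str = "./vep") -> list[str]:
    """Build an argument list for running the VEP script."""
    command = [executable, "--input_file", input_file, "--output_file", output_file]
    if gtf is not None:
        # Custom annotation: gene models from a GTF plus the genome sequence.
        if fasta is None:
            raise ValueError("a FASTA file is needed alongside a custom GTF")
        command += ["--gtf", gtf, "--fasta", fasta]
    else:
        # Prebuilt cache, used without connecting to the Ensembl databases.
        command += ["--cache", "--offline", "--species", species]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` once the cache, or the GTF and FASTA files, are in place.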

Using VEP

ensembl.org/info/docs/tools/vep/index.html

We have identified four variants on human chromosome nine:
- A deletion at 128328461
- C->A at 128322349
- C->G at 128323079
- G->A at 128322917

We will use the Ensembl VEP to find out:
- Are any of my variants already known?
- What genes are affected by my variants?
- Do any of my variants affect gene regulation?
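To paste the demonstration variants into the VEP, they first need to be written in one of the accepted input formats. The sketch below writes the three single-base changes in Ensembl default format, assuming that "C->A" means reference C and alternate A on the forward strand. The deletion is left out because the slide does not say which base or bases are deleted.

```python
def snv_input_line(chromosome: str, position: int, ref: str, alt: str) -> str:
    """Format one single-base change as 'chromosome start end ref/alt strand'."""
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("only single-base changes are handled here")
    # For a single-base change, start and end are the same position.
    return f"{chromosome} {position} {position} {ref}/{alt} +"

# The three single-base changes from the demonstration (chromosome 9).
DEMO_SNVS = [
    (128322349, "C", "A"),
    (128323079, "C", "G"),
    (128322917, "G", "A"),
]

demo_lines = [snv_input_line("9", pos, ref, alt) for pos, ref, alt in DEMO_SNVS]
```

Joining `demo_lines` with newlines gives text that can be pasted into the input box of the web tool.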

Demonstration

This webinar course

Date | Webinar topic | Instructor
4th Sept | Introduction to Ensembl ✔ | Astrid Gall
4th Sept | Ensembl genes ✔ | Emily Perry
6th Sept | Variation data in Ensembl and the Ensembl VEP ✔ | Erin Haskell
6th Sept | Comparing genes and genomes with Ensembl Compara | Astrid Gall
11th Sept | Finding features that regulate genes – the Ensembl Regulatory Build | Emily Perry
11th Sept | Data export with BioMart | Erin Haskell
13th Sept | Uploading your data to Ensembl | Astrid Gall
13th Sept | Introduction to the Ensembl REST APIs | Emily Perry

Coming up!

Comparing genes and genomes with Ensembl Compara

Ensembl Compara allows you to perform detailed analysis of gene models between species.

During this webinar we take a look at the gene trees and homologues of a set of genes, and at whole genome alignments between pairs and groups of species.

Starting in ∼5 minutes! Astrid Gall
