Understanding proteins: resources for identification and annotation.

Post on 11-Jan-2016



The Gene Ontology: Annotating protein function, role and localization

Contact:

Jane Lomax

Coordinator, GO Editorial Office

EBI-EMBL

jane@ebi.ac.uk

What is an ontology?


→ Collectibles & art

→ Stamps

→ UK (Great Britain)

→ Victoria

→ 1884 GREAT BRITAIN 10S SCOTT (11,999.99$)

A definition...

“A controlled representation of ideas, concepts or events in a given domain and the relationships between them.”

Why do we need ontologies?

Help with data retrieval: allow grouping of annotations

brain 20

hindbrain 15

rhombomere 10

Adapted from Barry Smith: http://ontology.buffalo.edu/smith/BioOntology_Course.html

Query ‘brain’ without ontology: 20

Query ‘brain’ with ontology: 45
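The arithmetic behind the two query results can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the counts come from the slide, while the nesting (rhombomere under hindbrain under brain) and all function names are assumptions made for the example.

```python
# Direct annotation counts, taken from the slide.
direct_counts = {"brain": 20, "hindbrain": 15, "rhombomere": 10}

# Child -> parent links. The nesting is an assumption read off the
# slide's 20 + 15 + 10 = 45 total.
parent_of = {"hindbrain": "brain", "rhombomere": "hindbrain"}


def descendants(term):
    """Return every term that sits below `term` in the hierarchy."""
    found = set()
    for child, parent in parent_of.items():
        if parent == term:
            found.add(child)
            found |= descendants(child)
    return found


def query_without_ontology(term):
    """Count only annotations made directly to the term."""
    return direct_counts.get(term, 0)


def query_with_ontology(term):
    """Count annotations to the term and to everything below it."""
    terms = {term} | descendants(term)
    return sum(direct_counts.get(t, 0) for t in terms)


print(query_without_ontology("brain"))  # 20
print(query_with_ontology("brain"))     # 45
```

Grouping works because the query expands "brain" into its more specific terms before counting.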

Make data (re-)usable through standards

• Common structure and terminology (controlled vocabulary)

• Avoid redundancies (single data source)

• Allow common tools, techniques, training, validation...

Gene ontology

What is the gene ontology?

An organized, controlled vocabulary of terms that describe the characteristics of gene products.

http://geneontology.org/

• Represents gene product properties, not gene products themselves

• Three branches (domains): Cellular component, Molecular function, Biological process

• Species-independent (with taxonomic restrictions)

• Represents physiological processes

• Goes up to the level of the cell

The Gene Ontology is like a dictionary

term: transcription initiation

definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.

id: GO:0006352
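A dictionary entry like this maps naturally onto a tag-value record. The sketch below assumes an OBO-style stanza (`id:`, `name:`, `def:` tags) and parses it into a Python dict. The stanza text reuses the term, definition and id from the slide; the simplified layout (no cross-reference list after the definition) and the function name are assumptions for illustration.

```python
# A simplified OBO-style stanza built from the slide's term.
# Real ontology files carry more tags; this layout is an assumption.
stanza = """\
[Term]
id: GO:0006352
name: transcription initiation
def: "Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter."
"""


def parse_stanza(text):
    """Turn the tag: value lines of one stanza into a dict."""
    term = {}
    for line in text.splitlines():
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        term[tag] = value.strip('"')
    return term


term = parse_stanza(stanza)
print(term["id"], "=", term["name"])
```

The stable id is what annotations refer to, so a term name can be reworded without breaking existing annotations.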

How does GO work?

Clark et al., 2005

part_of

is_a
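The two relationship types can be treated as labelled edges between terms. The following is a minimal sketch of walking those edges upwards to collect a term's ancestors. The three edges are illustrative only and are not copied from a GO release; the function name is likewise hypothetical.

```python
# (child, relationship, parent) triples.
# Illustrative edges only, not taken from an actual GO release.
edges = [
    ("nucleolus", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
    ("organelle", "part_of", "cell"),
]


def ancestors(term):
    """Collect every term reachable by following is_a / part_of links upwards."""
    result = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, _relation, parent in edges:
            if child == current and parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


print(sorted(ancestors("nucleolus")))  # ['cell', 'nucleus', 'organelle']
```

Because the walk follows every parent link it finds, it also copes with a term that has more than one parent.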

GO tree and annotations

GO terms for Caspase 9

An annotation example…

[Figure: selected gene tree (pearson; gene list: all genes, 14010) showing expression over time in attacked and control samples. Clusters are labelled with GO terms:]

Puparial adhesion, Molting cycle, hemocyanin

Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes

Immune response, Toll regulated genes

Amino acid catabolism, Lipid metabolism

Peptidase activity, Protein catabolism, Immune response

Which processes are up- or down-regulated?

Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.

QuickGO: browsing GO

Term definition

http://www.ebi.ac.uk/QuickGO/


Term relationships (ancestors)


Term relationships (children)


Proteins annotated to term

Annotation and ontology files

www.geneontology.org/GO.downloads.shtml

Ontology files:

• Hold ontology terms and structure

• Species-independent

• You can get GO-slims

Annotation files:

• Hold list of terms and the proteins annotated with them

• You can get species-specific files or the whole annotation.
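To make the idea of an annotation file concrete, here is a minimal sketch that groups proteins by GO term. It assumes the tab-separated GO Annotation File (GAF) layout, in which column 2 holds the gene product identifier and column 5 the GO id, and lines starting with "!" are comments; the slides do not name the format, so this is an assumption. The sample rows and accessions (X00001 etc.) are made up for illustration.

```python
from collections import defaultdict


def make_row(accession, go_id):
    """Build one hypothetical 17-column GAF row (unused columns left empty)."""
    return "\t".join(["UniProtKB", accession, "", "", go_id] + [""] * 12)


# Made-up sample file: a comment line plus three annotation rows.
sample = "\n".join([
    "!gaf-version: 2.0",
    make_row("X00001", "GO:0006352"),
    make_row("X00002", "GO:0006352"),
    make_row("X00001", "GO:0006915"),
])


def proteins_by_term(gaf_text):
    """Map each GO id to the sorted list of proteins annotated with it."""
    grouped = defaultdict(set)
    for line in gaf_text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        cols = line.split("\t")
        grouped[cols[4]].add(cols[1])  # column 5 = GO id, column 2 = protein
    return {term: sorted(ids) for term, ids in grouped.items()}


for go_id, proteins in proteins_by_term(sample).items():
    print(go_id, proteins)
```

The same loop works on a species-specific file or on the whole annotation set; only the input size changes.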

More about GO: EBI train online

www.ebi.ac.uk/training/online/course/go-quick-tour

www.ebi.ac.uk/training/online/course/uniprot-goa-quick-tour

Acknowledgements & questions

Jane Lomax

Coordinator, GO Editorial Office

EBI-EMBL

jane@ebi.ac.uk

UniProt: A repository of annotated protein sequences

Contact:

Duncan Legge

UniProt Content Team

EBI-EMBL

help@uniprot.org

dlegge@ebi.ac.uk

Background of UniProt

Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD

Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database

We Aim To Provide…

o A high quality protein sequence database: a non-redundant protein database, with maximal coverage including splice isoforms, disease variants and PTMs. Sequence archiving is essential.

o Easy protein identification: stable identifiers and consistent nomenclature / controlled vocabularies

o Thorough protein annotation: detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources

The Two Sides of UniProtKB

UniProtKB/Swiss-Prot: reviewed. Non-redundant, high-quality manual annotation; 1 entry per protein; manually annotated.

UniProtKB/TrEMBL: unreviewed. Redundant, automatically annotated; 1 entry per nucleotide submission; computationally annotated.
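In practice the two sides can be told apart from the FASTA header of a downloaded sequence: UniProtKB headers start with `sp|` for Swiss-Prot entries and `tr|` for TrEMBL entries. The sketch below relies on that convention; the accessions and entry names in the example headers are hypothetical, and the function name is made up for illustration.

```python
def review_status(fasta_header):
    """Classify a UniProtKB FASTA header by its database prefix."""
    prefix = fasta_header.lstrip(">").split("|", 1)[0]
    labels = {
        "sp": "reviewed (Swiss-Prot)",
        "tr": "unreviewed (TrEMBL)",
    }
    return labels.get(prefix, "unknown")


# Hypothetical headers, written in the UniProtKB FASTA style.
headers = [
    ">sp|X00001|EXMPL_HUMAN Example protein OS=Homo sapiens",
    ">tr|X00002|X00002_HUMAN Uncharacterized protein OS=Homo sapiens",
]

for header in headers:
    print(review_status(header), "<-", header)
```

Filtering on the prefix is a quick way to restrict an analysis to manually annotated entries.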

Data sources of UniProtKB

[Figure: data sources shown around UniProt/TrEMBL]

• ENA (EMBL) DNA database

• Ensembl

• VEGA (Sanger)

• WormBase

• FlyBase

• PDB

• Patent Data

• mRNA Data

• Sub/Peptide Data

Curation of a UniProt/SwissProt entry

[Figure: a UniProt/TrEMBL entry is curated into a UniProt/SwissProt entry. Aspects covered:]

• Sequence

• Sequence variants

• Nomenclature

• Sequence features

• Ontologies

• Literature annotations

• References

UniProt Website

www.uniprot.org

UniProt layout

Annotation comments

FUNCTION, SUBCELLULAR LOCATION, ALTERNATIVE PRODUCTS, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, INDUCTION, SIMILARITY, CATALYTIC ACTIVITY, COFACTOR, ENZYME REGULATION, BIOPHYSICOCHEMICAL PROPERTIES, PATHWAY, SUBUNIT, INTERACTION, PTM, RNA EDITING, MASS SPECTROMETRY, DOMAIN, POLYMORPHISM, DISRUPTION PHENOTYPE, ALLERGEN, DISEASE, TOXIC DOSE, BIOTECHNOLOGY, PHARMACEUTICAL, MISCELLANEOUS, CAUTION, SEQUENCE CAUTION, WEB RESOURCE

Controlled vocabularies used whenever possible

Evidence tags to show source


Proteomes in UniProt

Complete proteomes: complete sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced.

Reference proteomes: some complete proteomes have been selected as reference proteome sets. These cover the proteomes of well-studied model organisms and other proteomes of interest for biomedical research.

Obtaining Proteomes

Help / Feedback

• Stuck? Just ask: active help and support team

• Feedback: if you find something incorrect, outdated, missing etc., please tell us.

help@uniprot.org

www.ebi.ac.uk/training/online/course/uniprot-quick-tour/

Find out more: EBI online courses

Acknowledgements & questions

Duncan Legge

UniProt Content Team

EBI-EMBL

dlegge@ebi.ac.uk

InterPro: An integrated protein sequence analysis resource

Contact:

Amaia Sangrador

InterPro curation team

EBI-EMBL

interhelp@ebi.ac.uk

amaia@ebi.ac.uk

What is InterPro?

• InterPro is a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites

• It combines predictive models (known as signatures) from different databases to provide functional analysis of protein sequences by classifying them into families and predicting domains and important sites

The aim of InterPro

InterPro

Protein annotation: a predictive approach

• This is the approach taken by protein signature databases

• Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment

• We can use these models to infer relationships with the characterised sequences from which the alignment was constructed
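The simplest such model is a pattern, which can be written as a short expression and matched like a regular expression. The sketch below translates a PROSITE-style pattern into a Python regex and scans a sequence with it. The pattern shown, N-{P}-[ST]-{P}, is the PROSITE N-glycosylation site pattern; the query sequence is made up, and the converter covers only the basic syntax (x, [..], {..}, repeat counts and anchors), so treat it as an illustration and not as a complete implementation.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                        # element separators
    regex = regex.replace("x", ".")                       # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")     # x(2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")     # sequence anchors
    return regex


def find_matches(pattern, sequence):
    """Return (1-based start, matched residues) for each non-overlapping hit."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]


# Hypothetical query sequence; the second N is followed by P and is skipped.
sequence = "MKNGSAANPTQ"
print(prosite_to_regex("N-{P}-[ST]-{P}"))        # N[^P][ST][^P]
print(find_matches("N-{P}-[ST]-{P}", sequence))  # [(3, 'NGSA')]
```

A hit only says that the sequence fits the model; whether the site is functional still has to be judged from other evidence.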

Three (4) different protein signature approaches

Single motif methods: Patterns

Multiple motif methods: Fingerprints

Full alignment methods: Profiles & Hidden Markov models (HMMs)

[Figure: signature methods and what they are used for]

Structural domains

Functional annotation of families/domains

Protein features (sites)

Hidden Markov Models, Fingerprints, Profiles, Patterns

HAMAP

InterPro Consortium

• Signatures are provided by member databases

• They are scanned against the UniProt database to see which sequences they match

• Curators manually inspect the matches before integrating the signatures into InterPro

InterPro signature integration process

Signatures representing the same entity are integrated together

Relationships between entries are traced, where possible

Curators add literature-referenced abstracts, cross-references to other databases, and GO terms

http://www.ebi.ac.uk/interpro/

Search using the keyword: CD4

Let’s find some information about T-cell surface antigen CD4 in InterPro

Using InterPro

Results from the “CD4” key word search

[Screenshot labels: Type, Name, Identifier, Contributing signatures, Description, GO terms, References]

Family-centered view

Search using human CD4 protein sequence

Using InterPro

[Screenshot labels: Type, Name, Identifier, Domains, Family]

Protein-centered view

[Screenshot labels: Type, Name, Identifier, Contributing signatures, Description, References]

Domain-centered view

Using InterPro with unknown sequences: InterProScan

Search with unknown protein sequence

InterProScan is the software package that allows sequences to be scanned against InterPro's signatures

InterPro entries and contributing signatures

Unintegrated signatures

(not reviewed)
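The split between integrated and unintegrated signatures can be recovered from scan results with a few lines of code. This sketch assumes the tab-separated output layout of InterProScan 5, where column 5 holds the signature accession and column 12, when present, the InterPro entry accession; that layout is an assumption and is not stated on the slides. All accessions and rows are hypothetical.

```python
# Two made-up result rows in the assumed tab-separated layout.
# The second row has no InterPro entry column, i.e. it is unintegrated.
rows = [
    ["query1", "md5", "100", "Pfam", "PF00000", "hypothetical domain",
     "10", "90", "1e-10", "T", "01-01-2012", "IPR000000", "hypothetical entry"],
    ["query1", "md5", "100", "PRINTS", "PR00000", "hypothetical fingerprint",
     "20", "40", "1e-5", "T", "01-01-2012"],
]
sample_tsv = "\n".join("\t".join(row) for row in rows)


def split_matches(tsv_text):
    """Separate signature matches into integrated and unintegrated ones."""
    integrated, unintegrated = [], []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        signature = cols[4]
        # A missing, empty or "-" entry column is treated as unintegrated.
        entry = cols[11] if len(cols) > 11 else ""
        if entry in ("", "-"):
            unintegrated.append(signature)
        else:
            integrated.append((signature, entry))
    return integrated, unintegrated


integrated, unintegrated = split_matches(sample_tsv)
print("integrated:", integrated)
print("unintegrated:", unintegrated)
```

Unintegrated matches have not been reviewed by curators, so they deserve extra scrutiny before being used for annotation.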

InterPro usage

within the EBI

• Used by UniProtKB curators in their annotation of Swiss-Prot proteins

• Forms part of the automated system that adds annotation to UniProtKB/TrEMBL

• Provides matches to over 80% of UniProtKB

• Source of >60 million Gene Ontology (GO) mappings to >17 million distinct UniProtKB sequences

outside the EBI

• 50,000 unique visitors to the web site per month

• >2 million sequences searched online per month

• Plus offline searches with downloadable version

Remember!

• Probabilistic models != biological certainty

• We are using biologically-unaware search tools and probabilistic models

• Ask questions, weigh the evidence

Caveats

• Sheer amount of data can be overwhelming

• Member databases do not always agree!

• InterPro entries are based on signatures supplied to us by our member databases... this means no signature, no entry!

We need your feedback: missing/additional references, reporting problems, requests

interhelp@ebi.ac.uk

www.ebi.ac.uk/training/online/course-list/introduction-protein-classification-ebi

www.ebi.ac.uk/training/online/course/interpro-quick-tour

www.ebi.ac.uk/training/online/course/interpro-functional-and-structural-analysis-protei

Find out more: EBI online courses

Acknowledgements & questions

Amaia Sangrador

InterPro curation team

EBI-EMBL

amaia@ebi.ac.uk

PDBe: Protein Data Bank in Europe

Contact:

Gary Battle

Project Leader Outreach

PDBe

battle@ebi.ac.uk

http://www.facebook.com/proteindatabank

http://twitter.com/PDBeurope

PDBe overview

• Mission: Bringing Structure to Biology

• Major activities:

• Deposition and annotation site for structural data on biomacromolecules (X-ray, NMR, EM)

• Integration of macromolecular structure data with important biological and chemical data resources

• Provide tools and services for accessing, exploiting and disseminating structural data to the wider biomedical community

Worldwide Protein Data Bank (wwPDB)

PDBeXplore: Browse the PDB using familiar classification systems (enzymes, folds, families, compounds, taxonomy, sequence).

Latest structures: pdbe.org/pdbexplore

PDBePISA: Exploration of macromolecular (protein, DNA/RNA and ligand) interfaces and prediction of probable quaternary structures.

Predict quaternary structure: pdbe.org/pisa

PDBeFold: Interactive comparison, alignment and superposition based on protein secondary structure.

Find similar structures: pdbe.org/fold

PDBeMotif: Flexible 3D search and analysis of protein-ligand interactions, binding environments and structural motifs.

Analyse binding sites and motifs: pdbe.org/motif

NMR resources and services: Visualisation and validation of NMR models and data.

NMR resources: pdbe.org/nmr

EM resources and services: Comprehensive search and analysis tools for EMDB entries.

EM resources: pdbe.org/em

Electron Microscopy Data Bank (EMDB)

• Global public repository for EM density maps of macromolecular complexes and subcellular structures

• Founded at EBI in 2002

• Jointly operated by PDBe, RCSB and NCMI

• PDBe EM portal provides advanced search, visualisation and analysis services.

http://pdbe.org/emdb

Educational resources: Quips

Interactive exploration of interesting structures from the PDB

Quite interesting PDB structures: pdbe.org/quips

Stay informed…

http://www.facebook.com/proteindatabank

http://twitter.com/PDBeurope

www.ebi.ac.uk/training/online/course/pdbe-quick-tour/

Find out more: EBI online courses

Acknowledgements & questions

Gary Battle

EBI-EMBL

battle@ebi.ac.uk