Understanding proteins: resources for identification and annotation.


Page 2: Understanding proteins: resources for identification and annotation.

The Gene Ontology: Annotating protein function, role and localization

Contact:

Jane Lomax

Coordinator, GO Editorial Office

EBI-EMBL

[email protected]

Page 3: Understanding proteins: resources for identification and annotation.

What is an ontology?

Page 4: Understanding proteins: resources for identification and annotation.

What is an ontology?

→ Collectibles & art

→ Stamps

→ UK (Great Britain)

→ Victoria

→ 1884 GREAT BRITAIN 10S SCOTT (11,999.99$)

A definition...

“A controlled representation of ideas, concepts or events in a given domain and the relationships between them.”

Page 5: Understanding proteins: resources for identification and annotation.

Why do we need ontologies?

Help with data retrieval: allow grouping of annotations

brain 20

hindbrain 15

rhombomere 10

Adapted from Barry Smith: http://ontology.buffalo.edu/smith/BioOntology_Course.html

Query ‘brain’ without ontology: 20

Query ‘brain’ with ontology: 45
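The brain example can be reproduced with a few lines of Python. This is a minimal sketch that assumes rhombomere sits under hindbrain and hindbrain under brain; the dictionary and function names are made up for illustration, and only the counts come from the slide.

```python
# Toy hierarchy assumed from the slide: rhombomere -> hindbrain -> brain.
PARENT = {"hindbrain": "brain", "rhombomere": "hindbrain"}

# Number of annotations attached directly to each term (counts from the slide).
DIRECT = {"brain": 20, "hindbrain": 15, "rhombomere": 10}


def ancestors(term):
    """Yield every term above `term` in the hierarchy."""
    while term in PARENT:
        term = PARENT[term]
        yield term


def count_with_ontology(query):
    """Count annotations on `query` plus all of its descendants."""
    return sum(n for term, n in DIRECT.items()
               if term == query or query in ancestors(term))


print(DIRECT["brain"])               # without ontology: 20
print(count_with_ontology("brain"))  # with ontology: 45
```

Without the hierarchy, a query only sees annotations made directly to the query term; with it, annotations to more specific terms are grouped under their parents.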

Make data (re-)usable through standards

• Common structure and terminology (controlled vocabulary)

• Avoid redundancies (single data source)

• Allow common tools, techniques, training, validation...

Page 6: Understanding proteins: resources for identification and annotation.

Gene ontology

What is the gene ontology?

Organized, controlled vocabulary of terms that describe the characteristics of gene products.

http://geneontology.org/

• Represents gene product properties, not gene products themselves

• Three branches (domains): Cellular component, Molecular function, Biological process

• Species-independent (with taxonomic restrictions)

• Represents physiological processes

• Goes up to the level of the cell

Page 7: Understanding proteins: resources for identification and annotation.

The Gene Ontology is like a dictionary

term: transcription initiation

definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.

id: GO:0006352
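The dictionary analogy maps naturally onto a small record type. The sketch below stores the slide's example in an OBO-style stanza and parses it; the `GOTerm` class and `parse_term` function are hypothetical names, and the stanza is simplified (real ontology files carry further tags such as namespace and relationships).

```python
from dataclasses import dataclass


@dataclass
class GOTerm:
    id: str
    name: str
    definition: str


# A simplified OBO-style stanza built from the slide's example.
STANZA = '''[Term]
id: GO:0006352
name: transcription initiation
def: "Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter."
'''


def parse_term(stanza):
    """Parse one [Term] stanza made of 'tag: value' lines."""
    fields = {}
    for line in stanza.splitlines():
        if ": " in line:
            tag, value = line.split(": ", 1)
            fields[tag] = value.strip('"')
    return GOTerm(fields["id"], fields["name"], fields["def"])


term = parse_term(STANZA)
print(term.id, "->", term.name)
```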

How does GO work?

Page 8: Understanding proteins: resources for identification and annotation.

GO tree and annotations

[Figure from Clark et al., 2005: GO terms linked by is_a and part_of relationships]
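One way to picture how is_a and part_of links work is to walk them upwards from a specific term. The following is a sketch only: the terms and edges are a simplified illustration, not a copy of the real GO graph, and `EDGES` and `all_ancestors` are made-up names.

```python
# Each term maps to a list of (relation, parent) edges. The terms and edges
# are a simplified illustration, not a copy of the real GO graph.
EDGES = {
    "mitochondrial inner membrane": [("part_of", "mitochondrion")],
    "mitochondrion": [("is_a", "organelle"), ("part_of", "cytoplasm")],
    "organelle": [("is_a", "cellular component")],
    "cytoplasm": [("part_of", "cell")],
    "cell": [("is_a", "cellular component")],
}


def all_ancestors(term):
    """Return every term reachable from `term` via is_a or part_of edges."""
    seen = set()
    stack = [term]
    while stack:
        for _relation, parent in EDGES.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# A protein annotated to the most specific term is implicitly annotated
# to all of that term's ancestors as well.
print(sorted(all_ancestors("mitochondrial inner membrane")))
```

Because a term can have more than one parent, the structure is a graph rather than a strict tree, which is why the traversal keeps a `seen` set.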

Page 9: Understanding proteins: resources for identification and annotation.

GO terms for Caspase 9

An annotation example…

Page 10: Understanding proteins: resources for identification and annotation.

[Figure: clustered gene tree (Pearson correlation) for all genes (14010), comparing attacked and control samples over time. Cluster labels:]

• Puparial adhesion, molting cycle, hemocyanin

• Defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes

• Immune response, Toll regulated genes

• Amino acid catabolism, lipid metabolism

• Peptidase activity, protein catabolism, immune response

Which processes are up- or down-regulated?

Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
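The slide does not say how the clusters were labelled. One common way to ask whether a GO term is over-represented in a gene cluster is a hypergeometric test, sketched below. Only the population size (14010 genes) comes from the slide; the other numbers and the function name are invented for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) under the hypergeometric distribution.

    population: genes measured in total
    annotated:  genes in the population carrying the GO term
    sample:     genes in the cluster of interest
    hits:       genes in the cluster carrying the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total


# Hypothetical cluster: 200 genes, 15 of them annotated to a term that
# 300 of the 14010 genes carry overall.
p = enrichment_p_value(14010, 300, 200, 15)
print(f"p = {p:.2e}")
```

In practice many terms are tested at once, so p-values would normally be corrected for multiple testing before drawing conclusions.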

Page 11: Understanding proteins: resources for identification and annotation.

QuickGO: browsing GO

Term definition

http://www.ebi.ac.uk/QuickGO/

Page 12: Understanding proteins: resources for identification and annotation.

QuickGO: browsing GO

Term relationships (ancestors)

Page 13: Understanding proteins: resources for identification and annotation.

QuickGO: browsing GO

Term relationships (children)

Page 14: Understanding proteins: resources for identification and annotation.

QuickGO: browsing GO

Proteins annotated to term

Page 15: Understanding proteins: resources for identification and annotation.

Annotation and ontology files

www.geneontology.org/GO.downloads.shtml

Ontology files:

• Hold ontology terms and structure

• Species-independent

• You can get GO-slims

Annotation files:

• Hold list of terms and the proteins annotated with them

• You can get species-specific files or the whole annotation.
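To make the annotation-file idea concrete, here is a sketch that reads rows in the tab-separated GAF 2.x layout (17 columns, with the object ID in column 2 and the GO ID in column 5) and indexes proteins by term. The rows are made up: the accessions, reference and date are placeholders, not real annotations, and the helper names are hypothetical.

```python
import io
from collections import defaultdict


def gaf_row(object_id, symbol, qualifier, go_id, aspect):
    """Build one illustrative 17-column GAF row (placeholder values)."""
    return "\t".join([
        "UniProtKB", object_id, symbol, qualifier, go_id,
        "PMID:0000000", "IDA", "", aspect, "", "", "protein",
        "taxon:9606", "20200101", "ExampleDB", "", "",
    ])


GAF_TEXT = "\n".join([
    "!gaf-version: 2.2",
    gaf_row("X00001", "GENE1", "involved_in", "GO:0006352", "P"),
    gaf_row("X00002", "GENE2", "involved_in", "GO:0006352", "P"),
    gaf_row("X00002", "GENE2", "located_in", "GO:0005634", "C"),
]) + "\n"


def proteins_by_term(handle):
    """Map each GO ID to the set of object IDs annotated with it."""
    index = defaultdict(set)
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        columns = line.rstrip("\n").split("\t")
        index[columns[4]].add(columns[1])
    return index


index = proteins_by_term(io.StringIO(GAF_TEXT))
for go_id in sorted(index):
    print(go_id, sorted(index[go_id]))
```

The same function should work on a downloaded file by passing an open file handle instead of the `io.StringIO` wrapper.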

Page 16: Understanding proteins: resources for identification and annotation.

More about GO: EBI train online

www.ebi.ac.uk/training/online/course/go-quick-tour

www.ebi.ac.uk/training/online/course/uniprot-goa-quick-tour

Page 17: Understanding proteins: resources for identification and annotation.

Acknowledgements & questions

Jane Lomax

Coordinator, GO Editorial Office

EBI-EMBL

[email protected]

Page 18: Understanding proteins: resources for identification and annotation.

UniProt: A repository of annotated protein sequences

Contact:

Duncan Legge

UniProt Content Team

EBI-EMBL

[email protected]

[email protected]

Page 19: Understanding proteins: resources for identification and annotation.

Background of UniProt

Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD

Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database

Page 20: Understanding proteins: resources for identification and annotation.

We Aim To Provide…

o A high quality protein sequence database: a non-redundant protein database with maximal coverage, including splice isoforms, disease variants and PTMs. Sequence archiving is essential.

o Easy protein identification: stable identifiers and consistent nomenclature / controlled vocabularies

o Thorough protein annotation: detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources

Page 21: Understanding proteins: resources for identification and annotation.

The Two Sides of UniProtKB

UniProtKB/Swiss-Prot: 1 entry per protein. Non-redundant, high-quality manual annotation (reviewed).

UniProtKB/TrEMBL: 1 entry per nucleotide submission. Redundant, automatically annotated (unreviewed).
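In UniProtKB FASTA downloads, the two sides can be told apart from the header prefix: `sp` for Swiss-Prot and `tr` for TrEMBL. The sketch below parses that prefix; the accessions and entry names in the example headers are placeholders invented for illustration, and `classify` is a hypothetical helper.

```python
import re

# >db|Accession|EntryName ... where db is "sp" (Swiss-Prot) or "tr" (TrEMBL).
HEADER = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)")


def classify(header):
    """Return (accession, entry name, review status) for a FASTA header."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")
    db, accession, entry_name = match.groups()
    status = "reviewed (Swiss-Prot)" if db == "sp" else "unreviewed (TrEMBL)"
    return accession, entry_name, status


# Placeholder headers, not real entries.
print(classify(">sp|X00001|EXMP1_HUMAN Example protein OS=Homo sapiens OX=9606"))
print(classify(">tr|X0X0X0|X0X0X0_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606"))
```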

Page 22: Understanding proteins: resources for identification and annotation.

UniProtKB/Swiss-Prot: manually annotated

UniProtKB/TrEMBL: computationally annotated

Page 23: Understanding proteins: resources for identification and annotation.

Data sources of UniProtKB

[Figure: sources feeding into UniProt/TrEMBL]

• ENA (EMBL) DNA database

• Ensembl

• VEGA (Sanger)

• WormBase

• FlyBase

• PDB

• Sub/Peptide Data

• Patent Data

• mRNA Data

Page 24: Understanding proteins: resources for identification and annotation.

Curation of a UniProt/SwissProt entry

[Figure: an entry moves from UniProt/TrEMBL to UniProt/SwissProt through curation of:]

• Sequence

• Sequence variants

• Nomenclature

• Sequence features

• Ontologies

• Literature

• Annotations

• References

Page 25: Understanding proteins: resources for identification and annotation.

UniProt Website

www.uniprot.org

Page 26: Understanding proteins: resources for identification and annotation.

UniProt layout

Page 28: Understanding proteins: resources for identification and annotation.

Annotation comments

FUNCTION, SUBCELLULAR LOCATION, ALTERNATIVE PRODUCTS, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, INDUCTION, SIMILARITY, CATALYTIC ACTIVITY, COFACTOR, ENZYME REGULATION, BIOPHYSICOCHEMICAL PROPERTIES, PATHWAY, SUBUNIT, INTERACTION

PTM, RNA EDITING, MASS SPECTROMETRY, DOMAIN, POLYMORPHISM, DISRUPTION PHENOTYPE, ALLERGEN, DISEASE, TOXIC DOSE, BIOTECHNOLOGY, PHARMACEUTICAL, MISCELLANEOUS, CAUTION, SEQUENCE CAUTION, WEB RESOURCE

Page 29: Understanding proteins: resources for identification and annotation.

Controlled vocabularies used whenever possible

Evidence tags to show source


Page 31: Understanding proteins: resources for identification and annotation.

Proteomes in UniProt

Complete proteomes: complete sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced.

Reference proteomes: some complete proteomes have been selected as reference proteome sets. These cover the proteomes of well-studied model organisms and other proteomes of interest for biomedical research.

Page 32: Understanding proteins: resources for identification and annotation.

Obtaining Proteomes

Page 33: Understanding proteins: resources for identification and annotation.

Help / Feedback

• Stuck? Just ask – active help and support team

• Feedback – if you find something incorrect, outdated, missing, etc., please tell us.

[email protected]

Page 34: Understanding proteins: resources for identification and annotation.

www.ebi.ac.uk/training/online/course/uniprot-quick-tour/

Find out more: EBI online courses

Page 35: Understanding proteins: resources for identification and annotation.

Acknowledgements & questions

Duncan Legge

UniProt Content Team

EBI-EMBL

[email protected]

Page 36: Understanding proteins: resources for identification and annotation.

InterPro: An integrated protein sequence analysis resource

Contact:

Amaia Sangrador

InterPro curation Team

EBI-EMBL

[email protected]

[email protected]

Page 37: Understanding proteins: resources for identification and annotation.

What is InterPro?

• InterPro is a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites

• It combines predictive models (known as signatures) from different databases to provide functional analysis of protein sequences by classifying them into families and predicting domains and important sites

Page 38: Understanding proteins: resources for identification and annotation.

The aim of InterPro

InterPro

Page 39: Understanding proteins: resources for identification and annotation.

Protein annotation: a predictive approach

• This is the approach taken by protein signature databases

• Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment

• We can use these models to infer relationships with the characterised sequences from which the alignment was constructed
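As a rough illustration of modelling conserved positions in an alignment, the sketch below builds per-column log-odds scores from a toy ungapped alignment and scores new sequences against them. The sequences are invented, the background is assumed uniform, and the member databases' real profile and HMM methods are considerably more sophisticated.

```python
from collections import Counter
from math import log2

# Toy ungapped alignment of a hypothetical motif (invented sequences).
ALIGNMENT = ["CAHC", "CGHC", "CAHC", "CSHC"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1 / len(ALPHABET)  # assumption: uniform background frequencies


def build_profile(alignment, pseudocount=1.0):
    """Return per-column log-odds scores for every amino acid."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: log2((counts[aa] + pseudocount) / total / BACKGROUND)
            for aa in ALPHABET
        })
    return profile


def score(profile, sequence):
    """Sum the column scores for a sequence of the same length."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


profile = build_profile(ALIGNMENT)
for candidate in ("CAHC", "CGHC", "WWWW"):
    print(candidate, round(score(profile, candidate), 2))
```

Sequences resembling the alignment score high, unrelated ones score low; the pseudocount keeps residues never seen in a column from scoring minus infinity.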

Page 40: Understanding proteins: resources for identification and annotation.

Three (4) different protein signature approaches

• Single motif methods: Patterns

• Multiple motif methods: Fingerprints

• Full alignment methods: Profiles & Hidden Markov models (HMMs)
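The simplest of these approaches, the pattern, is essentially a regular expression. The sketch below translates PROSITE-style pattern syntax into a Python regex; the example pattern is written in the style of a C2H2 zinc-finger pattern (check PROSITE for the authoritative definition), the test sequence is synthetic, and `prosite_to_regex` is a made-up helper that covers only the common syntax elements.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                       # element separator
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = repeat count
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("<", "^").replace(">", "$")    # terminal anchors
    return regex


PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(PATTERN)
print(regex)

# Synthetic sequence constructed to contain one match.
sequence = "MK" + "CAACAAALAAAAAAAAHAAAH" + "GG"
match = re.search(regex, sequence)
print(match.start(), match.group())
```

A pattern either matches or it does not, which makes it fast and easy to read but less tolerant of divergent sequences than fingerprints, profiles or HMMs.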

Page 41: Understanding proteins: resources for identification and annotation.

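InterPro Consortium

[Figure: member databases grouped by what they provide (structural domains; functional annotation of families/domains; protein features (sites)) and by method (hidden Markov models, fingerprints, profiles, patterns). HAMAP is shown among the members.]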

Page 42: Understanding proteins: resources for identification and annotation.

• Signatures are provided by member databases

• They are scanned against the UniProt database to see which sequences they match

• Curators manually inspect the matches before integrating the signatures into InterPro

InterPro signature integration process

Signatures representing the same entity are integrated together

Relationships between entries are traced, where possible

Curators add literature-referenced abstracts, cross-references to other databases, and GO terms

Page 43: Understanding proteins: resources for identification and annotation.

http://www.ebi.ac.uk/interpro/

Page 44: Understanding proteins: resources for identification and annotation.

Using InterPro

Let’s find some information about T-cell surface antigen CD4 in InterPro

Search using the keyword: CD4

Page 45: Understanding proteins: resources for identification and annotation.

Results from the “CD4” key word search

Page 46: Understanding proteins: resources for identification and annotation.

Family-centered view

[Screenshot labels: Type, Name, Identifier, Contributing signatures, Description, GO terms, References]

Page 47: Understanding proteins: resources for identification and annotation.

Using InterPro

Search using the human CD4 protein sequence

Page 48: Understanding proteins: resources for identification and annotation.

Protein-centered view

[Screenshot labels: Type, Name, Identifier, Domains, Family]

Page 49: Understanding proteins: resources for identification and annotation.

Domain-centered view

[Screenshot labels: Type, Name, Identifier, Contributing signatures, Description, References]

Page 50: Understanding proteins: resources for identification and annotation.

Using InterPro with unknown sequences: InterProScan

Search with unknown protein sequence

InterProScan is the software package that allows sequences to be scanned against InterPro's signatures
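Results from an InterProScan run are often post-processed as tab-separated text. The sketch below assumes the InterProScan 5 TSV layout (protein accession, MD5, length, analysis, signature accession, description, start, stop, score, status, date, then optional InterPro accession and description); column positions should be checked against the documentation for the version in use. The rows, accessions and names are placeholders, not real matches.

```python
from typing import NamedTuple


class Match(NamedTuple):
    protein: str
    analysis: str
    signature: str
    start: int
    stop: int
    interpro: str  # "-" when the signature is not integrated


def parse_tsv(text):
    """Parse InterProScan-style TSV rows into Match records."""
    matches = []
    for line in text.splitlines():
        columns = line.split("\t")
        if len(columns) < 11:
            continue  # not a complete match row
        interpro = columns[11] if len(columns) > 11 else "-"
        matches.append(Match(columns[0], columns[3], columns[4],
                             int(columns[6]), int(columns[7]), interpro))
    return matches


MD5 = "0" * 32  # placeholder checksum
TSV = "\n".join([
    "\t".join(["query1", MD5, "458", "Pfam", "PF00000", "Example domain",
               "30", "120", "1.2E-10", "T", "01-01-2020",
               "IPR000000", "Example entry"]),
    "\t".join(["query1", MD5, "458", "PRINTS", "PR00000", "Example print",
               "200", "260", "3.4E-5", "T", "01-01-2020"]),
])

matches = parse_tsv(TSV)
for m in matches:
    print(m.signature, m.start, m.stop, m.interpro)
```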

Page 51: Understanding proteins: resources for identification and annotation.

InterPro entries and contributing signatures

Unintegrated signatures (not reviewed)

Page 52: Understanding proteins: resources for identification and annotation.

InterPro usage

Within the EBI

• Used by UniProtKB curators in their annotation of Swiss-Prot proteins

• Forms part of the automated system that adds annotation to UniProtKB/TrEMBL

• Provides matches to over 80% of UniProtKB

• Source of >60 million Gene Ontology (GO) mappings to >17 million distinct UniProtKB sequences

Outside the EBI

• 50,000 unique visitors to the web site per month

• >2 million sequences searched online per month

• Plus offline searches with downloadable version

Page 53: Understanding proteins: resources for identification and annotation.

• Probabilistic models != biological certainty

• We are using biologically-unaware search tools and probabilistic models

• Ask questions, weigh the evidence

Remember!

Page 54: Understanding proteins: resources for identification and annotation.

Caveats

We need your feedback!

• missing/additional references

• reporting problems

• requests

• Sheer amount of data can be overwhelming

• Member databases do not always agree!

• InterPro entries are based on signatures supplied to us by our member databases

....this means no signature, no entry!

[email protected]

Page 55: Understanding proteins: resources for identification and annotation.

www.ebi.ac.uk/training/online/course-list/introduction-protein-classification-ebi

www.ebi.ac.uk/training/online/course/interpro-quick-tour

www.ebi.ac.uk/training/online/course/interpro-functional-and-structural-analysis-protei

Find out more: EBI online courses

Page 56: Understanding proteins: resources for identification and annotation.

Acknowledgements & questions

Amaia Sangrador

InterPro curation team

EBI-EMBL

[email protected]

Page 57: Understanding proteins: resources for identification and annotation.

PDBe: Protein Data Bank in Europe

Contact:

Gary Battle

Project Leader Outreach

PDBe

[email protected]

http://www.facebook.com/proteindatabank

http://twitter.com/PDBeurope

Page 58: Understanding proteins: resources for identification and annotation.

PDBe overview

• Mission: Bringing Structure to Biology

• Major activities:

• Deposition and annotation site for structural data on biomacromolecules (X-ray, NMR, EM)

• Integration of macromolecular structure data with important biological and chemical data resources

• Provide tools and services for accessing, exploiting and disseminating structural data to the wider biomedical community

Page 59: Understanding proteins: resources for identification and annotation.

Worldwide Protein Data Bank (wwPDB)

Page 60: Understanding proteins: resources for identification and annotation.
Page 61: Understanding proteins: resources for identification and annotation.
Page 62: Understanding proteins: resources for identification and annotation.

PDBeXplore

Browse the PDB using familiar classification systems (enzymes, folds, families, compounds, taxonomy, sequence).

Latest structures: pdbe.org/pdbexplore

Page 63: Understanding proteins: resources for identification and annotation.

PDBePISA

Exploration of macromolecular (protein, DNA/RNA and ligand) interfaces and prediction of probable quaternary structures.

Predict quaternary structure: pdbe.org/pisa

Page 64: Understanding proteins: resources for identification and annotation.

PDBeFold

Interactive comparison, alignment and superposition based on protein secondary structure.

Find similar structures: pdbe.org/fold

Page 65: Understanding proteins: resources for identification and annotation.

PDBeMotif

Flexible 3D search and analysis of protein-ligand interactions, binding environments and structural motifs.

Analyse binding sites and motifs: pdbe.org/motif

Page 66: Understanding proteins: resources for identification and annotation.

NMR resources and services

Visualisation and validation of NMR models and data.

NMR resources: pdbe.org/nmr

Page 67: Understanding proteins: resources for identification and annotation.

EM resources and services

Comprehensive search and analysis tools for EMDB entries.

EM resources: pdbe.org/em

Page 68: Understanding proteins: resources for identification and annotation.

Electron Microscopy Data Bank (EMDB)

• Global public repository for EM density maps of macromolecular complexes and subcellular structures

• Founded at EBI in 2002

• Jointly operated by PDBe, RCSB and NCMI

• PDBe EM portal provides advanced search, visualisation and analysis services.

http://pdbe.org/emdb

Page 69: Understanding proteins: resources for identification and annotation.

Educational resources: Quips

Interactive exploration of interesting structures from the PDB

Quite interesting PDB structures: pdbe.org/quips

Page 70: Understanding proteins: resources for identification and annotation.

Stay informed…

http://www.facebook.com/proteindatabank

http://twitter.com/PDBeurope

Page 71: Understanding proteins: resources for identification and annotation.

www.ebi.ac.uk/training/online/course/pdbe-quick-tour/

Find out more: EBI online courses

Page 72: Understanding proteins: resources for identification and annotation.

Acknowledgements & questions

Gary Battle

EBI-EMBL

[email protected]