Understanding proteins: resources for identification and annotation.
The Gene Ontology: Annotating protein function, role and localization
Contact:
Jane Lomax
Coordinator, GO Editorial Office
EBI-EMBL
What is an ontology?
→ Collectibles & art
→ Stamps
→ UK (Great Britain)
→ Victoria
→ 1884 GREAT BRITAIN 10S SCOTT (11,999.99$)
A definition...
“A controlled representation of ideas, concepts or events in a given domain and the relationships between them.”
Why do we need ontologies?
Help with data retrieval: allow grouping of annotations
brain 20
hindbrain 15
rhombomere 10
Adapted from Barry Smith: http://ontology.buffalo.edu/smith/BioOntology_Course.html
Query ‘brain’ without ontology: 20
Query ‘brain’ with ontology: 45
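The brain example above can be sketched in a few lines of Python. This is a minimal illustration, assuming that rhombomere is a descendant of hindbrain and hindbrain a descendant of brain (the slide's totals imply this hierarchy). The names `PARENT`, `DIRECT` and `query` are made up for the sketch.

```python
# Toy ontology: each term maps to its parent (None marks the root).
# Assumed hierarchy: rhombomere -> hindbrain -> brain.
PARENT = {"brain": None, "hindbrain": "brain", "rhombomere": "hindbrain"}

# Number of annotations made directly to each term (figures from the slide).
DIRECT = {"brain": 20, "hindbrain": 15, "rhombomere": 10}


def is_descendant_or_self(term, ancestor):
    """Walk up the parent chain from term and report whether ancestor is reached."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT[term]
    return False


def query(term, use_ontology):
    """Count annotations for a term, optionally including all of its descendants."""
    if not use_ontology:
        return DIRECT.get(term, 0)
    return sum(n for t, n in DIRECT.items() if is_descendant_or_self(t, term))


print(query("brain", use_ontology=False))  # 20
print(query("brain", use_ontology=True))   # 45
```

Without the hierarchy, the query only sees annotations made to the exact term. With it, annotations to more specific terms are grouped under their ancestors.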
Make data (re-)usable through standards
• Common structure and terminology (controlled vocabulary)
• Avoid redundancies (single data source)
• Allow common tools, techniques, training, validation...
Gene ontology
What is the gene ontology?
Organized, controlled vocabulary of terms that describe the characteristics of gene products.
http://geneontology.org/
• Represents gene product properties, not gene products themselves
• Three branches (domains): cellular component, molecular function, biological process
• Species-independent (with taxonomic restrictions)
• Represents physiological processes
• Goes up to the level of the cell
The Gene Ontology is like a dictionary
term: transcription initiation
definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.
id: GO:0006352
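As a rough illustration of the dictionary analogy, the entry above can be written as a tag-value stanza in the style of the OBO flat-file format and read into a Python dictionary. This is a sketch only: the stanza reuses the slide's wording rather than the current GO definition, the empty `[]` cross-reference list is a placeholder, and `parse_term` is a hypothetical helper, not part of any GO library.

```python
# An OBO-style stanza built from the values shown on the slide.
STANZA = """
[Term]
id: GO:0006352
name: transcription initiation
def: "Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter." []
"""


def parse_term(stanza):
    """Collect 'tag: value' lines of one stanza into a dict of lists."""
    term = {}
    for line in stanza.strip().splitlines():
        line = line.strip()
        if not line or line == "[Term]":
            continue
        tag, _, value = line.partition(": ")
        # A tag may appear more than once in a stanza, so keep every value.
        term.setdefault(tag, []).append(value)
    return term


term = parse_term(STANZA)
print(term["id"][0], "=", term["name"][0])
```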
How does GO work?
Clark et al., 2005
part_of
is_a
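The two relationship types can be pictured as labelled edges in a graph. The sketch below assumes a deliberately simplified toy graph (the real GO structure has more terms and more parents per term) and shows how all ancestors of a term are collected by following both is_a and part_of edges. Under this reading, a gene product annotated to a term is implicitly annotated to every ancestor as well.

```python
# Toy graph: term -> list of (relationship, parent). Simplified for
# illustration; not the exact structure of the Gene Ontology.
EDGES = {
    "nucleolus": [("part_of", "nucleus")],
    "nucleus": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
    "cellular component": [],
}


def ancestors(term):
    """Return every term reachable by following is_a / part_of edges upward."""
    found = set()
    stack = [term]
    while stack:
        for _relationship, parent in EDGES[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("nucleolus")))
```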
GO tree and annotations
GO terms for Caspase 9
An annotation example…
[Figure: gene tree clustering all genes (14010) across attacked and control samples over time. Clusters are labelled with GO terms:]
• Puparial adhesion; molting cycle; hemocyanin
• Defense response; immune response; response to stimulus; Toll regulated genes; JAK-STAT regulated genes
• Immune response; Toll regulated genes
• Amino acid catabolism; lipid metabolism
• Peptidase activity; protein catabolism; immune response
Which processes are up- or down-regulated?
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
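One common way to ask which processes stand out in a cluster of co-regulated genes is a GO term over-representation test. The slides do not say which method was used in this study, so the following is only a sketch of that general idea, using a one-sided hypergeometric test. The function name and the counts in the usage line are hypothetical, apart from the total of 14010 genes taken from the figure.

```python
from math import comb


def enrichment_p_value(total_genes, genes_with_term, cluster_size, cluster_with_term):
    """Probability of seeing at least cluster_with_term annotated genes in a
    cluster of cluster_size genes drawn at random from total_genes, of which
    genes_with_term carry the GO term (one-sided hypergeometric test)."""
    upper = min(genes_with_term, cluster_size)
    favourable = sum(
        comb(genes_with_term, i)
        * comb(total_genes - genes_with_term, cluster_size - i)
        for i in range(cluster_with_term, upper + 1)
    )
    return favourable / comb(total_genes, cluster_size)


# Hypothetical counts: 200 genes annotated to a term, a cluster of 150 genes,
# 12 of which carry the term.
p = enrichment_p_value(14010, 200, 150, 12)
print(f"p = {p:.3g}")
```

A small p-value suggests the term occurs in the cluster more often than chance alone would explain. In practice, a correction for testing many terms at once is also needed.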
QuickGO: browsing GO
http://www.ebi.ac.uk/QuickGO/
• Term definition
• Term relationships (ancestors)
• Term relationships (children)
• Proteins annotated to term
Annotation and ontology files
www.geneontology.org/GO.downloads.shtml
Ontology files:
• Hold ontology terms and structure
• Species-independent
• You can get GO-slims
Annotation files:
• Hold list of terms and the proteins annotated with them
• You can get species-specific files or the whole annotation.
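GO annotation files are commonly distributed in the tab-separated GAF format. The sketch below assumes the 17-column GAF 2.x layout and builds the "term to annotated proteins" list described above. The accessions, gene symbols, reference and the term `GO:0000000` in the sample rows are invented for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict

# The 17 tab-separated columns of a GAF 2.x annotation line.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf(lines):
    """Yield one dict per annotation line, skipping '!' comment lines."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        yield dict(zip(GAF_COLUMNS, line.rstrip("\n").split("\t")))


def proteins_by_term(lines):
    """Map each GO ID to the set of protein accessions annotated with it."""
    index = defaultdict(set)
    for row in parse_gaf(lines):
        index[row["go_id"]].add(row["db_object_id"])
    return dict(index)


def _row(accession, symbol, go_id, evidence, aspect):
    """Build one made-up GAF line (all values are placeholders)."""
    fields = ["UniProtKB", accession, symbol, "", go_id, "PMID:0000000",
              evidence, "", aspect, "", "", "protein", "taxon:9606",
              "20200101", "UniProt", "", ""]
    return "\t".join(fields)


SAMPLE = [
    "!gaf-version: 2.2",
    _row("X00001", "GENE1", "GO:0006352", "IDA", "P"),
    _row("X00002", "GENE2", "GO:0006352", "IEA", "P"),
    _row("X00001", "GENE1", "GO:0000000", "IEA", "C"),
]

for go_id, proteins in proteins_by_term(SAMPLE).items():
    print(go_id, sorted(proteins))
```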
More about GO: EBI train online
www.ebi.ac.uk/training/online/course/go-quick-tour
www.ebi.ac.uk/training/online/course/uniprot-goa-quick-tour
Acknowledgements & questions
Jane Lomax
Coordinator, GO Editorial Office
EBI-EMBL
UniProt: A repository of annotated protein sequences
Contact:
Duncan Legge
UniProt Content Team
EBI-EMBL
Background of UniProt
Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD
Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database
We Aim To Provide…
o A high-quality protein sequence database: a non-redundant protein database with maximal coverage, including splice isoforms, disease variants and PTMs. Sequence archiving is essential.
o Easy protein identification: stable identifiers and consistent nomenclature / controlled vocabularies
o Thorough protein annotation: detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources
The Two Sides of UniProtKB
UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed); 1 entry per protein
UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed); 1 entry per nucleotide submission
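In UniProt FASTA downloads, the two sections can be told apart from the header line, which starts with `sp|` for Swiss-Prot and `tr|` for TrEMBL. The sketch below parses that prefix. The function name is hypothetical, the headers are shortened, and the accessions are used for illustration only (the TrEMBL one is invented).

```python
import re

# >db|Accession|EntryName ...   where db is 'sp' (Swiss-Prot) or 'tr' (TrEMBL)
HEADER = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)")


def classify(header):
    """Extract accession, entry name and review status from a FASTA header."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")
    db, accession, entry_name = match.groups()
    reviewed = db == "sp"
    return {
        "accession": accession,
        "entry_name": entry_name,
        "section": "UniProtKB/Swiss-Prot" if reviewed else "UniProtKB/TrEMBL",
        "reviewed": reviewed,
    }


# Illustrative headers; the second accession is a made-up placeholder.
print(classify(">sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 OS=Homo sapiens"))
print(classify(">tr|X0X0X0|X0X0X0_HUMAN Uncharacterized protein OS=Homo sapiens"))
```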
Data sources of UniProtKB
[Diagram: data sources feeding UniProt/TrEMBL]
• ENA (EMBL) DNA database
• Ensembl
• VEGA (Sanger)
• WormBase
• FlyBase
• PDB
• Patent data
• Sub/Peptide data
• mRNA data
Curation of a UniProt/SwissProt entry
[Diagram: a UniProt/TrEMBL entry becomes a UniProt/SwissProt entry through curation of:]
• Sequence
• Sequence variants
• Nomenclature
• Sequence features
• Ontologies
• Literature annotations
• References
UniProt Website
www.uniprot.org
UniProt layout
Annotation comments
FUNCTION, SUBCELLULAR LOCATION, ALTERNATIVE PRODUCTS, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, INDUCTION, SIMILARITY, CATALYTIC ACTIVITY, COFACTOR, ENZYME REGULATION, BIOPHYSICOCHEMICAL PROPERTIES, PATHWAY, SUBUNIT, INTERACTION,
PTM, RNA EDITING, MASS SPECTROMETRY, DOMAIN, POLYMORPHISM, DISRUPTION PHENOTYPE, ALLERGEN, DISEASE, TOXIC DOSE, BIOTECHNOLOGY, PHARMACEUTICAL, MISCELLANEOUS, CAUTION, SEQUENCE CAUTION, WEB RESOURCE
Controlled vocabularies used whenever possible
Evidence tags to show source
Proteomes in UniProt
Complete proteomes: complete sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced.
Reference proteomes: some complete proteomes have been selected as reference proteome sets. These cover the proteomes of well-studied model organisms and other proteomes of interest for biomedical research.
Obtaining Proteomes
Help / Feedback
• Stuck? Just ask – active help and support team
• Feedback – if you find something incorrect, outdated, missing, etc., please tell us.
www.ebi.ac.uk/training/online/course/uniprot-quick-tour/
Find out more: EBI online courses
Acknowledgements & questions
Duncan Legge
UniProt Content Team
EBI-EMBL
InterPro: An integrated protein sequence analysis resource
Contact:
Amaia Sangrador
InterPro curation Team
EBI-EMBL
What is InterPro?
• InterPro is a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites
• It combines predictive models (known as signatures) from different databases to provide functional analysis of protein sequences by classifying them into families and predicting domains and important sites
The aim of InterPro
InterPro
Protein annotation: a predictive approach
• This is the approach taken by protein signature databases
• Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment
• We can use these models to infer relationships with the characterised sequences from which the alignment was constructed
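The simplest kind of signature, a single-motif pattern, can be sketched as a regular expression. The example below assumes a pattern written in PROSITE-style syntax (residues separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden ones, `(n)` or `(n,m)` for repeats) and converts it to a Python regex. The converter, the example pattern's use here and the test sequence are illustrative only, not InterPro's own implementation.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if "(" in element:            # repeat count, e.g. x(2,4)
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":            # any residue
            core = "."
        elif element.startswith("{"):  # forbidden residues
            core = "[^" + element[1:-1] + "]"
        else:                         # single residue or [allowed set]
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# A zinc-finger-like pattern in PROSITE-style syntax (illustrative).
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(PATTERN)

# Made-up sequence containing one match.
sequence = "AA" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "AA"
hit = re.search(regex, sequence)
print(regex, hit.start() if hit else None)
```

A match only says that the sequence fits the conserved pattern. Whether it shares the function of the characterised sequences is an inference that still has to be weighed.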
Three (4) different protein signature approaches
• Single motif methods: patterns
• Multiple motif methods: fingerprints
• Full alignment methods: profiles & hidden Markov models (HMMs)
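To make the full-alignment idea concrete, here is a minimal sketch of a profile (a position-specific scoring matrix) built from a toy ungapped alignment. It assumes a uniform background frequency and a simple pseudocount, which real profile and HMM methods refine considerably. The alignment, function names and query sequence are all invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best score, position)."""
    width = len(profile)
    return max(
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Toy ungapped alignment of a hypothetical four-residue motif.
ALIGNMENT = ["CAHC", "CGHC", "CAHC", "CSHC"]
profile = build_profile(ALIGNMENT)
print(best_hit(profile, "MKCAHCLL"))
```

Unlike a pattern, a profile gives every residue at every position a score, so near matches are ranked rather than rejected outright.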
[Diagram: member database signatures grouped by purpose (structural domains; functional annotation of families/domains; protein features (sites)) and by method (hidden Markov models, fingerprints, profiles, patterns, HAMAP)]
InterPro Consortium
• Signatures are provided by member databases
• They are scanned against the UniProt database to see which sequences they match
• Curators manually inspect the matches before integrating the signatures into InterPro
InterPro signature integration process
Signatures representing the same entity are integrated together
Relationships between entries are traced, where possible
Curators add literature-referenced abstracts, cross-references to other databases, and GO terms
Using InterPro
Let’s find some information about T-cell surface antigen CD4 in InterPro
Search using the key word: CD4
Results from the “CD4” key word search
Family-centered view: type, name, identifier, contributing signatures, description, GO terms, references
Using InterPro
Search using human CD4 protein sequence
Protein-centered view: type, name, identifier, domains, family
Domain-centered view: type, name, identifier, contributing signatures, description, references
Using InterPro with unknown sequences: InterProScan
Search with unknown protein sequence
InterProScan is the software package that allows sequences to be scanned against InterPro's signatures
InterPro entries and contributing signatures
Unintegrated signatures (not reviewed)
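The split between integrated and unintegrated signatures can also be recovered from InterProScan's tab-separated output. The sketch below assumes the usual TSV column order (analysis in column 4, signature accession in column 5, InterPro accession in column 12, with `-` or an empty field when a signature is unintegrated). Column positions should be checked against the InterProScan version in use, and every accession in the sample rows is a made-up placeholder.

```python
from collections import defaultdict


def split_matches(tsv_lines):
    """Group signature matches by InterPro entry; collect unintegrated ones."""
    integrated = defaultdict(list)
    unintegrated = []
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        signature = (fields[3], fields[4])  # (member database, signature accession)
        entry = fields[11] if len(fields) > 11 else "-"
        if entry in ("", "-"):
            unintegrated.append(signature)
        else:
            integrated[entry].append(signature)
    return dict(integrated), unintegrated


def _row(analysis, signature, entry):
    """Build one made-up InterProScan TSV line (all values are placeholders)."""
    return "\t".join([
        "seq1", "md5placeholder", "458", analysis, signature,
        "hypothetical signature", "26", "125", "1.0E-20", "T",
        "01-01-2020", entry, "hypothetical entry",
    ])


SAMPLE = [
    _row("Pfam", "PF00000", "IPR000000"),
    _row("SMART", "SM00000", "IPR000000"),
    _row("PANTHER", "PTHR00000", "-"),
]

integrated, unintegrated = split_matches(SAMPLE)
print(integrated)
print(unintegrated)
```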
InterPro usage
Within the EBI:
• Used by UniProtKB curators in their annotation of Swiss-Prot proteins
• Forms part of the automated system that adds annotation to UniProtKB/TrEMBL
• Provides matches to over 80% of UniProtKB
• Source of >60 million Gene Ontology (GO) mappings to >17 million distinct UniProtKB sequences
Outside the EBI:
• 50,000 unique visitors to the web site per month
• >2 million sequences searched online per month
• Plus offline searches with downloadable version
Remember!
• Probabilistic models != biological certainty
• We are using biologically-unaware search tools and probabilistic models
• Ask questions, weigh the evidence
Caveats
• Sheer amount of data can be overwhelming
• Member databases do not always agree!
• InterPro entries are based on signatures supplied to us by our member databases. This means no signature, no entry!
We need your feedback!
• Missing/additional references
• Reporting problems
• Requests
www.ebi.ac.uk/training/online/course-list/introduction-protein-classification-ebi
www.ebi.ac.uk/training/online/course/interpro-quick-tour
www.ebi.ac.uk/training/online/course/interpro-functional-and-structural-analysis-protei
Find out more: EBI online courses
Acknowledgements & questions
Amaia Sangrador
InterPro curation team
EBI-EMBL
PDBe: Protein Data Bank in Europe
Contact:
Gary Battle
Project Leader Outreach
PDBe
http://www.facebook.com/proteindatabank
http://twitter.com/PDBeurope
PDBe overview
• Mission: Bringing Structure to Biology
• Major activities:
• Deposition and annotation site for structural data on biomacromolecules (X-ray, NMR, EM)
• Integration of macromolecular structure data with important biological and chemical data resources
• Provide tools and services for accessing, exploiting and disseminating structural data to the wider biomedical community
Worldwide Protein Data Bank (wwPDB)
PDBeXplore
Browse the PDB using familiar classification systems (enzymes, folds, families, compounds, taxonomy, sequence).
Latest structures: pdbe.org/pdbexplore
PDBePISA
Exploration of macromolecular (protein, DNA/RNA and ligand) interfaces and prediction of probable quaternary structures.
Predict quaternary structure: pdbe.org/pisa
PDBeFold
Interactive comparison, alignment and superposition based on protein secondary structure.
Find similar structures: pdbe.org/fold
PDBeMotif
Flexible 3D search and analysis of protein-ligand interactions, binding environments and structural motifs.
Analyse binding sites and motifs: pdbe.org/motif
NMR resources and services
Visualisation and validation of NMR models and data.
NMR resources: pdbe.org/nmr
EM resources and services
Comprehensive search and analysis tools for EMDB entries.
EM resources: pdbe.org/em
Electron Microscopy Data Bank (EMDB)
• Global public repository for EM density maps of macromolecular complexes and subcellular structures
• Founded at EBI in 2002
• Jointly operated by PDBe, RCSB and NCMI
• PDBe EM portal provides advanced search, visualisation and analysis services.
http://pdbe.org/emdb
Educational resources: Quips
Interactive exploration of interesting structures from the PDB
Quite interesting PDB structures: pdbe.org/quips
Stay informed…
http://www.facebook.com/proteindatabank
http://twitter.com/PDBeurope
www.ebi.ac.uk/training/online/course/pdbe-quick-tour/
Find out more: EBI online courses