Protein function Where to find it. How to predict it. How to classify it. Stuart Rison Department of...
Protein function
Where to find it. How to predict it. How to classify it.
Stuart Rison
Department of Biochemistry, UCL
Outline
Collecting functional information:
  Small scale (single gene)
  Large scale (sets of genes)
Function annotation schemes
Problems with functional assignments
[Comparing current schemes]
Collecting information for single genes
from 1° databases
from 2° databases
from Genome Databases (Model organisms)
by homology
not by homology
Annotation in databases: 1° and 2° databases
Some information can be found in 'primary' databases (sequence and structure databases)
Usually limited, although it can sometimes be quite informative (e.g. SwissProt):
  Core data: sequence, citation information and taxonomic data
  Annotation: protein function; post-translational modifications; domains and sites; associated diseases; sequence conflicts/variants
Most primary databases link to a number of value-added (2°) databases (e.g. motif databases or disease databases) which are often rich in information
Annotation in 1° databases: SwissProt
ID HEM3_HUMAN STANDARD; PRT; 361 AA.
AC P08397; P08396; Q16012;
…
DE PORPHOBILINOGEN DEAMINASE (EC 4.3.1.8) (HYDROXYMETHYLBILANE SYNTHASE)
DE (HMBS) (PRE-UROPORPHYRINOGEN SYNTHASE) (PBG-D).
GN HMBS OR PBGD.
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
…(literature references)…
CC FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC CATALYTIC ACTIVITY: 4 PORPHOBILINOGEN + H(2)O = HYDROXYMETHYLBILANE + 4 NH(3).
CC COFACTOR: BINDS A DIPYRROMETHANE COFACTOR TO WHICH PORPHOBILINOGEN SUBUNITS…
CC PATHWAY: THIRD STEP IN PORPHYRIN BIOSYNTHESIS BY THE SHEMIN PATHWAY. INVOLVED…
CC ALTERNATIVE PRODUCTS: THERE ARE TWO ISOZYMES OF THIS ENZYME IN MAMMALS; THEY
CC ARE PRODUCED BY THE SAME GENE FROM ALTERNATIVE SPLICING…
CC DISEASE: DEFECTS IN HMBS ARE THE CAUSE OF ACUTE INTERMITTENT PORPHYRIA (AIP); AN
CC AUTOSOMAL DOMINANT DISEASE CHARACTERIZED BY ACUTE ATTACKS OF NEUROLOGICAL
CC DYSFUNCTION…
CC SIMILARITY: BELONGS TO THE HMBS FAMILY.
… (links to related databases - secondary databases) …
KW Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW Alternative splicing; Disease mutation.
…
(Sequence variations/Sequence)
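Because every line of the entry starts with a two-letter line code (ID, AC, DE, CC, KW), the annotation can be pulled out with very little code. The following is a minimal sketch, assuming the abbreviated layout shown on this slide; the names `parse_record` and `split_list` are illustrative, and the topic-detection regular expression is a heuristic (it also tolerates the `-!-` topic marker found in full flat-file entries).

```python
import re
from collections import defaultdict

# Abbreviated record copied from the slide; this is NOT a complete entry.
RECORD = """\
ID   HEM3_HUMAN     STANDARD;      PRT;   361 AA.
AC   P08397; P08396; Q16012;
GN   HMBS OR PBGD.
CC   FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC   HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC   SIMILARITY: BELONGS TO THE HMBS FAMILY.
KW   Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW   Alternative splicing; Disease mutation.
"""

# Heuristic: a comment topic is an upper-case label followed by ": ".
TOPIC = re.compile(r"^(?:-!- )?([A-Z][A-Z ]*): (.*)$")


def parse_record(text):
    """Return (fields, comments): raw lines per line code, and CC text per topic."""
    fields = defaultdict(list)
    comments = {}
    current = None
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "CC":
            match = TOPIC.match(value)
            if match:                      # a new topic starts here
                current = match.group(1)
                comments[current] = match.group(2)
            elif current is not None:      # continuation of the previous topic
                comments[current] += " " + value
        else:
            fields[code].append(value)
    return dict(fields), comments


def split_list(lines):
    """Split semicolon-separated lines (AC, KW) into a flat list of items."""
    items = []
    for line in lines:
        items.extend(p.strip(" .") for p in line.split(";") if p.strip(" ."))
    return items


fields, comments = parse_record(RECORD)
accessions = split_list(fields["AC"])
keywords = split_list(fields["KW"])
print(accessions)
print(comments["FUNCTION"])
```

Keeping the comment topics in a dictionary makes it easy to ask for just the FUNCTION or DISEASE text of many entries at once.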
Genome databases
Some deal with single organisms (e.g. SubtiList for B. subtilis; Sanger Centre M. tuberculosis)
Some deal with multiple genomes (e.g. TIGR microbial genomes database)
The level of annotation can be extensive
Many are much more than sequence repositories, extending the sequence with a wealth of information (e.g. mutants; strains; complementation plasmids etc.)
If you are working with a model organism, chances of obtaining reliable functional annotations are improved
Function assignment by homology I
If you just have a sequence
The most common bioinformatics procedure
Search your protein of interest against primary databases; chances are that if you find a high-identity homologue, it performs a similar function
Many, many tools (BLAST, FASTA, Smith-Waterman search)
Beware of annotation by homology:
  the relationship between sequence similarity and function is not straightforward
  danger of propagating incorrect functional information
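One way to limit the propagation problem is to record where each annotation came from and to refuse to copy an annotation that was itself inferred by homology. The sketch below illustrates the idea under stated assumptions: the 40% identity cut-off, the record fields, and the function names are all hypothetical choices for illustration, not recommended values or an existing tool's behaviour.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned (non-gap) columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def transfer_annotation(query_id, hit, identity, threshold=40.0):
    """Copy a function from `hit` only if it is experimental and similar enough.

    The threshold is an arbitrary illustrative value (assumption).
    """
    if hit["evidence"] != "experimental":
        return None  # do not chain homology-based annotations
    if identity < threshold:
        return None
    return {
        "protein": query_id,
        "function": hit["function"],
        "evidence": "by homology",
        "source": hit["id"],
        "identity": round(identity, 1),
    }


# Toy alignment: one gap column and one mismatch out of ten aligned columns.
identity = percent_identity("MKT-AYIAKQR", "MKTLAYLAKQR")

experimental_hit = {"id": "HIT1", "function": "kinase", "evidence": "experimental"}
inferred_hit = {"id": "HIT2", "function": "kinase", "evidence": "by homology"}

accepted = transfer_annotation("QUERY1", experimental_hit, identity)
rejected = transfer_annotation("QUERY1", inferred_hit, identity)
print(identity, accepted, rejected)
```

Even with such a filter, a high identity score only suggests a similar function, so the transferred annotation is tagged "by homology" rather than treated as established.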
Function assignment by homology II
Consider databases which distinguish experimental function assignments from homology based ones (e.g. YPD/WormPD, EcoCyc)
Or use databases which employ more rigorous automated annotation tools (e.g. HAMAP @ SwissProt)
“Among the peculiarities recognized by the programs are: size discrepancy, absence or mutation of regions involved in activity or binding (to metals, nucleotides, etc), presence of paralogs, contradiction with the biological context (i.e. if a protein belongs to a pathway supposed to be absent in a particular organism), etc. Such "problematic" proteins will not be automatically annotated.”
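The checks listed in the quotation can be pictured as a set of rules that either clear a protein for automatic annotation or flag it for manual inspection. The following is a hedged sketch of that idea only: the data layout, field names, and `find_problems` function are hypothetical and do not reproduce the actual HAMAP programs, and real systems would locate key residues through an alignment rather than by raw sequence position.

```python
def find_problems(candidate, family):
    """Return a list of reasons why `candidate` should not be auto-annotated."""
    problems = []
    seq = candidate["sequence"]

    # Size discrepancy relative to the family's expected length range.
    low, high = family["length_range"]
    if not low <= len(seq) <= high:
        problems.append("size discrepancy")

    # Absence or mutation of residues involved in activity or binding
    # (1-based positions; a simplification, see the lead-in).
    for position, residue in family["key_residues"].items():
        if position > len(seq) or seq[position - 1] != residue:
            problems.append(f"key residue {residue}{position} absent or mutated")

    # Presence of paralogs in the same genome.
    if candidate["paralogs_in_genome"] > 0:
        problems.append("paralogs present")

    # Contradiction with the biological context.
    if family["pathway"] not in candidate["organism_pathways"]:
        problems.append("pathway absent from organism")

    return problems


# Toy family rule and candidates (all values invented for illustration).
family = {
    "length_range": (5, 8),
    "key_residues": {3: "C", 6: "H"},
    "pathway": "porphyrin biosynthesis",
}
clean = {
    "sequence": "MACDKH",
    "paralogs_in_genome": 0,
    "organism_pathways": {"porphyrin biosynthesis"},
}
suspect = {
    "sequence": "MASD",
    "paralogs_in_genome": 1,
    "organism_pathways": set(),
}

print(find_problems(clean, family))
print(find_problems(suspect, family))
```

A protein with an empty problem list would be annotated automatically; anything else would be left for a curator.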
Functional assignment “without homology”
Novel functional assignment methods now exist which don’t make use of ‘direct’ homology searches
They exploit other relationships between proteins which are used as indicators of shared function:
  Phylogenetic profiles
  “Rosetta stone genes”
Phylogenetic profiles
Pellegrini M et al., “Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.” PNAS (1999) 96(8):4285-8
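A phylogenetic profile records, for each protein, in which genomes a homologue is present; proteins with matching profiles are candidates for a functional link. The sketch below shows the simplest form of this idea (grouping identical profiles). The genome names, protein names, and presence data are invented for illustration, and the published method also considers similar, not only identical, profiles.

```python
from collections import defaultdict

# Reference genomes, in a fixed order (hypothetical names).
GENOMES = ["genome_A", "genome_B", "genome_C", "genome_D"]

# For each protein, the genomes in which a homologue was found (toy data).
PRESENCE = {
    "P1": {"genome_A", "genome_C"},
    "P2": {"genome_A", "genome_C"},
    "P3": {"genome_B", "genome_C", "genome_D"},
    "P4": {"genome_A", "genome_B", "genome_C", "genome_D"},
    "P5": {"genome_B", "genome_C", "genome_D"},
}


def profile(protein):
    """Presence/absence vector of a protein across GENOMES (1 = present)."""
    return tuple(int(genome in PRESENCE[protein]) for genome in GENOMES)


def group_by_profile(proteins):
    """Group proteins sharing an identical profile; singletons are dropped."""
    groups = defaultdict(list)
    for protein in sorted(proteins):
        groups[profile(protein)].append(protein)
    return {prof: members for prof, members in groups.items() if len(members) > 1}


linked = group_by_profile(PRESENCE)
for prof, members in linked.items():
    print(prof, members)
```

Here P1 and P2 share one profile and P3 and P5 another, so each pair would be proposed as functionally linked; P4, present everywhere, is uninformative on its own.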
More methods…
Marcotte EM, et al., Nature (1999) 402:83-86
Enright AJ, et al., Nature (1999) 402:86-90
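The “Rosetta stone” idea cited here is that two separate proteins in one genome may be functionally linked if they both match different parts of a single fused protein in another genome. The following is a simplified sketch of that logic only: the hit tuples, gene names, and `rosetta_pairs` function are hypothetical, and filters used in practice (for example, against promiscuous domains) are omitted.

```python
from collections import defaultdict
from itertools import combinations

# Each hit: (query protein, subject protein, start and end on the subject).
# Toy data; in practice these would come from a sequence similarity search.
HITS = [
    ("geneA", "fusedX", 1, 180),
    ("geneB", "fusedX", 200, 410),
    ("geneC", "fusedX", 150, 300),
    ("geneD", "otherY", 1, 250),
]


def rosetta_pairs(hits):
    """Find query pairs matching non-overlapping regions of the same subject."""
    by_subject = defaultdict(list)
    for query, subject, start, end in hits:
        by_subject[subject].append((query, start, end))

    pairs = set()
    for subject, regions in by_subject.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            non_overlapping = e1 < s2 or e2 < s1
            if q1 != q2 and non_overlapping:
                pairs.add((min(q1, q2), max(q1, q2), subject))
    return sorted(pairs)


candidates = rosetta_pairs(HITS)
print(candidates)
```

Only geneA and geneB are reported: they cover separate regions of fusedX, whereas geneC overlaps both of them and geneD matches a different subject.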
Functional assignment “without homology”
Some access over the WWW, but experimental and only for certain organisms (Yeast, E. coli, M. tuberculosis)
Many proprietary methods
Considered one of the most promising solutions for preliminary annotation of “unknown function” proteins in genome sequencing projects
Collecting information for many genes
Usually for “large-scale biology” (e.g. micro-array experiments)
Genome Databases
Functional classification schemes
Genome Databases
Genome sequencing projects are now the primary driving force for extensive functional annotation
We have the genes (ORFs), we want the functions
FUNCTIONAL GENOMICS
Functional classification schemes I
Dealing with large sets of genes: functional classification schemes
Tentative schemes as early as 1983; use driven by genome sequencing projects
First extensive scheme published in 1993 by Monica Riley [regularly updated (GenProtEC; EcoCyc)]
The majority of current schemes are heavily influenced by the ‘Riley scheme’
‘2nd generation’ schemes are now being developed
Functional classification schemes II
Most schemes can be thought of as trees
Progression along the tree (root to leaves) represents increasingly specific functions
ORFs are generally associated with leaf nodes (but of course, they are also associated with intermediary nodes)
Examples of use:
  create gene sets linked by functionality (e.g. to detect functional motifs)
  validate a functional connection between genes (e.g. gene expression studies)
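A tree-like scheme can be represented by giving each ORF the path from the root to its category; a gene set for any node is then every ORF whose path passes through that node. The sketch below assumes this path representation. The category names are borrowed from a Riley-style scheme, but the ORF identifiers and their assignments are invented for illustration.

```python
# Each ORF is assigned a path from the root of the scheme to a category.
# ORF names and assignments are hypothetical.
ASSIGNMENTS = {
    "orf1": ("Metabolism of small molecules", "Energy Metabolism", "Glycolysis"),
    "orf2": ("Metabolism of small molecules", "Energy Metabolism", "Fermentation"),
    "orf3": ("Metabolism of small molecules", "Amino Acids"),
    # An ORF may also sit on an intermediary node rather than a leaf.
    "orf4": ("Metabolism of small molecules", "Energy Metabolism"),
}


def gene_set(node_path):
    """All ORFs annotated at `node_path` or anywhere below it in the tree."""
    node_path = tuple(node_path)
    depth = len(node_path)
    return sorted(
        orf for orf, path in ASSIGNMENTS.items() if path[:depth] == node_path
    )


energy = gene_set(("Metabolism of small molecules", "Energy Metabolism"))
glycolysis = gene_set(
    ("Metabolism of small molecules", "Energy Metabolism", "Glycolysis")
)
print(energy)
print(glycolysis)
```

Moving up the tree widens the gene set and moving down narrows it, which is how such schemes are used to pick groups of functionally related genes at a chosen level of specificity.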
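An example scheme…
[Figure: part of the GenProtEC classification tree. Top-level category shown: Metabolism of small molecules, with sub-categories Amino Acids, Central Intermediary Metabolism, Energy Metabolism, etc. Lower-level categories shown: Alanine, Amino sugars, Aerobic respiration, Fermentation, Glycolysis, etc. ORF counts shown on the figure: 900, 112, 2, 8, 32, 18 and 22; their assignment to individual categories is not recoverable from the transcript.]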
Issues: Apples and Oranges
Function is an umbrella catch-all term
Schemes do not distinguish between aspects of function
Most commonly they mix gene product type (T), activity (A) and cellular role (R):
  Cell division (R) : DNA replication (A)
  Osmotic adaptation (R) : Ion channel (T,A)
Issues - Multi-dimensionality I
Human trypsin functions:
  Biochemical: peptide bond hydrolysis
  Molecular: proteolytic enzyme
  Cellular: protein degradation
  Physiological: digestion
Could conceive of a number of other dimensions:
  Cellular location
  Regulation
Issues - Multi-dimensionality II
Why differentiate function and process?
[Figure: cell cycle-dependent yeast gene expression clusters (Pat Brown lab, Stanford)]
Issues - Multi-functionality
Inherent: e.g. lac repressor; carbohydrate metabolism and osmoprotection
Multi-subunit: e.g. succinic dehydrogenase; whole - enzyme in TCA; subunit 1 - electron transport chain; subunit 2 - cell structure
Circumstantial: e.g. acetate kinase; acetate only environment - acetate metabolism; acetate absent - fermentation enzyme
Gene Ontology - a collaboration
Drosophila (fruit fly) - FlyBase
Saccharomyces Genome Database (SGD)
Mus (mouse) - Mouse Genome Database (MGD)
Gene Ontology - the next generation
Multi-dimensional:
  functional primitive: “a capability that a physical gene product (or gene product group) carries as a potential” (e.g. transporter or adenylate cyclase)
  process: “a biological objective accomplished via one or more ordered assemblies of functions” (e.g. cell growth and maintenance or purine metabolism)
  cellular component
Extensive: depth 11; nearly 4000 terms
More complex organisation: away from tree structure
Theoretically applicable to all species (designed for multicellular eukaryotes)
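Moving "away from tree structure" means that a term may have more than one parent, so the scheme becomes a directed acyclic graph and annotating a gene product with a term implies all of that term's ancestors along every path. The sketch below illustrates this with a toy graph; the term names and parent links are invented for illustration (only "transporter" and "adenylate cyclase" appear on the slide) and are not taken from the Gene Ontology itself.

```python
# Toy term graph: each term maps to its list of parent terms.
# The two-parent term is a hypothetical example of a non-tree structure.
PARENTS = {
    "function": [],
    "enzyme": ["function"],
    "transporter": ["function"],
    "adenylate cyclase": ["enzyme"],
    "hypothetical transport ATPase": ["enzyme", "transporter"],
}


def ancestors(term):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def implied_terms(term):
    """The term itself plus every ancestor it implies."""
    return {term} | ancestors(term)


single_parent = ancestors("adenylate cyclase")
multi_parent = ancestors("hypothetical transport ATPase")
print(sorted(single_parent))
print(sorted(multi_parent))
```

In a strict tree, each term would have exactly one path to the root; here the two-parent term is found under both "enzyme" and "transporter", which a tree could only express by duplicating the term.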
Where to look for functional information - single protein
With 1 or a few genes:
  Primary databases (e.g. SwissProt)
  Model organism databases (e.g. GenProtEC; SGD; WormPD)
  Metabolic/Pathway databases (e.g. KEGG)
  Value-added databases (e.g. Motif databases; Disease databases)
By homology
Not by homology
Where to look for functional information - protein sets
Need some sort of functional classification scheme:
  Tree-like schemes (e.g. TIGR, GenProtEC)
  Gene Ontology (FlyBase, MGD, SGD)
For comparative genomics, need schemes applied to multiple organisms (e.g. PEDANT, TIGR)
Currently, greatest genome coverage is by PEDANT (but it is not manually curated)
Conclusions
Functional information is available but it is rarely centralised
Function is a very broad definition; hard to know if the information you need will be available at the level you need it
New schemes (e.g. GO) are emerging which try to cope with functional annotation better
And new automated functional annotation tools are emerging (‘intelligent systems’; non-homology based)
You still need to validate predictions experimentally
A survey of (some) current schemes
1) EcoCyc/GenProtEC: E. coli scheme (Riley scheme, MBL)
2) SubtiList: Bacillus subtilis scheme (Institut Pasteur)
3) MIPS/PEDANT: yeast scheme (applied to other organisms in PEDANT) (Munich Information Center for Protein Sequences)
4) TIGR: microbial genomes scheme (The Institute for Genomic Research)
5) KEGG: multi-organism scheme (metabolic and regulatory pathways) (Kyoto Encyclopedia of Genes and Genomes)
6) WIT: multi-organism scheme (metabolic reconstruction) (What Is There; ANL)
7) Gene Ontology: a 2nd generation functional classification scheme (EBI; FlyBase; MGD; SGD)
Conclusions - Scheme comparison I
Similar in their coverage of function (although ‘granularity’ varies widely)
...yet different enough that direct comparison is complex
Essentially deal with unicellular microbial organisms (MIPS is tackling this)
Certain ‘niche’ schemes (e.g. WIT/KEGG)
...or user community tailored schemes (e.g. SubtiList)
WWW sites I
Primary databases (Sequence):
  SwissProt: http://www.expasy.ch/sprot
  PIR: http://www-nbrf.Georgetown.edu/
  NCBI databases: http://www.ncbi.nlm.nih.gov/Database/index.html
Primary databases (Structure):
  Protein Data Bank: http://www.rcsb.org/
  Macromolecular Structure Database: http://msd.ebi.ac.uk/
Value added:
  INTERPRO: http://interpro.ebi.ac.uk/
WWW sites II
Single genome databases:
  SubtiList: http://genolist.pasteur.fr/SubtiList/
  Saccharomyces Genome Database: http://genomewww.stanford.edu/Saccharomyces/
  EcoCyc: http://ecocyc.pangeasystems.com/
  GenProtEC: http://genprotec.mdbl.edu/
  FlyBase: http://flybase.bio.indiana.edu/
  Mouse Genome Database (MGD): http://www.informatics.jax.org/
  Yeast Protein Database (YPD) and WormPD: http://www.proteome.com/
WWW sites III
Multiple genome databases:
  The Institute for Genomic Research: http://www.tigr.org/microbialdb
  MIPS/PEDANT: http://pedant.mips.biochem.mpg.de/
  HAMAP: http://www.expasy.ch/sprot/hamap/
Pathway databases:
  KEGG: http://www.genome.ad.jp/kegg/
  WIT: http://igweb.integratedgenomics.com/IGwit/
Non-homology based function prediction:
  Mycobacterium tuberculosis: http://www.doe-mbi.ucla.edu/people/sergio/TB/tb.html
  Yeast: http://www.doe-mbi.ucla.edu/people/marcotte/yeast.html
A relevant paper: http://www.biochem.ucl.ac.uk/~rison/Publications/index.html