Protein function

Where to find it. How to predict it. How to classify it.

Stuart Rison

Department of Biochemistry, UCL

Outline

Collecting functional information:
Small scale (single gene)
Large scale (sets of genes)

Function annotation schemes
Problems with functional assignments
[Comparing current schemes]

Collecting information for single genes

from 1° databases
from 2° databases
from Genome Databases (Model organisms)
by homology
not by homology

Annotation in databases: 1° and 2° databases

Some information can be found in 'primary' databases (sequence and structure databases)

Annotation there is usually limited, although it can sometimes be quite informative (e.g. SwissProt)

Core data: sequence, citation information and taxonomic data
Annotation: protein function; post-translational modifications; domains and sites; associated diseases; sequence conflicts/variants

Most primary databases link to a number of value-added (2°) databases (e.g. motif databases or disease databases) which are often rich in information

Annotation in 1° databases: SwissProt

ID HEM3_HUMAN STANDARD; PRT; 361 AA.

AC P08397; P08396; Q16012;

…

DE PORPHOBILINOGEN DEAMINASE (EC 4.3.1.8) (HYDROXYMETHYLBILANE SYNTHASE)

DE (HMBS) (PRE-UROPORPHYRINOGEN SYNTHASE) (PBG-D).

GN HMBS OR PBGD.

OS Homo sapiens (Human).

OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;

OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

…(literature references)…

CC FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE

CC HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.

CC CATALYTIC ACTIVITY: 4 PORPHOBILINOGEN + H(2)O = HYDROXYMETHYLBILANE + 4 NH(3).

CC COFACTOR: BINDS A DIPYRROMETHANE COFACTOR TO WHICH PORPHOBILINOGEN SUBUNITS…

CC PATHWAY: THIRD STEP IN PORPHYRIN BIOSYNTHESIS BY THE SHEMIN PATHWAY. INVOLVED…

CC ALTERNATIVE PRODUCTS: THERE ARE TWO ISOZYMES OF THIS ENZYME IN MAMMALS; THEY

CC ARE PRODUCED BY THE SAME GENE FROM ALTERNATIVE SPLICING…

CC DISEASE: DEFECTS IN HMBS ARE THE CAUSE OF ACUTE INTERMITTENT PORPHYRIA (AIP); AN

CC AUTOSOMAL DOMINANT DISEASE CHARACTERIZED BY ACUTE ATTACKS OF NEUROLOGICAL

CC DYSFUNCTION…

CC SIMILARITY: BELONGS TO THE HMBS FAMILY.

… (links to related databases - secondary databases) …

KW Porphyrin biosynthesis; Heme biosynthesis; Lyase;

KW Alternative splicing; Disease mutation.

…

(Sequence variations/Sequence)
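The line codes in the entry above (ID, AC, GN, CC, KW and so on) make the annotation easy to pull out programmatically. The following is a minimal sketch that assumes the abbreviated layout shown on the slide (a two-letter code, a space, then the value), not the full SwissProt flat-file format; the names `parse_entry` and `keywords` are illustrative only. For real entries, a maintained parser such as Biopython's `Bio.SwissProt` module is likely the safer choice.

```python
import re
from collections import defaultdict

# Abbreviated entry lines, copied from the slide (not a complete record).
ENTRY = """\
ID HEM3_HUMAN STANDARD; PRT; 361 AA.
AC P08397; P08396; Q16012;
GN HMBS OR PBGD.
OS Homo sapiens (Human).
CC FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC SIMILARITY: BELONGS TO THE HMBS FAMILY.
KW Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW Alternative splicing; Disease mutation.
"""

# A comment topic is assumed to look like "TOPIC NAME: text".
TOPIC = re.compile(r"^([A-Z][A-Z ]+): (.*)$")


def parse_entry(text):
    """Group values by line code; split CC lines into topic -> text."""
    fields = defaultdict(list)
    comments = {}
    topic = None
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        if len(code) != 2:
            continue  # skip anything that is not a coded line
        value = value.strip()
        if code == "CC":
            match = TOPIC.match(value)
            if match:
                topic = match.group(1)
                comments[topic] = match.group(2)
            elif topic is not None:
                comments[topic] += " " + value  # continuation line
        else:
            fields[code].append(value)
    return dict(fields), comments


def keywords(fields):
    """Flatten the KW lines into a list of individual keywords."""
    joined = " ".join(fields.get("KW", []))
    return [k.strip() for k in joined.rstrip(".").split(";") if k.strip()]


fields, comments = parse_entry(ENTRY)
print(comments["FUNCTION"])
print(keywords(fields))
```

Keeping the CC topics separate from the other fields mirrors the slide's split between core data and annotation.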

Annotation in Motif databases: INTERPRO

http://interpro.ebi.ac.uk/servlet/IEntry?ac=IPR000860

Genome databases

Some deal with single organisms (e.g. SubtiList for B. subtilis; Sanger Centre M. tuberculosis)

Some deal with multiple genomes (e.g. TIGR microbial genomes database)

The level of annotation can be extensive

Many are much more than sequence repositories, extending the sequence with a wealth of information (e.g. mutants, strains, complementation plasmids)

If you are working with a model organism, chances of obtaining reliable functional annotations are improved

Genome database: YPD

http://www.proteome.com/databases/YPD/reports/HEM3.html

Function assignment by homology I

If you just have a sequence:
The most common bioinformatics procedure
Search your protein of interest against primary databases; chances are that if you find a homologue with high identity, it performs a similar function
Many, many tools (BLAST, FASTA, Smith-Waterman search)

Beware of annotation by homology:
The relationship between sequence similarity and function is not straightforward
Danger of propagating incorrect functional information
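To make the two cautions concrete, here is a small sketch of a guarded annotation transfer. It assumes the two sequences have already been aligned by a search tool, with `-` marking gaps. The `Annotation` class, the `transfer` function and the 40% default cutoff are hypothetical choices for illustration, not recommended values.

```python
from dataclasses import dataclass
from typing import Optional


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


@dataclass
class Annotation:
    function: str
    evidence: str  # "experimental" or "homology"
    source: str    # where the annotation came from


def transfer(hit: Annotation, hit_id: str, identity: float,
             min_identity: float = 40.0) -> Optional[Annotation]:
    """Copy a function from a database hit only if the hit is similar enough
    AND its own annotation is experimental, so that homology-based guesses
    are not propagated from one sequence to the next."""
    if identity < min_identity or hit.evidence != "experimental":
        return None
    return Annotation(hit.function, "homology", hit_id)


identity = percent_identity("MKT-AYIAK", "MKTQAYLAK")  # toy alignment
print(identity)
```

Recording the evidence type alongside the function is what lets a later user tell an experimental assignment from an inherited one.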

Function assignment by homology II

Consider databases which distinguish experimental function assignments from homology based ones (e.g. YPD/WormPD, EcoCyc)

Or use databases which employ more rigorous automated annotation tools (e.g. HAMAP @ SwissProt)

“Among the peculiarities recognized by the programs are: size discrepancy, absence or mutation of regions involved in activity or binding (to metals, nucleotides, etc), presence of paralogs, contradiction with the biological context (i.e. if a protein belongs to a pathway supposed to be absent in a particular organism), etc. Such "problematic" proteins will not be automatically annotated.”
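Two of the peculiarities listed in the quote, size discrepancy and absence or mutation of key residues, can be illustrated with a few lines of code. This is only a sketch of the idea and not HAMAP's actual implementation; the function name, the 20% length tolerance and the use of raw sequence positions (rather than alignment positions) are all assumptions.

```python
def peculiarities(seq, expected_length, required_residues, tolerance=0.2):
    """Return a list of reasons not to annotate `seq` automatically.

    required_residues maps a 1-based position to the amino acid expected
    there (e.g. a residue involved in activity or binding).
    """
    flags = []
    # Size discrepancy: length differs too much from the family's typical length.
    if abs(len(seq) - expected_length) > tolerance * expected_length:
        flags.append("size discrepancy")
    # Absence or mutation of residues involved in activity or binding.
    for pos, aa in sorted(required_residues.items()):
        if pos > len(seq) or seq[pos - 1] != aa:
            flags.append(f"missing or mutated residue {aa}{pos}")
    return flags


print(peculiarities("MKTCAH", 6, {4: "C"}))  # no flags: annotate automatically
print(peculiarities("MKTAAH", 6, {4: "C"}))  # flagged: leave for manual review
```

A protein that returns any flag would be left out of automatic annotation, which is the behaviour the quote describes.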

Genome database: YPD

http://www.proteome.com/databases/YPD/reports/HEM3.html

Functional assignment “without homology”

Novel functional assignment methods now exist which don’t make use of ‘direct’ homology searches

They exploit other relationships between proteins which are used as indicators of shared function:
Phylogenetic profiles
“Rosetta stone genes”

Phylogenetic profiles

Pellegrini M et al., “Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.” PNAS (1999) 96(8):4285-8
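The idea behind phylogenetic profiles can be sketched briefly: each protein gets a presence/absence vector across a set of genomes, and proteins with matching vectors are proposed to be functionally linked. The genome names, protein names and profiles below are made up for illustration, and the mismatch threshold is an assumption.

```python
# One column per genome: 1 = a homologue is present, 0 = absent (toy data).
GENOMES = ["genome1", "genome2", "genome3", "genome4"]
PROFILES = {
    "P1": (1, 0, 1, 1),
    "P2": (1, 0, 1, 1),
    "P3": (0, 1, 1, 0),
    "P4": (1, 0, 1, 0),
}


def hamming(a, b):
    """Number of genomes in which two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))


def linked_pairs(profiles, max_mismatches=0):
    """Pairs of proteins whose profiles differ in at most max_mismatches genomes."""
    names = sorted(profiles)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if hamming(profiles[a], profiles[b]) <= max_mismatches
    ]


print(linked_pairs(PROFILES))                    # identical profiles only
print(linked_pairs(PROFILES, max_mismatches=1))  # allow one disagreement
```

Note that no sequence similarity between the linked proteins is required; only their pattern of occurrence across genomes is compared.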

Rosetta Stone method
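In outline, the method looks for two separate proteins in one genome that both match different parts of a single fused protein in another genome; such a pair is proposed to be functionally linked. The sketch below assumes similarity hits are already available as (query, composite protein, start, end) tuples; all names and coordinates are hypothetical.

```python
from itertools import combinations

# (query protein, composite protein in another genome, start, end); toy hits.
HITS = [
    ("geneA", "fusedX", 1, 200),
    ("geneB", "fusedX", 230, 480),
    ("geneC", "fusedX", 150, 300),
    ("geneD", "fusedY", 10, 120),
]


def overlaps(h1, h2):
    """True if the two hits cover overlapping regions of the composite protein."""
    return h1[2] <= h2[3] and h2[2] <= h1[3]


def rosetta_links(hits):
    """Pairs of distinct queries matching non-overlapping regions of one composite."""
    by_composite = {}
    for hit in hits:
        by_composite.setdefault(hit[1], []).append(hit)
    links = set()
    for composite, group in by_composite.items():
        for h1, h2 in combinations(group, 2):
            if h1[0] != h2[0] and not overlaps(h1, h2):
                first, second = sorted((h1[0], h2[0]))
                links.add((first, second, composite))
    return sorted(links)


print(rosetta_links(HITS))
```

Requiring non-overlapping regions keeps the sketch from linking two proteins that are simply homologous to the same domain.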

More methods…

Marcotte EM, et al., Nature (1999) 402:83-86
Enright AJ, et al., Nature (1999) 402:86-90

Some access over the WWW, but experimental and only for certain organisms (Yeast, E. coli, M. tuberculosis); many proprietary methods

Considered one of the most promising solutions for preliminary annotation of “unknown function” proteins in genome sequencing projects

Collecting information for many genes

Usually for “large-scale biology” (e.g. micro-array experiments)

Genome Databases
Functional classification schemes

Genome Databases

Genome sequencing projects are now the primary driving force for extensive functional annotation

We have the genes (ORFs), we want the functions

FUNCTIONAL GENOMICS

(… more ’omes)

Functional classification schemes I

Dealing with large sets of genes requires functional classification schemes

Tentative schemes as early as 1983; use driven by genome sequencing projects

First extensive scheme published in 1993 by Monica Riley [regularly updated (GenProtEC; EcoCyc)]

The majority of current schemes are heavily influenced by the ‘Riley scheme’

‘2nd generation’ schemes are now being developed

Functional classification schemes II

Most schemes can be thought of as trees
Progression along the tree (root to leaves) represents increasingly specific functions
ORFs are generally associated with leaf nodes (but of course, they are also associated with intermediary nodes)

Examples of use:
create gene sets linked by functionality (e.g. to detect functional motifs)
validate a functional connection between genes (e.g. gene expression studies)

An example scheme…

[Figure: part of the GenProtEC classification tree. Node labels shown: Metabolism of small molecules; Amino Acids; Central Intermediary Metabolism; Energy Metabolism; Aerobic respiration; Fermentation; Glycolysis; Alanine; Amino sugars; etc. ORF counts attached to nodes in the figure: 900, 112, 32, 22, 18, 8 and 2 ORFs.]
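The tree view of a classification scheme can be sketched with nested dictionaries, with ORF lists at the leaves. The category names follow the example scheme, but the ORF identifiers and their placement are invented for illustration and do not reproduce the counts in the figure.

```python
# Inner nodes are dicts, leaves are lists of ORF identifiers (hypothetical).
SCHEME = {
    "Metabolism of small molecules": {
        "Amino acids": {"Alanine": ["orf1", "orf2"]},
        "Energy metabolism": {
            "Glycolysis": ["orf3"],
            "Fermentation": ["orf4", "orf5"],
        },
    }
}


def count_orfs(node):
    """Number of ORFs at or below a node (a leaf list or a subtree dict)."""
    if isinstance(node, list):
        return len(node)
    return sum(count_orfs(child) for child in node.values())


def paths(node, prefix=()):
    """Yield (root-to-leaf path, ORF) for every ORF in the tree."""
    if isinstance(node, list):
        for orf in node:
            yield prefix, orf
    else:
        for name, child in node.items():
            yield from paths(child, prefix + (name,))


energy = SCHEME["Metabolism of small molecules"]["Energy metabolism"]
print(count_orfs(SCHEME), count_orfs(energy))
for path, orf in paths(SCHEME):
    print(orf, " > ".join(path))
```

Counting at an intermediary node shows how an ORF assigned to a leaf is implicitly associated with every category above it.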

Issues

Functions:
Apples and Oranges
Multi-dimensionality
Multi-functionality

Issues: Apples and Oranges

Function is an umbrella, catch-all term
Schemes do not distinguish between aspects of function
Most commonly they mix gene product type (T), activity (A) and cellular role (R)

Cell division (R) : DNA replication (A)

Osmotic adaptation (R) : Ion channel (T,A)

Issues - Multi-dimensionality I

Human trypsin functions:
Biochemical: peptide bond hydrolysis
Molecular: proteolytic enzyme
Cellular: protein degradation
Physiological: digestion

Could conceive a number of other dimensions:
Cellular location
Regulation
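One way to picture multi-dimensionality is to store each dimension in its own field rather than in a single "function" string. The record below uses the trypsin values from the slide; the second record and the helper `sharing` are hypothetical, added only to show a query along one dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionRecord:
    gene_product: str
    biochemical: str
    molecular: str
    cellular: str
    physiological: str


TRYPSIN = FunctionRecord(
    gene_product="Human trypsin",
    biochemical="peptide bond hydrolysis",
    molecular="proteolytic enzyme",
    cellular="protein degradation",
    physiological="digestion",
)

# A made-up second protease that differs only at the physiological level.
OTHER = FunctionRecord(
    gene_product="hypothetical protease X",
    biochemical="peptide bond hydrolysis",
    molecular="proteolytic enzyme",
    cellular="protein degradation",
    physiological="some other physiological role",
)


def sharing(records, dimension, value):
    """Gene products whose annotation in one dimension equals `value`."""
    return [r.gene_product for r in records if getattr(r, dimension) == value]


print(sharing([TRYPSIN, OTHER], "biochemical", "peptide bond hydrolysis"))
print(sharing([TRYPSIN, OTHER], "physiological", "digestion"))
```

Two products can then be grouped together at the biochemical level and kept apart at the physiological level, which a single mixed label cannot express.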

Issues - Multi-dimensionality II

Why differentiate function and process?

[Figure: cell cycle-dependent yeast gene expression clusters (Pat Brown lab, Stanford)]

Issues - Multi-functionality

Inherent: e.g. lac repressor; carbohydrate metabolism and osmoprotection

Multi-subunit: e.g. succinic dehydrogenase; whole - enzyme in TCA; subunit 1 - electron transport chain; subunit 2 - cell structure

Circumstantial: e.g. acetate kinase; acetate only environment - acetate metabolism; acetate absent - fermentation enzyme

Gene Ontology - a collaboration

Drosophila (fruit fly) - FlyBase
Saccharomyces Genome Database (SGD)
Mus (mouse) - Mouse Genome Database (MGD)

Gene Ontology - the next generation

Multi-dimensional:
functional primitive: “a capability that a physical gene product (or gene product group) carries as a potential” (e.g. transporter or adenylate cyclase)
process: “a biological objective accomplished via one or more ordered assemblies of functions” (e.g. cell growth and maintenance or purine metabolism)
cellular component

Extensive: depth 11; nearly 4000 terms
More complex organisation: away from tree structure
Theoretically applicable to all species (designed for multicellular eukaryotes)
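Moving "away from tree structure" means a term may have more than one parent, so the scheme becomes a directed acyclic graph. The sketch below shows the consequence for annotation: a gene product attached to one term is implicitly attached to all of that term's ancestors along every path. Apart from "purine metabolism", which the slide mentions, the term names and relationships are illustrative and not taken from the Gene Ontology.

```python
# term -> list of parent terms (toy graph; a term may have several parents).
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "nucleotide metabolism": ["metabolism"],
    "heterocycle metabolism": ["metabolism"],
    "purine metabolism": ["nucleotide metabolism", "heterocycle metabolism"],
}


def ancestors(term, parents):
    """All terms reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("purine metabolism", PARENTS)))
```

In a strict tree the same lookup would follow a single path to the root; here both parent branches are collected.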

Gene Ontology - Process

Gene Ontology - current status

http://www.geneontology.org/

Where to look for functional information - single protein

With 1 or a few genes:
Primary databases (e.g. SwissProt)
Model organism databases (e.g. GenProtEC; SGD; WormPD)
Metabolic/Pathway databases (e.g. KEGG)
Value-added databases (e.g. Motif databases; Disease databases)

By homology

Not by homology

Where to look for functional information - protein sets

Need some sort of functional classification scheme:
Tree-like schemes (e.g. TIGR, GenProtEC)
Gene Ontology (FlyBase, MGD, SGD)

For comparative genomics, need schemes applied to multiple organisms (e.g. PEDANT, TIGR)

Currently, greatest genome coverage is by PEDANT (but non-manually curated)

Conclusions

Functional information is available but it is rarely centralised

Function is a very broad term; it is hard to know whether the information you need will be available at the level you need it

New schemes (e.g. GO) are emerging which try to cope with functional annotation better

And new automated functional annotation tools are emerging (‘intelligent systems’; non-homology based)

You still need to validate predictions experimentally

A survey of (some) current schemes

1) EcoCyc/GenProtEC: E. coli scheme (Riley scheme, MBL)
2) SubtiList: Bacillus subtilis scheme (Institut Pasteur)
3) MIPS/PEDANT: yeast scheme (applied to other organisms in PEDANT) (Munich Information Center for Protein Sequences)
4) TIGR: microbial genomes scheme (The Institute for Genomic Research)
5) KEGG: multi-organism scheme (metabolic and regulatory pathways) (Kyoto Encyclopedia of Genes and Genomes)
6) WIT: multi-organism scheme (metabolic reconstruction) (What Is There; ANL)
7) Gene Ontology: a 2nd generation functional classification scheme (EBI; FlyBase; MGD; SGD)

FuncWheel for the Combination Scheme

Conclusions - Scheme comparison I

Similar in the coverage of function (although ‘granularity’ varies widely)

...yet different enough that direct comparison is complex

Essentially deal with unicellular microbial organisms (MIPS is tackling this)

Certain ‘niche’ schemes (e.g. WIT/KEGG)
...or user community tailored schemes (e.g. SubtiList)

WWW sites I

Primary databases (Sequence):
SwissProt: http://www.expasy.ch/sprot
PIR: http://www-nbrf.Georgetown.edu/
NCBI databases: http://www.ncbi.nlm.nih.gov/Database/index.html

Primary databases (Structure):
Protein Data Bank: http://www.rcsb.org/
Macromolecular Structure Database: http://msd.ebi.ac.uk/

Value added:
INTERPRO: http://interpro.ebi.ac.uk/

WWW sites II

Single genome databases:
SubtiList: http://genolist.pasteur.fr/SubtiList/
Saccharomyces Genome Database: http://genomewww.stanford.edu/Saccharomyces/
EcoCyc: http://ecocyc.pangeasystems.com/
GenProtEC: http://genprotec.mdbl.edu/
FlyBase: http://flybase.bio.indiana.edu/
Mouse Genome Database (MGD): http://www.informatics.jax.org/
Yeast Protein Database (YPD) and WormPD: http://www.proteome.com/

WWW sites III

Multiple genome databases:
The Institute for Genomic Research: http://www.tigr.org/microbialdb
MIPS/PEDANT: http://pedant.mips.biochem.mpg.de/
HAMAP: http://www.expasy.ch/sprot/hamap/

Pathway databases:
KEGG: http://www.genome.ad.jp/kegg/
WIT: http://igweb.integratedgenomics.com/IGwit/

Non-homology based function prediction:
Mycobacterium tuberculosis: http://www.doe-mbi.ucla.edu/people/sergio/TB/tb.html
Yeast: http://www.doe-mbi.ucla.edu/people/marcotte/yeast.html

A relevant paper: http://www.biochem.ucl.ac.uk/~rison/Publications/index.html
