MGM workshop. 19 Oct 2010 Functional annotation Datasources Konstantinos Mavrommatis Head of Omics...

MGM workshop. 19 Oct 2010 Functional annotation Datasources Konstantinos Mavrommatis Head of Omics group DOE-JGI Kmavrommatis@lbl.gov


Page 1:

MGM workshop. 19 Oct 2010

Functional annotation
Datasources

Konstantinos Mavrommatis
Head of Omics group

DOE-JGI

Kmavrommatis@lbl.gov

Page 2:

Outline

Genome annotation (Functional)

How do we know it is correct?

How do we do it?
Data collections
Protein families
Pathway collections

Page 3:

Genome annotation: The process of identifying the locations and functions of coding sequences.

cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)

molecular/enzymatic (methyltransferase)

Reaction (methylation)

Substrate (cobalt-precorrin-4)

Ligand (S-adenosyl-L-methionine)

metabolic (cobalamin biosynthesis)

physiological (maintenance of healthy nerve and red blood cells, through B12).

Page 4:

Functional annotation helps make sense out of nonsense

But it only directs us to the potential of the organism

Page 5:

Function prediction is mainly based on homology detection

Homology implies a common evolutionary origin, not retention of similarity in any of their properties.

Homology ≠ similarity of function.

Function transfer by homology

Conservative amino acid substitution

Low complexity region

Gap (insertion or deletion)

Page 6:

Function transfer based on homology is error prone

Punta & Ofran. PLOS Comp Biol. 2008

Page 7:

Limits in transfer of annotation based on homology

Punta & Ofran. PLOS Comp Biol. 2008

Page 8:

If no similarity is detected use alternative methods to predict function

Subcellular localization

Gene context

Special sequence motifs and features

[Figure labels: Cytoplasm, S ~ S, Periplasm]

Page 9:

Annotation should make sense in the context of the cell metabolism

[Figure: a model pathway (Substrates A, B, C, D; Enzymes 1, 2, 3) compared with the genome annotation, in which the enzymes are marked "?" or "✓"]

Page 10:

Annotation should make sense.

Missing genes may be present.
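The pathway-context check on these slides can be sketched in a few lines of Python. This is only an illustration: the function name `pathway_report`, the enzyme labels and the annotated set are hypothetical, and a real check would use EC numbers or KO terms from a pathway collection rather than plain strings.

```python
def pathway_report(pathway_enzymes, annotated_functions):
    """Compare the enzymes a pathway needs with the functions annotated in a genome.

    pathway_enzymes: ordered list of enzyme labels required by the pathway.
    annotated_functions: set of enzyme labels found by annotation.
    """
    present = [e for e in pathway_enzymes if e in annotated_functions]
    missing = [e for e in pathway_enzymes if e not in annotated_functions]
    completeness = len(present) / len(pathway_enzymes) if pathway_enzymes else 0.0
    return {"present": present, "missing": missing, "completeness": completeness}


# Hypothetical three-step pathway in which only two enzymes were annotated.
report = pathway_report(
    ["Enzyme 1", "Enzyme 2", "Enzyme 3"],
    {"Enzyme 1", "Enzyme 3"},
)
print(report["missing"])
```

A step reported as missing in an otherwise complete pathway is a candidate for a second look (a gene that was not called, or a homolog below the detection threshold) rather than proof that the function is absent.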

Page 11:

Helps prediction

Is error prone.

Has to make sense.

Genome annotation: The process of identifying the locations and functions of coding sequences.

Page 12:

There are multiple datasources to help organize information and facilitate annotation

Sequence databases

Protein classification databases

Specialized databases

Page 13:

Primary databases store raw information from various sources

EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)

Archive containing all sequences from all sources

GenBank/UniProt contain translations of sequences.

Year    Base pairs        Sequences
2004    44,575,745,176    40,604,319
2005    56,037,734,462    52,016,762
2006    69,019,290,705    64,893,747
2007    83,874,179,730    80,388,382
2008    99,116,431,942    98,868,465

Page 14:

Primary databases accumulate errors in sequences and annotations

In the sequences themselves:
Sequencing errors.
Cloning vector sequences.

In the annotations:
Inaccuracies, omissions, and even mistakes.
Inconsistencies between some fields.

Redundancy.

Page 15:

IMG is using RefSeq as its primary source

[Figure: short sequence fragments with the labels GenBank, UniGene, RefSeq, Genome Assembly, Labs, Curators, Algorithms]

Page 16:

Protein families use different methods to classify proteins

COG/KOG
Pfam
TIGRfam
KEGG Orthologs
InterPro

Page 17:

What are COGs/KOGs? How much can I trust them?

Reciprocal best hit (bidirectional best hit)

BLAST best hit (unidirectional best hit)

COG1, COG2

>gnl|COG|2723 COG2723, BglB, Beta-glucosidase/6-phospho-beta-glucosidase/beta-galactosidase [Carbohydrate transport and metabolism]. Length = 460

Score = 388 bits (998), Expect = e-132
Identities = 176/503 (34%), Positives = 251/503 (49%), Gaps = 75/503 (14%)

Query: 4   SFPKSFRFGWSQAGFQSEMGTPGSEDPNTDWYVWVHDPENIASGLVSGDLPEHGPGYWGL 63
           FPK F +G + A FQ E +DW VWVHD I LVSGD PE ++
Sbjct: 3   KFPKDFLWGGATAAFQVEGAWNEDGKGPSDWDVWVHDE--IPGRLVSGDPPEEASDFYHR 60

Query: 64  YRMFHDNAVKMGLDIARINVEWSRIFPKPMPDPPQGNVEVKGNDVLAVHVDENDLKRLDE 123
           Y+ A +MGL+ R ++EWSRIFP
Sbjct: 61  YKEDIALAKEMGLNAFRTSIEWSRIFPNGDGGEV-------------------------- 94

Query: 124 AANQEAVRHYREIFSDLKARGIHFILNFYHWPLPLWVHDPIRVRKGDLSGPTGWLDVKTV 183
           N++ +R Y +F +LKARGI + YH+ LPLW+ P GW + +TV
Sbjct: 95  --NEKGLRFYDRLFDELKARGIEPFVTLYHFDLPLWLQKPYG----------GWENRETV 142

Query: 184 INFARFAAYTAWKFDDLADEYSTMNEPNVVHSNGYMWVKSGFPPSYLNFELSRRVMVNLI 243
           FAR+AA +F D + T NEPNVV GY+ G PP ++ + + +V +++
Sbjct: 143 DAFARYAATVFERFGDKVKYWFTFNEPNVVVELGYL--YGGHPPGIVDPKAAYQVAHHML 200

Query: 244 QAHARAYDAVKAISKK-PIGIIYANSSFTPLTDK--DAKAVELAEYDSRWIFFDAIIKGE 300
           AHA A A+K I+ K +GII + PL+DK D KA E A+ F DA +KGE
Sbjct: 201 LAHALAVKAIKKINPKGKVGIILNLTPAYPLSDKPEDVKAAENADRFHNRFFLDAQVKGE 260

Query: 301 --------------LMGVTRDDL----KGRLDWIGVNYYSRTVVKLIGEKSYVSIPGYGY 342
           L + DL + +D+IG+NYY+ + VK + GYG
Sbjct: 261 YPEYLEKELEENGILPEIEDGDLEILKENTVDFIGLNYYTPSRVK---AAEPRYVSGYGP 317

(The match line between each Query/Sbjct pair lost its column spacing in extraction and is shown unaligned.)
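The difference between a unidirectional best hit and a bidirectional (reciprocal) best hit can be sketched as follows. This is a minimal illustration in Python, not the COG construction procedure itself: the function names, the gene identifiers and the bit scores are all hypothetical, and the input is assumed to be two lists of `(query, subject, bit_score)` tuples from searching genome A against genome B and back.

```python
def best_hits(hits):
    """Return {query: subject} keeping the highest-scoring subject per query."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical search results: (query, subject, bit score).
a_vs_b = [("a1", "b1", 388.0), ("a1", "b2", 120.0), ("a2", "b2", 95.0), ("a3", "b1", 60.0)]
b_vs_a = [("b1", "a1", 380.0), ("b2", "a2", 90.0), ("b2", "a1", 110.0)]

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data `a2` and `a3` each have a best hit in genome B, but the reverse search does not point back to them, so they are only unidirectional best hits and carry weaker evidence for shared function.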

Page 18:

Pfam families are based on the detection of domains

http://pfam.sanger.ac.uk

HMMs built from protein alignments: local (covering a domain) or global (covering the whole protein)

Page 19:

TIGRfam

Full length alignments.

Domain alignments.

Equivalogs: families of proteins with specific function.

Superfamilies: families of homologous genes.

HMMs

http://www.tigr.org/TIGRFAMs/

Page 20:

How can we search Pfam and TIGRfam?

[Figure label: Hits to other models]

Query:       BChl_A  [M=357]
Accession:   PF02327.12
Description: Bacteriochlorophyll A protein
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence                Description
    ------- ------ -----    ------- ------ -----   ---- --  --------                -----------
    0.00014   11.2   0.0    0.00024   10.5   0.0    1.2  1  tr|E0STV9|E0STV9_IGNAA  Glycoside hydrolase family 1

Domain annotation for each sequence (and alignments):
>> tr|E0STV9|E0STV9_IGNAA  Glycoside hydrolase family 1 OS=Ignisphaera aggregans (strain DSM)
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   10.5   0.0   1.1e-05   0.00024     217     273 ..     255     307 ..     240     321 .. 0.84

  Alignments for each domain:
  == domain 1  score: 10.5 bits;  conditional E-value: 1.1e-05
                  BChl_A 217 fshagsgvvdsisrwaelfpveklnkpasveagfrsdsqgievkvdgelpgvsvdag 273
                             fs+ g+v+si+ w l ++ + e gfr + iev v+g l v +d
  tr|E0STV9|E0STV9_IGNAA 255 FSKKPIGIVESIASWIPLREGDR----EAAEKGFRYNLWPIEVAVNGYLDDVYRDDL 307
                             899999*********98877765....3569*********************99864 PP

• GA Gathering method: Search threshold to build the full alignment.
• TC Trusted Cutoff: Lowest sequence score and domain score of match in the full alignment.
• NC Noise Cutoff: Highest sequence score and domain score of match not in full alignment.

Noise cutoff

Gathering cutoff

Trusted cutoff
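The three cutoffs can be read as bands on the bit-score axis. The sketch below, in Python, classifies a hit against one model's cutoffs; the class name, the band labels and the cutoff values are hypothetical (each real model stores its own GA, TC and NC values), and only the 10.5 bit score comes from the search output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCutoffs:
    """Per-model bit-score thresholds, with noise < gathering <= trusted."""
    noise: float      # NC: highest score of a match not in the full alignment
    gathering: float  # GA: threshold used to build the full alignment
    trusted: float    # TC: lowest score of a match in the full alignment


def classify_hit(bit_score, cutoffs):
    """Place a bit score into one of four bands relative to the cutoffs."""
    if bit_score >= cutoffs.trusted:
        return "at or above trusted cutoff"
    if bit_score >= cutoffs.gathering:
        return "at or above gathering cutoff"
    if bit_score > cutoffs.noise:
        return "between noise and gathering cutoffs"
    return "at or below noise cutoff"


# Hypothetical cutoff values, chosen only for illustration.
example_cutoffs = ModelCutoffs(noise=24.0, gathering=25.0, trusted=25.5)

# 10.5 bits is the best-domain score reported in the search output above.
label = classify_hit(10.5, example_cutoffs)
print(label)
```

With cutoffs in this range, the hit shown above would fall in the noise band despite its small E-value, which is why per-model cutoffs are usually preferred over a single E-value threshold. HMMER's `hmmsearch` can apply a model's stored thresholds directly through its `--cut_ga`, `--cut_tc` and `--cut_nc` options.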

Page 21:

InterPro. Composite pattern databases

To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource – InterPro

Release 30.0 (Dec10) contains 21178 entries

Central annotation resource, with pointers to its satellite dbs

http://www.ebi.ac.uk/interpro/

Page 22:

KEGG orthology

Xizeng Mao et al. Bioinformatics 21 (2005) 3787-3793

E-value < 10^-5
rank ≤ 5
≥ 70% of query length
≥ 30% identity
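The four thresholds above can be read as a filter on similarity-search hits. A minimal sketch in Python follows, assuming all four criteria must hold at once and that "query length" means the fraction of the query covered by the alignment; the `Hit` record, its field names and the example values are hypothetical and are not taken from the cited paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    ko_term: str            # KO term carried by the database sequence
    evalue: float           # E-value of the hit
    rank: int               # 1-based position of the hit in the result list
    query_coverage: float   # aligned fraction of the query, 0.0 to 1.0
    percent_identity: float


def passes_thresholds(hit):
    """Apply the slide's thresholds: E-value, rank, query coverage, identity."""
    return (
        hit.evalue < 1e-5
        and hit.rank <= 5
        and hit.query_coverage >= 0.70
        and hit.percent_identity >= 30.0
    )


# Hypothetical hits for one query protein.
hits = [
    Hit("KO_A", 1e-132, 1, 0.92, 34.0),  # passes every criterion
    Hit("KO_B", 1e-20, 2, 0.45, 41.0),   # covers too little of the query
    Hit("KO_C", 1e-3, 3, 0.80, 31.0),    # E-value too high
    Hit("KO_D", 1e-30, 7, 0.90, 55.0),   # ranked below the top five
]

accepted = [hit.ko_term for hit in hits if passes_thresholds(hit)]
print(accepted)
```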

Page 23:

ENZYME

Page 24:

Pathway collections: KEGG

Contains information about biochemical pathways, and protein interactions.

http://www.kegg.com

Page 25:

Pathway collections: MetaCyc

Page 26:

Functional annotation

http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf

Page 27:

RNA structural and functional annotation are coupled

SILVA alignments of rRNAs are used to generate models

Covariance models for each RNA class are used to predict genes

Page 28:

There is a plethora of specialized databases that one needs to search

http://www.oxfordjournals.org/nar/database/c

Page 29:

In most cases databases are interconnected but …

SWISS-PROT

ENZYME

PDB

HSSP

SWISSNEW

YPDREF

YPD

PDBFINDER

ALI

DSSP

FSSP

NRL_3D

PMD

PIR

ProtFam

FlyGene

TFSITE

TFACTOR

EMBL

TrEMBL

ECDC

TrEMBLNEW

EMNEW

EPD

GenBank MOLPROBE

OMIM

MIMMAP

REBASE

PROSITE ProDom

PROSITEDOC

Blocks

SWISSDOM

… not all databases are updated regularly.

Changes of annotation in one database are not reflected in others

Page 30:

There are multiple datasources to help organize information and facilitate annotation

Sequence databases
Contain sequences deposited by various sources

Protein classification databases
Utilize sequence homology or other criteria to group together proteins
COG, Pfam, TIGRfam, InterPro, KO terms

Specialized databases
Start by searching for available resources

Page 31:

Question?

Genome annotation (Functional)

How do we know it is correct?

How do we do it?
Data collections
Protein families
Pathway collections