Automated Annotation of Microbial Genomes, Opportunities and Pitfalls

Post on 31-Dec-2015



Automated Annotation of Microbial Genomes, Opportunities and Pitfalls

Margie Romine

Pacific Northwest National Laboratory

Richland, Washington

Shewanella oneidensis MR-1

• Breathes Mn & Fe and other metals thereby changing their solubility

• Also reduces radionuclides and hence impacts their mobility at contaminated sites

• Genome sequenced by The Institute for Genomic Research (TIGR) in 2002 (funded by DOE-OBER)

• Can we now better determine how this organism interacts with metals and radionuclides?

Shewanella spp. Inhabit Many Niches

• Energy rich: fermentation is occurring and energy is continuously being deposited via sedimentation

• Rapidly changing redox conditions/dominant electron acceptors

• Microbial partners are present to remove the acetate produced via anaerobic respiration.

2 more were sequenced by DOE’s Joint Genome Institute and 14 more are under way!

Bacterial Genome Sequencing Explodes

• 341 completed genomes, 976 ongoing

• Partial genome sequences released in just days now by JGI!

• How do we use sequence information to understand how all these organisms function in the environment?

• Annotation is the key, but is now largely automated and hence of lower quality

What is Annotation?

• Locate genes

AGCTTAACTGGGATACGACGACCAGTAGACAGGTRTACGATGAGATATATAT

• Translate to proteins

MASDLKKIYTRPRPDSAWQECVAALFDGHSKDKLACNDDL

• Gather evidence of function

• Assign putative functions
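The "Translate to proteins" step can be illustrated with a minimal sketch. It assumes the standard genetic code and a forward-strand reading frame that starts at the first ATG. The function names `translate` and `first_atg` are illustrative and are not taken from any annotation pipeline mentioned in these slides. Codons containing ambiguity codes (such as the R in the example DNA above) are reported as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific exceptions are needed for this illustration).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon.

    Codons that are not in the table (e.g. containing R, Y, N) become 'X'.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def first_atg(dna):
    """Return the index of the first ATG on the forward strand, or -1."""
    return dna.upper().find("ATG")


if __name__ == "__main__":
    dna = "AGCTTAACTGGGATACGACGACCAGTAGACAGGTRTACGATGAGATATATAT"
    start = first_atg(dna)
    if start >= 0:
        print(start, translate(dna[start:]))
```

Real gene finders score all six reading frames and alternative start codons; picking the first ATG is only a simplification, and it is one way the incorrect start sites discussed later in the talk can arise.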

Annotation Drives Post-genomic Research

[Diagram]

• Predictions: gene predictions, protein predictions, function predictions

• Methodologies: DNA microarrays, proteomics, targeted gene knock-outs, ChIP-chip

• Data: mRNA expression, protein expression, DNA binding sites

• Interpretation: metabolic modeling, hypothesis

Annotation with Gnare/Puma2

• Developed at Argonne National Laboratory by Natalia Maltsev, Mark D’Souza, Elizabeth Glass, Dina Sulakhe, Mustafa Syed, Pavan Anumula

• http://compbio.mcs.anl.gov/puma2/cgi-bin/index.cgi

• Gnare – Private genome sequences

• Puma2 – Public genome sequences

Types of Functional Descriptors

• Hypothetical protein

• Conserved hypothetical protein

• Conserved domain protein

• Function associated protein

• Class specific enzyme

• Specific function predicted

• Function validated

Go to Puma page for homolog

Checking Functions Where No Domain Hit Occurs

type IV secretion outer membrane protein, PilW?

Shewanella oneidensis MR-1

MKNCQKG

Domain identified

Align proteins

This is a family of hypothetical proteins. A number of the sequence records state they are transmembrane proteins or putative permeases. It is not clear what source suggested that these proteins might be permeases and this information should be treated with caution.

2.A.86 The Autoinducer-2 Exporter (AI-2E) Family

The AI-2E family (UPF0118) is a large family of prokaryotic proteins derived from a variety of bacteria and archaea. Those examined are about 350 residues in length, and the couple that have been examined exhibit 7 putative transmembrane α-helical spanners (TMSs). E. coli, B. subtilis and several other prokaryotes have multiple paralogues encoded within their genomes. Herzberg et al. (2006) have presented strong evidence for a role of an AI-2E family homologue, YdgG (renamed TqsA), as an exporter of the E. coli autoinducer-2 (AI-2) (Camilli and Bassler, 2006; Chen et al., 2002). AI-2 is a proposed signalling molecule for interspecies communication in bacteria. It is a furanosyl borate diester (Chen et al., 2002).

autoinducer-2 transport protein, TqsA

Clues in InterPro Domain Descriptor

No functional clues

Using Genome Context to Predict Function

Clusters with N-acetylglucosamine catabolic enzymes

Missing enzyme

Hypothesis experimentally validated!
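The genome-context reasoning above (an uncharacterized gene that clusters with enzymes of a known pathway is a candidate for the pathway's missing step) can be sketched roughly as follows. This is not how Gnare/Puma2 implements it. The function name `context_candidates`, the `window` parameter, and the gene list in the usage example are all hypothetical.

```python
def context_candidates(genes, pathway_annotations, window=2,
                       unknown_label="hypothetical protein"):
    """Rank uncharacterized genes by how many pathway genes sit nearby.

    genes: list of (locus_tag, annotation) tuples in chromosomal order.
    pathway_annotations: set of annotation strings belonging to one pathway.
    window: number of neighbouring genes inspected on each side.
    Returns (locus_tag, support) pairs, best supported first.
    """
    hits = []
    for i, (tag, annotation) in enumerate(genes):
        if annotation != unknown_label:
            continue
        lo = max(0, i - window)
        hi = min(len(genes), i + window + 1)
        # Count neighbours (excluding the gene itself) annotated to the pathway.
        support = sum(
            1
            for j in range(lo, hi)
            if j != i and genes[j][1] in pathway_annotations
        )
        if support:
            hits.append((tag, support))
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    # Made-up locus tags and labels, for illustration only.
    toy_genes = [
        ("gene_1", "pathway enzyme A"),
        ("gene_2", "hypothetical protein"),
        ("gene_3", "pathway enzyme B"),
        ("gene_4", "ribosomal protein"),
    ]
    print(context_candidates(toy_genes, {"pathway enzyme A", "pathway enzyme B"}))
```

A candidate found this way is still only a hypothesis; as the slide notes, it takes an experiment to validate it.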

Precomputed text mining

General enzyme function

sulfite dehydrogenase catalytic molybdopterin subunit, SorA

Relevant abstracts mentioning your query species (Shewanella oneidensis)

Domain hit does not match current annotation

Mistake in InterPro database found!

Propagated in automated annotations!!!

More Automation in Evidence Collecting Needed

Protein Location Linked to Function

[Figure: cell compartments labeled cytoplasm, inner membrane, periplasm, peptidoglycan, outer membrane, extracellular]

Multiple Routes of Secretion

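[Figure: signal peptide types and the peptidases that process them. Motifs shown: ++ LXGC; +++ G P AXA; ++ K/RRXFXK AXA; F E G; GG. Peptidases shown: LepB, LspA, PilD, C39.]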

Bioinformatics Tools for Localization Prediction

[Figure: localization prediction tools grouped by target. Tools named: LipoP, Psort, Sosui, TMHMM, Phobius, HMMTOP, SubLoc, Cello, Secretome, ProfTMB, Bomp, PrediSi, SignalP, TatP. Target labels: Lipo, IM TM, BBTM barrel, LspA, LepB.]

• Incorrect start sites have a strong impact on predictions!

• Different tools have unique specialties

• No one tool provides good predictions for all proteins
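Because no single tool is reliable for every protein, one common workaround is to compare several predictions and accept only calls that agree. The sketch below shows a simple majority vote. It assumes each tool's output has already been reduced to a compartment label; the tool names in the usage example are placeholders, and no real tool output format is implied.

```python
from collections import Counter


def consensus_location(predictions, min_agree=2):
    """Combine per-tool localization calls for one protein.

    predictions: dict mapping tool name -> compartment label, or None when
                 the tool made no call.
    min_agree:   minimum number of tools that must agree.
    Returns the winning label, "unresolved" for ties or weak support,
    or "unknown" when no tool made a call.
    """
    votes = Counter(loc for loc in predictions.values() if loc is not None)
    if not votes:
        return "unknown"
    ranked = votes.most_common()
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if top_count < min_agree or tied:
        return "unresolved"
    return top_label


if __name__ == "__main__":
    # Placeholder tool names and calls, for illustration only.
    calls = {"tool_a": "periplasm", "tool_b": "periplasm", "tool_c": "cytoplasm"}
    print(consensus_location(calls))
```

A plain vote ignores the fact that different tools have different specialties; weighting each tool by the protein class it handles best would be a natural refinement, and proteins flagged "unresolved" are the ones worth manual review.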

Example: c-type cytochromes

• Contain CXXCH motif for binding heme; so do some other proteins that are not c-type cytochromes

• All are secreted across the inner membrane and then assembled

• 60 proteins in MR-1 have CXXCH

• Only 43 have a leader peptide and are predicted to be c-type cytochromes
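The two-step filter described above (CXXCH motif, then leader peptide) can be sketched as below. The motif search is a plain regular expression. The leader-peptide calls are assumed to come from a separate signal-peptide predictor and are passed in as a dictionary; the sequences in the usage example are toy data, not MR-1 proteins.

```python
import re

# CXXCH heme-binding motif: Cys, any two residues, Cys, His.
HEME_MOTIF = re.compile(r"C..CH")


def count_heme_motifs(sequence):
    """Count CXXCH occurrences, including overlapping ones."""
    return len(re.findall(r"(?=C..CH)", sequence))


def candidate_c_type_cytochromes(proteins, has_leader):
    """Return names of proteins with a CXXCH motif AND a leader peptide.

    proteins:   dict mapping protein name -> amino acid sequence.
    has_leader: dict mapping protein name -> bool, taken from an external
                signal-peptide prediction (assumed input, not computed here).
    """
    return [
        name
        for name, sequence in proteins.items()
        if HEME_MOTIF.search(sequence) and has_leader.get(name, False)
    ]


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    toy_proteins = {
        "protein_1": "MKKLCAACHG",   # motif present
        "protein_2": "MSTCGGCHAA",   # motif present
        "protein_3": "MAAAA",        # no motif
    }
    toy_leaders = {"protein_1": True, "protein_2": False}
    print(candidate_c_type_cytochromes(toy_proteins, toy_leaders))
```

The motif alone over-predicts, which is why the leader-peptide requirement matters, and the leader call itself depends on the start site being annotated correctly.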

Future Needs in Annotation Automation

• Current methods of automated annotation will lead to propagation of annotation errors and burying of useful evidence

• But manual annotation cannot keep up with rate at which sequences are produced

• Additional automations are needed!

– Protein localization

– Specialty database mining (TCDB, MEROPS, etc.)

– Experimental data mining (appropriate databases don’t exist)