Automated Annotation of Microbial Genomes,
Opportunities and Pitfalls
Margie Romine
Pacific Northwest National Laboratory
Richland, Washington
Shewanella oneidensis MR-1
• Breathes Mn & Fe and other metals thereby changing their solubility
• Also reduces radionuclides and hence impacts their mobility at contaminated sites
• Genome sequenced by The Institute for Genomic Research (TIGR) in 2002 (funded by DOE-OBER)
• Can we now better determine how this organism interacts with metals and radionuclides?
Shewanella spp. Inhabit Many Niches
• Energy rich - fermentation is occurring and energy is continuously being deposited via sedimentation
• Rapidly changing redox conditions/dominant electron acceptors
• Microbial partners are present to remove the acetate produced via anaerobic respiration.
Two more were sequenced by DOE’s Joint Genome Institute and 14 more are under way!
Bacterial Genome Sequencing Explodes
• 341 completed genomes, 976 ongoing
• Partial genome sequences released in just days now by JGI!
• How do we use sequence information to understand how all these organisms function in the environment?
• Annotation is the key, but is now largely automated and hence of lower quality
What is Annotation?
• Locate genes
AGCTTAACTGGGATACGACGACCAGTAGACAGGTRTACGATGAGATATATAT
• Translate to proteins
MASDLKKIYTRPRPDSAWQECVAALFDGHSKDKLACNDDL
• Gather evidence of function
• Assign putative functions
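The "locate genes" and "translate to proteins" steps can be illustrated with a minimal sketch. It assumes the simplest possible gene model (forward strand only, ATG start, first in-frame stop) and the standard codon table; real gene finders also score ribosome binding sites, alternative start codons, and the reverse strand. The function names and the toy sequence are made up for illustration and are not part of Gnare/Puma2.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def translate(dna):
    """Translate codon by codon, stopping at the first stop codon.

    Codons containing ambiguity codes (e.g. R) become 'X'.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def find_orfs(dna, min_codons=3):
    """Return (start, end, protein) for ATG..stop ORFs on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(dna[start:i + 3])))
                start = None
    return orfs


# Toy sequence (invented): one short ORF in the third reading frame.
toy = "CCATGGCTAGCGATTAAGG"
print(find_orfs(toy))
```

Because the start codon choice fixes the N-terminus of the predicted protein, an error at this step carries through to every downstream prediction made from the translated sequence.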
Annotation Drives Post-genomic Research
• Annotation: gene predictions, protein predictions, function predictions
• Methodologies: DNA microarrays, proteomics, targeted gene knock-outs, ChIP-chip
• Data: mRNA expression, protein expression, DNA binding sites
• Interpretation: metabolic modeling, hypothesis
Annotation with Gnare/Puma2
• Developed at Argonne National Laboratory by Natalia Maltsev, Mark D’Souza, Elizabeth Glass, Dina Sulakhe, Mustafa Syed, Pavan Anumula
• http://compbio.mcs.anl.gov/puma2/cgi-bin/index.cgi
• Gnare – Private genome sequences
• Puma2 – Public genome sequences
Types of Functional Descriptors
• Hypothetical protein
• Conserved hypothetical protein
• Conserved domain protein
• Function associated protein
• Class specific enzyme
• Specific function predicted
• Function validated
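One way to make these descriptor types usable by software is to store them as an ordered enumeration. The sketch below assumes the list runs from least to most specific evidence, which the slide implies but does not state. The rule in `transfer_descriptor` (a copied annotation can never be recorded as validated) is likewise an illustrative assumption, not a documented Gnare/Puma2 behavior.

```python
from enum import IntEnum


class Descriptor(IntEnum):
    """Functional descriptor types, assumed ordered by increasing specificity."""
    HYPOTHETICAL_PROTEIN = 1
    CONSERVED_HYPOTHETICAL_PROTEIN = 2
    CONSERVED_DOMAIN_PROTEIN = 3
    FUNCTION_ASSOCIATED_PROTEIN = 4
    CLASS_SPECIFIC_ENZYME = 5
    SPECIFIC_FUNCTION_PREDICTED = 6
    FUNCTION_VALIDATED = 7


def transfer_descriptor(source):
    """Descriptor to record when an annotation is copied to a homolog.

    Assumption: experimental validation applies only to the protein that
    was tested, so a transferred annotation is capped at 'predicted'.
    """
    return min(source, Descriptor.SPECIFIC_FUNCTION_PREDICTED)


print(transfer_descriptor(Descriptor.FUNCTION_VALIDATED).name)
print(transfer_descriptor(Descriptor.CONSERVED_DOMAIN_PROTEIN).name)
```

Keeping the descriptor type alongside the function name lets later users tell a validated function from one that was only inferred.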
Go to Puma page for homolog
Checking Functions Where No Domain Hit Occurs
type IV secretion outer membrane protein, PilW?
Shewanella oneidensis MR-1
MKNCQKG
Domain identified
Align proteins
This is a family of hypothetical proteins. A number of the sequence records state they are transmembrane proteins or putative permeases. It is not clear what source suggested that these proteins might be permeases and this information should be treated with caution.
2.A.86 The Autoinducer-2 Exporter (AI-2E) Family
The AI-2E family (UPF0118) is a large family of prokaryotic proteins derived from a variety of bacteria and archaea. Those examined are about 350 residues in length, and the couple that have been examined exhibit 7 putative transmembrane α-helical spanners (TMSs). E. coli, B. subtilis and several other prokaryotes have multiple paralogues encoded within their genomes. Herzberg et al. (2006) have presented strong evidence for a role of an AI-2E family homologue, YdgG (renamed TqsA), as an exporter of the E. coli autoinducer-2 (AI-2) (Camilli and Bassler, 2006; Chen et al., 2002). AI-2 is a proposed signalling molecule for interspecies communication in bacteria. It is a furanosyl borate diester (Chen et al., 2002).
autoinducer-2 transport protein, TqsA
Clues in InterPro Domain Descriptor
No functional clues
Using Genome Context to Predict Function
Clusters with N-acetylglucosamine catabolic enzymes
Missing enzyme
Hypothesis experimentally validated!
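The genome-context idea can be sketched as a neighborhood scan: flag uncharacterized genes that sit among genes of a known pathway, since they are candidates for that pathway's missing enzyme. This is a minimal sketch; the locus tags and annotations below are invented, and the window size and neighbor threshold are arbitrary assumptions rather than values from any real tool.

```python
def context_candidates(genes, pathway_keyword, window=2, min_neighbors=2):
    """Find hypothetical genes clustered with genes of a given pathway.

    genes: list of (locus_tag, annotation) tuples in chromosomal order.
    Returns (locus_tag, [neighboring pathway locus tags]) for each hit.
    """
    keyword = pathway_keyword.lower()
    hits = []
    for i, (tag, annotation) in enumerate(genes):
        if "hypothetical" not in annotation.lower():
            continue
        lo = max(0, i - window)
        hi = min(len(genes), i + window + 1)
        neighbors = [
            t for j, (t, a) in enumerate(genes[lo:hi], start=lo)
            if j != i and keyword in a.lower()
        ]
        if len(neighbors) >= min_neighbors:
            hits.append((tag, neighbors))
    return hits


# Invented example neighborhood.
genes = [
    ("g1", "ribosomal protein"),
    ("g2", "N-acetylglucosamine transporter"),
    ("g3", "hypothetical protein"),
    ("g4", "N-acetylglucosamine-6-phosphate deacetylase"),
    ("g5", "hypothetical protein"),
    ("g6", "DNA polymerase subunit"),
    ("g7", "tRNA synthetase"),
]
print(context_candidates(genes, "N-acetylglucosamine"))
```

A hit from a scan like this is only a hypothesis. As the slide notes, the prediction still needs experimental validation.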
Precomputed text mining
General enzyme function
sulfite dehydrogenase catalytic molybdopterin subunit, SorA
Relevant abstracts mentioning your query species (Shewanella oneidensis)
Domain hit does not match current annotation
Mistake in InterPro database found and propagated in automated annotations!
More Automation in Evidence Collecting Needed
Protein Location Linked to Function
[Figure: cell compartments labeled cytoplasm, inner membrane, periplasm, peptidoglycan, outer membrane, and extracellular]
Multiple Routes of Secretion
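[Figure: routes of secretion, showing signal peptide motifs and the peptidases that process them. Motif fragments as labeled: "++ LXGC", "+++ G P AXA X", "++ K/RRXFXK AXA X", "F E G", "GG". Peptidases as labeled: LepB (twice), LspA, PilD, C39.]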
Bioinformatics Tools for Localization Prediction
[Figure: localization prediction tools. Names as labeled: LipoP, Psort, Sosui, TmHMM, Phobius, HMMTOP, SubLoc, Cello, Secretome, ProfTMB, Bomp, BBTM, Predsi, SignalP, TatP. Other labels: barrel, IM TM, LspA, LepB.]
• Incorrect start sites have strong impact on predictions!
• Different tools have unique specialties
• No one tool provides good predictions for all proteins
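Since no single tool is reliable for all proteins, one simple way to combine them is to collect every tool's call and accept a location only when enough tools agree. The sketch below assumes each tool's output has already been parsed into a location string (or `None` when the tool makes no call); the predictions shown are invented. Plain voting ignores the point that tools have different specialties, so a real pipeline would more likely weight each tool by the signal it is designed to detect.

```python
from collections import Counter


def consensus_location(predictions, min_agreement=2):
    """Combine per-tool localization calls by simple majority vote.

    predictions: dict mapping tool name -> predicted location, or None
    when the tool makes no call. Returns (location, supporting_info).
    """
    votes = Counter(loc for loc in predictions.values() if loc is not None)
    if not votes:
        return "unknown", []
    ranked = votes.most_common()
    top_loc, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if top_count < min_agreement or tied:
        # Conflicting or weak evidence: report all candidate locations.
        return "needs manual review", sorted(votes)
    supporters = [tool for tool, loc in predictions.items() if loc == top_loc]
    return top_loc, supporters


# Invented predictions for a single protein.
calls = {
    "Psort": "periplasm",
    "SignalP": "periplasm",
    "TmHMM": None,
    "Cello": "cytoplasm",
}
print(consensus_location(calls))
```

All of the inputs depend on the predicted start site, so agreement among tools does not protect against a mis-called N-terminus.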
Example: c-type cytochromes
• Contain CXXCH motif for binding heme…so do some other proteins that are not c-type cytochromes
• All are secreted across the inner membrane and then assembled
• 60 proteins in MR-1 have CXXCH
• Only 43 have a leader peptide and are predicted to be c-type cytochromes
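The two-step filter described above (motif present, then leader peptide present) can be sketched as follows. The CXXCH search is a plain regular expression. The leader peptide call is assumed to come from an external predictor and is passed in as a set of protein IDs. The sequences and IDs are invented for illustration.

```python
import re

# Lookahead so that overlapping CXXCH motifs are all counted.
HEME_MOTIF = re.compile(r"(?=C..CH)")


def count_heme_motifs(protein):
    """Count CXXCH heme-binding motifs in a protein sequence."""
    return len(HEME_MOTIF.findall(protein))


def classify_cxxch_proteins(proteins, leader_peptide_ids):
    """Separate CXXCH proteins into likely and unlikely c-type cytochromes.

    proteins: dict of protein id -> amino acid sequence.
    leader_peptide_ids: ids that an external tool predicts to carry a
    leader peptide (assumed input; no predictor is called here).
    """
    result = {}
    for pid, seq in proteins.items():
        n = count_heme_motifs(seq)
        if n == 0:
            continue
        result[pid] = {
            "heme_motifs": n,
            "candidate_c_type_cytochrome": pid in leader_peptide_ids,
        }
    return result


# Invented sequences.
proteins = {
    "p1": "MKKLLACAACHGGSCMSCHK",  # two motifs, predicted secreted
    "p2": "MTEYCGGCHAA",           # one motif, no leader peptide
    "p3": "MSTNPKPQ",              # no motif
}
print(classify_cxxch_proteins(proteins, leader_peptide_ids={"p1"}))
```

Applied to MR-1, a filter of this shape is what separates the 60 CXXCH-containing proteins from the 43 predicted c-type cytochromes.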
Future Needs in Annotation Automation
• Current methods of automated annotation will lead to propagation of annotation errors and burying of useful evidence
• But manual annotation cannot keep up with the rate at which sequences are produced
• Additional automations are needed!
– Protein localization
– Specialty database mining (TCDB, MEROPS, etc.)
– Experimental data mining – appropriate databases don’t exist