Bacterial genome annotation in the AGC group

GENOSCOPE/CNRS UMR "Structure et évolution des génomes", Dir. Jean Weissenbach

Claudine Médigue Atelier de Génomique Comparative

Meeting on Cenibacterium arsenoxidans annotation - 14/04/05

What is genome annotation?

Annotation: a note added by way of comment or explanation.

Typical genome annotation questions:

• What genes does this genome contain?
• What is their location?
• What proteins do they encode?
• How are they regulated?
• In what interactions and in what pathways do the protein products participate?

Detection by content

What is genome annotation?

Three annotation levels (L. Stein, 2001)

Static view of the genome

Syntactic/structural annotation (EMBL)
• Location of genes (both protein-coding genes and RNA genes)
• Location of regulatory signals
• Location of other regions (such as repeats, etc.)

Functional annotation (SWISSPROT)
• Biological function of the genes
• Operators family

Dynamic view of the genome

Process (or relational) annotation: how genomic objects are linked to build functional modules responsible for specific tasks in the cell, such as:
• metabolic networks
• regulatory processes
• molecular assembly
• …

Experimental results

Structural annotation tools

AMIGene : CDS prediction in bacterial genomes

tRNA-scan : tRNA gene prediction (G. Fichant et al.)

findrRNA : rRNA gene finding

ProFED : Procaryotic Frameshift Error Detection

From the AGC group

AFC/Kmean : Statistical analysis (e.g., codon or oligonucleotide usage)

AMIMat : CDS prediction in bacterial genomes

From different authors

Petrin : rho-independent terminator prediction (C. Term et al.)

Nosferatu : Close or distant DNA repeats (E. Rocha et al.)

Spat : Pattern finding, such as RBS, promoters, … (A. Viari et al.)

Oriloc : Cumulative GC skew to predict the replication origin and terminus
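The cumulative GC skew idea listed for Oriloc can be illustrated with a minimal sketch. This is not Oriloc's implementation: the function names, the window size and the simple (G - C) / (G + C) skew per window are assumptions made for illustration. The sketch also assumes the common convention that the cumulative curve reaches its minimum near the replication origin and its maximum near the terminus, which holds for many but not all bacterial chromosomes.

```python
def cumulative_gc_skew(seq, window=1000):
    """Return the cumulative GC skew curve, one value per window.

    The skew of a window is (G - C) / (G + C); windows without G or C
    contribute 0. Illustrative only, not the Oriloc algorithm.
    """
    seq = seq.upper()
    total = 0.0
    curve = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        total += skew
        curve.append(total)
    return curve


def predict_origin_terminus(curve, window=1000):
    """Return approximate (origin, terminus) positions in nucleotides.

    Assumes minimum of the curve ~ origin and maximum ~ terminus.
    Positions are the right-hand ends of the extreme windows.
    """
    i_min = min(range(len(curve)), key=curve.__getitem__)
    i_max = max(range(len(curve)), key=curve.__getitem__)
    return (i_min + 1) * window, (i_max + 1) * window


if __name__ == "__main__":
    toy = "C" * 3000 + "G" * 3000  # artificial sequence with one skew switch
    curve = cumulative_gc_skew(toy, window=1000)
    print(curve)
    print(predict_origin_terminus(curve, window=1000))
```

On a real chromosome the window would typically be much larger, and the curve is usually inspected graphically rather than reduced to two positions.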

=> An ORF longer than 300 nt is probably not a random ORF

GTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCGTGATTTTGGATTCA...GTCGTTTAACAACGTCG Stop A D N N S T Q E T A M T V I T D S V V Stop

GTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCGTGATTTTGGATTCA...GTCGTTTAACAACGTCG Stop M T V I T D S V V Stop

ORF (Open Reading Frame)

Potential coding region
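The 300 nt rule of thumb for ORFs can be sketched as follows. This is a minimal illustration, not the method of any tool named in these slides: the function name, the restriction to the forward strand, and the stop-to-stop definition of an ORF are assumptions. An open stretch at the very end of the sequence (no terminating stop codon) is not reported.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=300):
    """Return stop-to-stop ORFs on the forward strand.

    Each ORF is a tuple (start, end, frame) with 0-based, end-exclusive
    coordinates that exclude the flanking stop codons. Only ORFs of at
    least min_len nucleotides are kept.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - orf_start >= min_len:
                    orfs.append((orf_start, pos, frame))
                orf_start = pos + 3  # next ORF starts after this stop
    return orfs


if __name__ == "__main__":
    toy = "TAA" + "GCT" * 100 + "TAG"  # one 300 nt ORF in frame 0
    print(find_orfs(toy))
```

A real gene finder would also scan the reverse strand and then look for start codon candidates inside each ORF.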

=> We use a statistical property of coding regions: their composition in oligonucleotides of length k differs from that of non-coding regions.

GTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCGTGATTTTGGATTCA...GTCGTTTAACAACGTCG Stop M T V I T D S V V Stop

Coding probability?

RBS, start candidates

Gene finding process

Start codon

http://cwx.prenhall.com/horton/medialib/media_portfolio/

Ribosome binding sites (RBS)

RBS-finder (TIGR)

Gene finding: methods based on Markov models

GeneMark (Borodovsky), Glimmer (Salzberg)

• Statistical model

Transition probabilities: the probability that a nucleotide X (A, C, G or T) is at position i depends only on the type of the k preceding nucleotides: P(X | X1...Xk).

Learning step => gene models

• Practical use

[Figure: coding probability (Pcodant) computed in a sliding window w between start and stop, in phases 1, 2 and 3; positions -3, -2, -1 and +1, +2, +3 around the current nucleotide]

Searching for stop/start codon patterns (RBS) + chaining constraints
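The statistical model above can be sketched as a plain order-k Markov chain. This is a simplified illustration under stated assumptions, not GeneMark or Glimmer: the function names, the pseudocount, and the use of a single homogeneous model (rather than one model per codon phase) are choices made here for brevity.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_markov(seqs, k=2, pseudo=1.0):
    """Learning step: estimate P(X | k preceding nucleotides) from sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            ctx, nxt = s[i - k:i], s[i]
            if nxt in ALPHABET and all(c in ALPHABET for c in ctx):
                counts[ctx][nxt] += 1
    model = {}
    for ctx in map("".join, product(ALPHABET, repeat=k)):
        total = sum(counts[ctx][n] for n in ALPHABET) + 4 * pseudo
        model[ctx] = {n: (counts[ctx][n] + pseudo) / total for n in ALPHABET}
    return model


def log_likelihood(seq, model, k=2):
    """Sum of log transition probabilities along seq (ambiguous bases skipped)."""
    seq = seq.upper()
    ll = 0.0
    for i in range(k, len(seq)):
        ctx, nxt = seq[i - k:i], seq[i]
        if ctx in model and nxt in model[ctx]:
            ll += math.log(model[ctx][nxt])
    return ll


def coding_score(seq, coding_model, noncoding_model, k=2):
    """Log-odds of the coding model versus the non-coding model."""
    return log_likelihood(seq, coding_model, k) - log_likelihood(seq, noncoding_model, k)


if __name__ == "__main__":
    coding = train_markov(["ATGGCTGCTGCTGCTGCT" * 5])      # toy "coding" set
    noncoding = train_markov(["ATATATATATATATTTTAAA" * 5])  # toy "non-coding" set
    print(coding_score("GCTGCTGCTGCT", coding, noncoding))
    print(coding_score("ATATATATAT", coding, noncoding))
```

In practice the score is evaluated in a sliding window and separately for each reading frame, which is what produces the coding probability curves shown on the slides.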

How are reference models built in the learning step?

COMPLETE GENOME -> longest ORFs extraction (500 to 1000 bp)

GeneMark ("Make-mat")
• Set of sequences: coding + non-coding
• The matrix of transition probabilities is built by discrimination (coding versus non-coding)

Glimmer ("Glimmer-learn")
• Set of sequences: coding
• The matrix of transition probabilities is built by assimilation (coding versus coding)

=> Gene model (matrix) which reflects the codon usage of the coding regions

E. coli gene model

[Figure: coding probability curves by reading frame (+1, +2, +3) for E. coli and C. jejuni sequences, computed with the E. coli gene model]

The reference matrix used by the gene finding methods is very important!

Example of gene prediction

[Figure: coding probability curves in the six reading frames (+1, +2, +3, -1, -2, -3), computed with the Acinetobacter "native" gene model]

The matrix used does not fit the codon usage of the genes found in this part of the sequence. Horizontal transfer?

Several existing problems
• start codon assignment (non-ATG / alternative starts)
• detection of small genes
• "atypical" genes

AMIGene (S. Bocs): Annotation of MIcrobial Genes

• Building one or more gene models: AMIMat
• Gene prediction using a Markov model (such as GeneMark)
• Heuristic for the selection of the most probable CDSs

Heterogeneity in genomic sequences

AMIGene and gene models … http://www.genoscope.cns.fr/agc/tools/amigene

• Building a gene model from the user's sequence (> 10 kb)
• Using gene models computed for a set of genomes (about 80)

S. Cruveiller presentation

Gene model construction: AMIMat strategy

Functional annotation

"FUNCTION"?
• biochemical role
• physiological role
• mechanism

• by sequence similarity (databank screening)
• experimental (reporter gene; differential expression...)
• by context (neighbourhood)
• "synteny"
• metabolism, …

Functional annotation tools

From the AGC group

AutoFAssign : Automatic functional assignment

Syntonizer : Synteny group detection

From different authors

InterProScan : Searching for functional domains in the Prosite, PFAM and PRODOM databanks

Cognitor : Finding similarities in the Clusters of Orthologous Groups (COG classification)

BlastP : Similarity searches in protein databanks and alignments. Also used for ortholog and paralog identification

SignalP / TMHMM : Signal peptide and transmembrane helix predictions

PRIAM : Finding similarities with enzymatic profiles (enzymatic classification)

Pathway Tools (BioCyc / P. Karp) : Metabolic pathway reconstruction

D. Vallenet presentation

L. Labarre presentation

Similarity searches: protein databanks

Translated CDSs (= proteome) + SWISSALL -> BlastP / FastA

For each compared peptide sequence: list of the most "similar" databank proteins (= BLAST hits).

• The presumed biological function is transferred by similarity (identity > 50% over 80% of the length of the sequences).

• Annotations such as 'putative kinase' are propagated to other proteins that resemble the first one less and less. => What is the similarity threshold above which two proteins can have the same function?

• Sequence similarity versus similarity in structure or function

• Incomplete/wrong databank annotations => propagation of annotation errors

• "Orphans"

Similarity searches: protein motif databanks

Goal: take the modularity of proteins into account

Translated CDSs (= proteome) + protein domain databank -> "ad hoc" program

For each peptide sequence: characteristics of the most probable protein motifs

• Domains are recorded as "profiles"
• As many search programs as databanks (different formats) -> PROSITE, BLOCKS, PRINTS, PFAM, etc.
• Complements the BlastP results => avoids a single annotation in the case of modular proteins.

Neighbourhood exploration: characterization of orthologs

[Figure: proteins of Genome A aligned by dynamic programming (Dyn. Prog.) with proteins of Genome B, showing 1-1 relations, 1-n relations and an orphan gene]

• Comparison of the proteomes of two genomes A and B.

• Each protein of Gi is aligned with all the proteins of Gj.

Relations:
• 1-1: "Bidirectional Best Hits" (BBH)
• 1-n: "Best Hits"

• A pair of orthologs satisfies the bijective BBH relation.

Number of BBH between pairs of genomes (percentage of the genes of each genome):
• E. coli (4174 genes) / B. subtilis (4098 genes): BBH = 1503 (36.0% / 35.0%)
• S. aureus (2593 genes) / B. subtilis (4098 genes): BBH = 1552 (59.8% / 37.9%)
• E. coli (4174 genes) / Y. pestis (4017 genes): BBH = 2402 (57.5% / 59.8%)
• Y. pestis (4017 genes/CDSs) / Y. pseudotuberculosis (4347 genes/CDSs): BBH = 3518 (87.6% / 80.9%)
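The bidirectional best hit relation can be sketched in a few lines. This is a minimal illustration: the input format (dictionaries mapping a (query, subject) pair to an alignment score) and the function names are assumptions, and ties between equally scored hits are resolved arbitrarily by insertion order.

```python
def best_hits(scores):
    """Map each query protein to its best-scoring subject protein.

    scores: dict {(query, subject): alignment_score}
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b1"): 150, ("a3", "b3"): 90}
    b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 150, ("b2", "a1"): 50, ("b3", "a3"): 90}
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this toy example a2 has b1 as best hit, but b1 prefers a1, so (a2, b1) is only a "best hit" and not a BBH.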

Clusters of Orthologous Groups = COG (Koonin)

Principle:
• pairwise comparisons of the proteomes of 70 bacterial genomes
• grouping of orthologous genes (BBH): they form a particular functional class

A COG = a set of proteins that should derive from a common ancestral protein

http://www.ncbi.nlm.nih.gov/COG/

PkGDB : Procaryotic Genome DataBase

Goal: 'clean', consistent annotation data as the basis for comparative genomics methods

• Relational DBMS (MySQL)
• Complete genomes (RefSeq, NCBI)

Integration into PkGDB
• Handling of 'frameshifts'
• Data homogeneity

Process for integrating public data into PkGDB

[Diagram]
• Databank files -> PkGDB, Databank_Annotation: data taken from the databanks
• 'Valid' databank CDSs (1) -> construction of the pre-matrices (transition probabilities / Markov model) -> coding probability curves
• All CDSs: CDS set (1) + CDSs whose boundaries were corrected automatically OR are to be corrected manually -> PkGDB, Compare_Annotation: set of 'valid' CDSs
• Correction/verification of 'problem' CDSs
• Annotation of pseudogenes

Example of corrections: annotation of pseudogenes

'Fragment' CDSs (type fCDS)

'Complex' CDS (type cCDS)

Error type = ‘No3multiple’

[Figure: gene map of the kdpB, kdpC, kdpD, kdpE, speF region]

     gene            622524..624571
                     /gene="kdpB"
                     /locus_tag="S0610"
                     /note="frameshift"
                     /pseudo
                     /db_xref="GeneID:1077039"
     gene            624580..625152
                     /gene="kdpC"
                     /locus_tag="S0611"
     CDS             624580..625152
                     /gene="kdpC"
                     /locus_tag="S0611"
                     /function="enzyme; Transport of small molecules: Cations"
                     /codon_start=1
                     /transl_table=11
                     /product="potassium-transporting ATPase"
     gene            625145..627825
                     /gene="kdpD"
                     /locus_tag="S0612"
                     /note="frameshift"
                     /pseudo
     gene            627822..628507
                     /gene="kdpE"
                     /locus_tag="S0613"
                     /note="frameshift"
                     /pseudo
     gene            629197..631394
                     /gene="speF"
                     /locus_tag="S0614"
                     /note="frameshift"
                     /pseudo
     …
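The error type 'No3multiple' presumably flags features whose length is not a multiple of three. Under that assumption (the slides do not define the check, and the function name below is hypothetical), a minimal sketch applied to the coordinates of the excerpt above looks like this:

```python
def length_error(start, end):
    """Return 'No3multiple' if the feature length is not a multiple of 3, else None.

    Coordinates are 1-based and inclusive, as in GenBank/EMBL feature tables.
    The error label is taken from the slide; the rule itself is an assumption.
    """
    length = end - start + 1
    return "No3multiple" if length % 3 else None


if __name__ == "__main__":
    features = {
        "kdpB": (622524, 624571),
        "kdpC": (624580, 625152),
        "kdpD": (625145, 627825),
        "kdpE": (627822, 628507),
        "speF": (629197, 631394),
    }
    for name, (start, end) in features.items():
        print(name, end - start + 1, length_error(start, end))
```

With these coordinates only kdpC, the one gene that carries a CDS feature, has a length divisible by three; the four genes flagged /pseudo with a "frameshift" note do not.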

Process for integrating public data into PkGDB (continued)

• Compare_Annotation: databank annotations, status = 'Checked'
• Corrected/validated CDSs (2) -> AMIMat: construction of the gene models








Syntactic re-annotation

Completion/correction of data

MICheck: (syntactic) re-annotation of bacterial genomes

Goal: quickly check whether the annotations recorded in the sequence databanks for a given genome are complete.

Cruveiller et al. (2005) MICheck : A Web tool to fast check annotations of bacterial genomes. Nucleic Acids Research (in revision)

http://www.genoscope.cns.fr/agc/tools/micheck

[Diagram]
• Input: EMBL or GenBank file (nucleotide sequence + annotations) + gene model(s)
• AMIGene: predicted CDSs; computation of the average coding probability
• COMPARISON of the annotated genes with the predicted CDSs, based on the position of the stop codons
• Output: common CDSs; CDSs UNIQUE to the databanks; CDSs UNIQUE to AMIGene
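The comparison step can be sketched as a set comparison keyed on the stop codon position. This is an illustrative reading of the diagram, not MICheck's code: the CDS representation (start, end, strand), the function names and the output keys are assumptions.

```python
def stop_key(cds):
    """Key a CDS by strand and stop codon side: 'end' on '+', 'start' on '-'."""
    start, end, strand = cds
    return (strand, end if strand == "+" else start)


def compare_by_stop(annotated, predicted):
    """Split CDSs into common, databank-only and prediction-only sets.

    annotated, predicted: lists of (start, end, strand) tuples, 1-based inclusive.
    Two CDSs are considered the same gene if they share strand and stop position,
    even when their start codons differ.
    """
    ann = {stop_key(c): c for c in annotated}
    pred = {stop_key(c): c for c in predicted}
    return {
        "common": [(ann[k], pred[k]) for k in sorted(ann.keys() & pred.keys())],
        "unique_databank": [ann[k] for k in sorted(ann.keys() - pred.keys())],
        "unique_predicted": [pred[k] for k in sorted(pred.keys() - ann.keys())],
    }


if __name__ == "__main__":
    annotated = [(100, 400, "+"), (500, 800, "-"), (900, 1200, "+")]
    predicted = [(130, 400, "+"), (500, 770, "-"), (1500, 1800, "+")]
    for category, items in compare_by_stop(annotated, predicted).items():
        print(category, items)
```

Keying on the stop codon makes the comparison robust to disagreements about the start codon, which the slides list among the known weak points of gene prediction.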

Bacterial genome re-annotation projects

Public sequence databanks

NCBI (GenBank): RefSeq (Reference Sequence) project -> genes added/removed
• Reviewed RefSeq: automatic annotations + manual curation by NCBI experts
• Provisional RefSeq: original annotations; automatic annotations only

CMR (Comprehensive Microbial Resource) database of TIGR -> additional genes
• "Primary annotation": original annotations + "TIGR annotation": automatic annotations (available for consultation only)

MICheck results on A. pernix (status: Reviewed RefSeq)

Original GenBank file (BA000002)

RefSeq file (NC_000854)

[Figure: genomic region APE1077 to APE1097, with rplX, APE1087a, APE1088a and APE1089]

CDSs UNIQUE to AMIGene

BA00000215651569

1835

941186 NC_000854

MICheck results on O. iheyensis (status: Reviewed RefSeq)

CDSs UNIQUE to AMIGene

BA00002834063392

214

1818 NC_004193

RefSeq file (NC_004193)

BA000028

     gene            complement(2047445..2047618)
                     /gene="OB2021"
     CDS             complement(2047445..2047618)
                     /gene="OB2021"
                     /product="hypothetical protein"
     gene            2047725..2048765
                     /gene="OB2022"
     CDS             2047725..2048765
                     /gene="OB2022"
                     /EC_number="3.5.1.28"
                     /product="N-acetylmuramoyl-L-alanine amidase (partial)"
                     /translation="MKLTTLISTIL…"
     gene            complement(2048799..2049245)
                     /gene="OB2023"
     CDS             complement(2048799..2049245)
                     /gene="OB2023"

NC_004193

     gene            complement(2047445..2047618)
                     /locus_tag="OB2021"
                     /db_xref="GeneID:1018510"
     CDS             complement(2047445..2047618)
                     /locus_tag="OB2021"
                     /product="hypothetical protein"
     misc_feature    2047725..2048765
                     /note="similar to N-acetylmuramoyl-L-alanine amidase"
     gene            complement(2048799..2049245)
                     /locus_tag="OB2023"
                     /db_xref="GeneID:1018512"
     CDS             complement(2048799..2049245)
                     /locus_tag="OB2023"
                     /note="CDS_ID OB2023

Bacterial genome re-annotation projects (continued)

EBI (EMBL): Genome Reviews project -> genes removed
• Enrichment/correction of the original functional annotations (data from UniProt, Gene Ontology, InterPro, etc.)
• Standardization/homogenization of the original annotations
• Detection and removal of 'erroneous' annotations (Xanthippe system)

MICheck results on S. oneidensis (status: Reviewed RefSeq)

CDSs UNIQUE to AMIGene

AE00517641144144

20150

2160 AE005176_GR

Genome Reviews file (AE005176_GR)

Original GenBank file (AE005176)

Original annotation file and EMBL (GR) file

AE005176_GR

FT   CDS             3264761..3266158
FT                   /codon_start=1
FT                   /gene="dctM {UniProt/TrEMBL:Q8ECK2}"
FT                   /locus_tag="SO3136 {UniProt/TrEMBL:Q8ECK2}"
FT                   /product="C4-dicarboxylate transport protein …
FT   CDS             3268059..3269438
FT                   /codon_start=1
FT                   /gene="dctD {UniProt/TrEMBL:Q8ECK1}"
FT                   /locus_tag="SO3138 {UniProt/TrEMBL:Q8ECK1}"
FT                   /product="C4-dicarboxylate transport
FT                   transcriptional regulatory protein
FT                   {UniProt/TrEMBL:Q8ECK1}"
FT   CDS             complement(3273023..3273601)
FT                   /codon_start=1
FT                   /gene="tdk {UniProt/Swiss-Prot:Q8ECK0}"
FT                   /locus_tag="SO3140 {UniProt/SwissProt:Q8ECK0}"
FT                   /product="Thymidine kinase {UniProt/Swiss-
FT                   Prot:Q8ECK0}"
FT                   /EC_number="2.7.1.21 {UniProt/Swiss-Prot:Q8…}"
FT                   /function="ATP binding {GO:0005524}"
FT                   /function="thymidine kinase activity {GO:0004797}"
FT                   /biological_process="DNA metabolism
FT                   {GO:0006259}"
FT   CDS             3276288..3278438
FT                   /codon_start=1
FT                   /gene="dcp-1 {UniProt/TrEMBL:Q8ECJ9}"
FT                   /locus_tag="SO3142 {UniProt/TrEMBL:Q8ECJ9}"
FT                   /product="Peptidyl-dipeptidase Dcp"
FT                   /function="metalloendopeptidase activity
FT                   {GO:0004222}"
FT                   /biological_process="proteolysis and peptidolysis
FT                   {GO:0006508}"

AE005176

     gene            3266258..3268062
                     /gene="dctB"
                     /locus_tag="SO3137"
                     /note="This region contains an authentic frame shift and is not the result of a sequencing artifact; C4-dicarboxylate transport sensor protein, authentic frameshift"
     gene            3268059..3269438
                     /gene="dctD"
                     /locus_tag="SO3138"
     CDS             3268059..3269438
                     /gene="dctD"
                     /locus_tag="SO3138"
                     /note="similar to GB:X14046, SP:P11049, and PID:29794; identified by sequence similarity; putative"
                     /codon_start=1
                     /transl_table=11
                     /product="C4-dicarboxylate transport transcriptional regulatory protein"
     gene            complement(3269514..3272585)
                     /locus_tag="SO3139"
                     /note="This region contains an authentic frame shift and is not the result of a sequencing artifact; conserved hypothetical protein; identified by Glimmer2; putative"
     gene            complement(3273023..3273601)
                     /locus_tag="SO3140"
     CDS             complement(3273023..3273601)
                     /locus_tag="SO3140"
                     /note="identified by match to PFAM protein family HMM PF00265"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAN56142.1"
                     /product="thymidine kinase"
     gene            3274138..3276066
                     /locus_tag="SO3141"
                     /note="This region contains a gene with one or more premature stops or frameshifts, and is not the result of a sequencing artifact; cytochrome c, degenerate; similar to GP:3628769; identified by sequence similarity; putative"
     …








Syntactic re-annotation

Completion/correction of data

• New genomes (annotation projects)

• Analysis results:
Intrinsic: genes, signals, repeats, …
Extrinsic: Blast, InterPro, COG, synteny, …

General strategy for bacterial genome annotation -1-

Biological databases

• Sequencing
• Automatic gene prediction: prediction of coding regions, promoters, terminators, RNAs
• Functional annotation (automatic): similarity searches, assignments to protein families, sequence features, …; suggestion of function, classification
• Manual annotation: validation of automatic annotations, additional database and literature searches, contextual analysis, gene fusions, protein interactions, phylogenetic profiles
• Re-annotation: validation and update of previous annotations; expression data, knock-out phenotypes, etc.
• Integration into other analysis platforms

[Diagram labels along the steps: Lab work + Bioinformatics; Bioinformatics; Manual effort]

AUTOMATION needed

VISUALIZATION needed

General strategy for bacterial genome annotation -2-

Biological databases

Sequencing, automatic gene prediction, functional annotation (automatic), manual annotation, re-annotation, integration into other analysis platforms (Bioinformatics)

GRAPHICAL ANNOTATION INTERFACE (Web server connected to the database)
• Validation and completion of the automatic annotation
• (Re)annotation using synteny results

General layout of the MaGe system

PkGDB (MySQL DB)
• Public databanks
• "Private" sequences
• Specialized databases for annotation and re-annotation projects: AcinetoDB, YersiniaScope, HaloplanktisDB, BacillusScope, ColiScope, FrankiaDB

Analysis tools: Blast, tRNAscan-SE, InterProScan, PRIAM, COGnitor, TMHMM

"AutoFunc": automatic functional assignment combining multiple evidence and synteny results

Automatic functional assignment module (AutoFunc) -1-

Reference genomes: E. coli and Acinetobacter ADP1

DA = "Definitive_Annotation"
IF identity > 40% AND alignment on 80% of the protein lengths
OR identity > 30% AND alignment on 80% of the protein lengths AND SYNTENY

/label : CDS name (very different from gene name!) = CENARnumber
/gene : Gene name and synonyms from the EcoGene database IF one E. coli protein is similar to the annotated gene
/product : Description of the best hit: DA_SWALL, OR the one of Monica R. (EcoGene database) IF one E. coli protein is similar to the annotated gene: DA_COLI
/function : Functional classification (E. coli)
/EC_number : PRIAM EC number(s)

PM = Partial_Match
IF identity > 40% AND partial alignment (> 80% of the databank protein length)

/product : Description of the best hit: PM_SWALL, OR the one of Monica R. (EcoGene database) IF one E. coli protein is similar to the annotated gene: PM_COLI + (partial match)

[Figure: alignment of the query protein with the databank protein]

Automatic functional assignment module (AutoFunc) -2-

PA = Putative_Annotation
IF 30% < identity < 40% AND alignment on 80% of the protein lengths

/product : Putative/Probable (?) + description of the best hit PA_SWALL OR the one of E. coli PA_COLI

HP = Hypothetical_Protein
IF identity < 30%: no significant databank similarity

/product : Hypothetical protein / Orphan protein?
/note : Summary of the 3 best SWALL hits

FO = Fragment_Of
IF identity > 40% AND partial alignment (> 80% of the query protein length)

/product : Description of the best hit: PM_SWALL, OR the one of Monica R. (EcoGene database) IF one E. coli protein is similar to the annotated gene: PM_COLI + (partial)

[Figure: alignment of the query protein with the databank protein]
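The AutoFunc decision rules can be summarized as a small classifier. This is a sketch under stated assumptions, not the AutoFunc code: the function name and argument names are hypothetical, "alignment on 80%" is read as coverage >= 0.8, the boundary cases (identity exactly 30% or 40%) are assigned arbitrarily, and hits that match none of the slide rules fall back to HP.

```python
def autofunc_class(identity, query_cov, databank_cov, synteny=False):
    """Classify a best hit as DA, PM, FO, PA or HP.

    identity:     percent identity of the alignment (0-100)
    query_cov:    fraction of the query protein covered by the alignment
    databank_cov: fraction of the databank protein covered by the alignment
    synteny:      True if the hit is supported by a synteny group
    """
    full = query_cov >= 0.8 and databank_cov >= 0.8
    if full and (identity > 40 or (identity > 30 and synteny)):
        return "DA"  # Definitive_Annotation
    if identity > 40 and databank_cov >= 0.8:
        return "PM"  # Partial_Match: most of the databank protein is covered
    if identity > 40 and query_cov >= 0.8:
        return "FO"  # Fragment_Of: most of the query protein is covered
    if full and 30 < identity <= 40:
        return "PA"  # Putative_Annotation
    return "HP"      # Hypothetical_Protein (assumed fallback)


if __name__ == "__main__":
    print(autofunc_class(55, 0.95, 0.90))                # DA
    print(autofunc_class(35, 0.90, 0.90, synteny=True))  # DA
    print(autofunc_class(35, 0.90, 0.90))                # PA
    print(autofunc_class(60, 0.40, 0.90))                # PM
    print(autofunc_class(60, 0.90, 0.40))                # FO
    print(autofunc_class(25, 0.90, 0.90))                # HP
```

The order of the tests matters: the full-length rules are checked before the partial ones so that a hit covering both proteins is never reported as partial.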

Definitive annotation: example

2.1.1: DNA replication

Definitive annotation, partial match: example

Ratios of alignment lengths, with Lmatch (length of match), Lprot1 (length of protein 1) and Lprot2 (length of protein 2):
minL = Lmatch / min(Lprot1, Lprot2) and maxL = Lmatch / max(Lprot1, Lprot2)
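The two ratios are straightforward to compute; the short sketch below (function name assumed) shows how a high minL combined with a low maxL indicates that the match covers most of the shorter protein but only part of the longer one.

```python
def alignment_ratios(l_match, l_prot1, l_prot2):
    """Return (minL, maxL) as defined on the slide.

    minL = Lmatch / min(Lprot1, Lprot2)
    maxL = Lmatch / max(Lprot1, Lprot2)
    """
    min_l = l_match / min(l_prot1, l_prot2)
    max_l = l_match / max(l_prot1, l_prot2)
    return min_l, max_l


if __name__ == "__main__":
    # A 240-residue match between a 300-residue and a 600-residue protein
    print(alignment_ratios(240, 300, 600))
```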

View of CENAR0426 in MaGe

CENAR0426

Definitive annotation, partial: example

View of CENAR0361 in MaGe

CENAR0361

Probable sequence error -> the start of the gene is missing (set CENAR0361 to CheckSeq)

"Partial" and "partial match": other cases

[Figure: region with mdoG and mdoH; CENAR3149 ("partial"), 3150, 3151 ("partial match"), CENAR3153 and CENAR3156 aligned with mdoH]

CENAR3149/3950 : "CheckSeq"

CENAR3153/56 : adjust the start codon
