HCG: the database for hierarchical gene...

Page 1: HCG: the database for hierarchical gene classificationcsbl.bmb.uga.edu/HCG/HCG-database-10-BMC-bioinformatics.pdfHCG classification by trees, (2) browsing HCG classification by organisms,

HCG: a database for hierarchical classification of functionally equivalent genes in prokaryotes

Fenglou Mao*, Hongwei Wu*, Victor Olman, Ying Xu1

Computational Systems Biology Laboratory Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics

University of Georgia, Athens, GA 30602, USA *These authors contributed equally to this paper

1 Corresponding author

Abstract

Background: The existing gene annotation schemes generally classify genes into two

levels of parallel and unrelated homologous and/or orthologous gene groups, limiting our

capabilities for gene function prediction at higher resolution. While homology and

orthology are useful concepts for evolutionary studies of genes, they may not be the most

appropriate ones for functional classification of genes, especially at a high-resolution

level.

Results: We present a new gene annotation database: the hierarchical classification

system of genes (HCG), which provides functional annotation of prokaryotic genes in

general at higher resolution than the existing functional classification schemes. The HCG

database consists of clusters, hierarchically organized, of functionally equivalent genes at

varying levels of resolution. Gene clusters at the top of the HCG hierarchy represent

homologous gene groups, and descendant gene clusters represent functionally

equivalent genes at increasingly higher resolution going down from the top to the

leaf-level clusters along the classification hierarchy. We also provide several examples to

demonstrate how HCG can be used to make specific gene function annotations. For each

HCG cluster, we provide a p-value assessing the statistical significance of grouping its

genes together, based on the functional relationship among its genes and their

relationship with genes outside of the cluster.

Conclusion: The HCG database, implemented using MySQL, currently consists of

658,174 genes, 51,205 clusters organized into 21,109 trees, from 224 prokaryotic

genomes. The on-line database supports four search capabilities, namely (1) browsing

HCG classification by trees, (2) browsing HCG classification by organisms, (3) querying

genes against the HCG database to find their gene clusters at the highest resolution possible

and their parent clusters, if any, and (4) annotating sequences provided by a user.

1. Background

With the rapid accumulation of genome sequences along with their genes accurately

predicted, numerous efforts have been devoted to the computer-aided functional

annotation of genes, which have led to the development of a number of functional

classification schemes and associated databases such as Clusters of Orthologous Groups

(COG) [1], Pfam [2], and InterPro [3]. There are also other databases that integrate gene

annotation information with pathway information, such as Kyoto Encyclopedia of Genes

and Genomes (KEGG) [4], BioCyc [5] and the subsystem annotation environment SEED

[6]. While these and other functional classification schemes and databases provide highly

useful information for functional annotation of genomes, they are generally limited to

classification of genes into homologous and/or orthologous gene groups, although

homology and orthology were originally defined in evolutionary terms and do not by

themselves indicate functional relationships. The classification result of such schemes is generally represented as

a collection of parallel and unrelated functionally “equivalent” gene groups, providing a

two-level classification of functionally equivalent genes. We believe that the functional

relationship between genes can be better represented using a hierarchical system, a view

supported by the recent development of the Gene Ontology (GO) [7], which employs a directed

acyclic graph (DAG) structure that is more general than a hierarchy. In general,

gene function classifications fall into two classes: two-level classifications

such as COG, KEGG orthologs and Pfam, and multi-level classifications such as GOA and

our classification scheme, HCG.

The Gene Ontology Annotation (GOA) Database [8] is, to date, the only database that

employs a multi-level classification of gene functions. GOA annotates

genes using GO terms, so it stands on solid ground for function classification. However,

most annotations in GOA are extracted from UniProt and InterPro using three scripts

(ec2go, skpw2go and InterPro2go), and the others are assigned manually with the help of

annotation tools such as GOAnnotator, so it is hard to evaluate the annotation quality.

There are other genome databases with gene annotation information, such as the

integrated microbial genomes (IMG) system [9] and Integr8 [10]. While useful, the gene

annotation in IMG is created using rather simple methods, namely RPS-BLAST

(reverse position-specific BLAST) and bidirectional best hits, which are widely thought to

be inaccurate [11], to have low sensitivity [12] and to yield high false positive rates [13]. IMG

also adopts two-level classification strategies such as Pfam and COG. Integr8 likewise

uses annotations from other databases such as InterPro and Pfam.

We have developed a functional classification scheme for prokaryotic genes,

based on both sequence similarity information and genomic neighborhood information

[14]. A key feature of this classification scheme is that it classifies genes into

functionally equivalent clusters at multiple resolution levels, and these clusters are either

parallel to each other or nested inside one another, hence giving rise to a multi-level

hierarchical structure, under which genes could have “equivalent” functions measured at

varying resolution. For example, genes in any root-level cluster, in this functional

hierarchy, are functionally equivalent in the sense that they are homologous, and genes in

any lower-level cluster represent a group of functionally equivalent genes with higher

specificity (or higher resolution). The functional equivalence relationships among genes

at different resolution are derived based on a two-level classification scheme [14]. The

algorithm first derives the functional relationships among individual gene pairs based on

their sequence similarity and their co-location information in genomes, and then derives

the functional relationships among a group of genes by detecting the groups of genes with

high densities of pair-wise functional relationships within each group versus the

(relatively) lower densities of relationships between each gene group and genes outside of

the group. For each predicted gene cluster (group), we also provide a p-value to measure

how strongly the cluster stands out from the background in which its genes sit. In some sense, this

value also reflects the consistency of annotation within a gene group, i.e., its annotation

quality.
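As a rough illustration of the density criterion described above (not the published algorithm, whose details are given in [14]), the following Python sketch compares the density of pairwise functional links inside a candidate gene group with the density of links between the group and the genes outside it. The gene names, the toy link set and the function name `link_densities` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def link_densities(group, all_genes, links):
    """Return (internal density, external density) for a candidate gene group.

    links is a set of frozensets, each holding two functionally related genes.
    """
    group = set(group)
    outside = set(all_genes) - group
    internal_pairs = list(combinations(sorted(group), 2))
    external_pairs = [(g, o) for g in group for o in outside]
    internal = sum(frozenset(pair) in links for pair in internal_pairs)
    external = sum(frozenset(pair) in links for pair in external_pairs)
    d_in = internal / len(internal_pairs) if internal_pairs else 0.0
    d_out = external / len(external_pairs) if external_pairs else 0.0
    return d_in, d_out


# Toy data: five genes, four pairwise functional links (all hypothetical).
genes = ["a", "b", "c", "d", "e"]
links = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]}

# The group {a, b, c} is fully linked inside and weakly linked to the outside.
d_in, d_out = link_densities(["a", "b", "c"], genes, links)
```

A group whose internal density is much higher than its external density is the kind of cluster the scheme is looking for; how the difference is turned into a p-value is described in [14] and is not reproduced here.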

By applying this classification scheme to genes of 224 prokaryotic genomes, we

have established a database, HCG, of functionally equivalent gene clusters. Intuitively,

the HCG system can be viewed as a “forest” of trees, where each tree consists of a root-

level cluster and its descendent clusters, possibly at different levels. For each cluster in

the HCG system, we have provided an annotation to characterize the common biological

function of the cluster, based on the Gene Ontology (GO) annotation (GOA Proteome

Sets) and NCBI gene-product description. Other information such as Pfam and COG

annotation is also provided for cross-reference purposes.

2. Construction and Content

2.1 The Construction of the Database

The HCG database currently consists of the classification result from 224 complete

prokaryotic genomes (NCBI release of 03/05/2005). While a detailed description of

the clustering algorithm and an analysis of the data have been published elsewhere [14],

we here outline the procedure for database construction and application. The HCG

system has been created using the following steps:

(a) All homologous gene pairs are identified using reciprocal BLASTP [15] with

e-values < 1 for both directions of the search against all the 658,174 genes.

(b) The Smith-Waterman algorithm [16] is performed on all homologous gene pairs

selected from (a) to obtain a multi-value feature vector for each homologous

gene pair, representing the quality of their sequence alignment.

(c) A positive training set consisting of orthologous gene pairs and a negative

training set consisting of homologous but non-orthologous gene pairs are created

for the purpose of training a classifier (see [14] for details).

(d) A parameterized linear classification function is employed to discriminate

orthologous genes from homologous but non-orthologous genes, whose

parameters are selected so that the classification function optimally

discriminates the positive from the negative training data.

(e) A scoring scheme is developed to measure the functional equivalence between

two genes based on the sequence similarity information derived from (d) and

genomic neighborhood information derived from three operon prediction

programs, namely (i) VIMSS [17], (ii) JPOP [18, 19], and (iii) GeneChords [20].

(f) A graph representation is constructed to represent all the 658,174 genes from

224 prokaryotic genomes and their functional equivalence relationship defined

in (e).

(g) A graph-partition algorithm is applied to the graph representing these genes

and their functional relationships to generate a collection of dense sub-graphs

(and sub-sub-graphs, etc.), each of which represents a gene cluster. These gene

clusters form a hierarchical structure. For each cluster, a p-value is calculated to

assess its statistical significance.

(h) Each gene cluster is annotated using a set of keywords and GO terms, based on

common features of the NCBI and GO annotations [10] of individual genes of

the cluster, where the keywords are extracted from the NCBI description of each

gene product, and the GO terms for each cluster are selected based on a

majority-rule vote among GO assignments to individual genes in the cluster.

(i) All gene-classification data is integrated into a MySQL database; and a web

server is created at http://csbl.bmb.uga.edu/HCG to facilitate searching and

accessing the database.
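Step (a) can be pictured with a minimal Python sketch, assuming the BLASTP results have already been parsed into a dictionary that maps each (query, subject) pair to its e-value. The dictionary layout, the gene identifiers and the function name `reciprocal_pairs` are assumptions made for illustration; only the e-value cutoff of 1 and the requirement that a hit be found in both directions come from the procedure above.

```python
def reciprocal_pairs(hits, max_evalue=1.0):
    """Return unordered gene pairs that hit each other in both search directions.

    hits maps (query, subject) -> e-value for one direction of the search.
    """
    pairs = set()
    for (query, subject), evalue in hits.items():
        if query == subject or evalue >= max_evalue:
            continue
        reverse = hits.get((subject, query))
        if reverse is not None and reverse < max_evalue:
            pairs.add(frozenset((query, subject)))
    return pairs


# Hypothetical parsed hits, not real BLASTP output.
hits = {
    ("g1", "g2"): 1e-30,
    ("g2", "g1"): 1e-28,
    ("g1", "g3"): 0.5,   # no reverse hit, so not reciprocal
    ("g3", "g4"): 2.0,   # e-value above the cutoff in this direction
    ("g4", "g3"): 1e-5,
}
pairs = reciprocal_pairs(hits)
```

Only the pair (g1, g2) survives in this toy input; the surviving pairs are the ones that would be passed on to the Smith-Waterman step (b).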

The validity of the predicted gene clusters is checked by comparing the HCG

classification against the genome taxonomy, COG classification [1] and Pfam

classification [2] of genes. The detailed validation procedure and results are given in [14].

2.2 Database Tables

To store the tree structure of the HCG system in a MySQL relational database, we have

designed two tables, Node and Edge shown in Figure 1, to represent the HCG clusters

and the parent-child relationship. Other information such as gene attributes, cluster

annotation, and the p-values of each cluster are also stored in the MySQL tables. Figure 1

shows the relationship among the tables. The table “Gene” is used to store the

information of individual genes, such as gene attributes. The tables “GO”, “Gene_GO”

and “Node_GO” are used to store GO terms, GO annotations for individual genes and GO

term-based annotations for individual clusters, respectively. The table “Gene_Node” is

used to store the genes in each cluster, and the table “Species” is used to store species

information of a genome. There are several additional internal tables that are not

shown in Figure 1 and are omitted from further discussion.
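To make the Node/Edge representation concrete, the following Python sketch stores a small tree as a node table plus a parent-child edge table and walks upward from a cluster to its ancestors. It is only an illustration under stated assumptions: the column names (`node_id`, `hcg_code`, `p_value`, `parent_id`, `child_id`) are hypothetical and not taken from Figure 1, SQLite stands in for MySQL so that the example is self-contained, and the recursive query relies on SQLite's `WITH RECURSIVE` support rather than on anything available in MySQL 4.0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Node (node_id INTEGER PRIMARY KEY, hcg_code TEXT, p_value REAL);
CREATE TABLE Edge (parent_id INTEGER REFERENCES Node(node_id),
                   child_id  INTEGER REFERENCES Node(node_id));
-- p-values are left NULL here because no real values are reproduced.
INSERT INTO Node VALUES (1, '21', NULL), (2, '21.3', NULL),
                        (3, '21.3.0', NULL), (4, '21.3.1', NULL);
INSERT INTO Edge VALUES (1, 2), (2, 3), (2, 4);
""")


def ancestors(conn, hcg_code):
    """Return the codes of all ancestor clusters, nearest parent first."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(node_id, depth) AS (
            SELECT node_id, 0 FROM Node WHERE hcg_code = ?
            UNION ALL
            SELECT e.parent_id, up.depth + 1
            FROM Edge AS e JOIN up ON e.child_id = up.node_id
        )
        SELECT n.hcg_code
        FROM up JOIN Node AS n ON n.node_id = up.node_id
        WHERE up.depth > 0
        ORDER BY up.depth
        """,
        (hcg_code,),
    ).fetchall()
    return [row[0] for row in rows]


parents = ancestors(conn, "21.3.0")
```

Keeping the hierarchy in a separate edge table means a cluster can be moved or re-parented without touching the gene-level tables, at the cost of one join per level when walking the tree.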

2.3 Information Available at HCG

HCG stores and facilitates accessing the basic information about each gene in its database,

including a gene’s position in a genome, PID, locus tag, chain ID, COG number, gene

product description, gene name, sequence, etc, all extracted from the NCBI database. In

addition, we have run COGNITOR [21] to generate the COG numbers for all genes,

including both those with and those without a functional assignment in the NCBI database. So for the

vast majority of the genes in HCG, we have COG numbers. We have also integrated the

GO annotations and Pfam accession ID into the HCG database in a similar fashion.

In addition to the information extracted from other data sources, HCG has a large

quantity of its own data. At the highest level, HCG is a forest of trees, each being a

collection of gene clusters that are either parallel to or part of each other. At the top

level of each tree is a cluster containing all genes in the tree, which are homologous to

each other. Each lower-level cluster consists of genes that are functionally more

equivalent than the genes in the parent cluster. For each cluster, we have calculated a

p-value to estimate the statistical significance of the genes in this cluster forming a

cluster that stands out against the background of other genes [14].

For each gene cluster, we assign its functional annotation using two methods.

First, we assign GO terms to each cluster based on a majority-rule vote using the GO

annotations of individual genes in the cluster [14]. For each HCG cluster in which some

individual genes have been annotated by GOA, one or more consensus GO terms are

generated and used to annotate the cluster. A probability

value is calculated for each of the consensus GO terms, which can be used to assess the

reliability of each function assignment: the higher the probability, the higher the

prediction reliability. Second, we assign text descriptions to each gene cluster, which

are derived from the NCBI gene product descriptions of individual genes and are used to

describe the overall function of the cluster. For each cluster, we calculate a consistency

score between 0 and 1, measuring the consistency among the NCBI descriptions for the

individual genes of the cluster, with 1 representing the most consistent and 0 representing

the least consistent. A detailed description of the algorithm is given in [14]. A user can

use both the cluster GO annotation and the text description to infer the function of genes

assigned to each cluster.
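The majority-rule vote can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of [14]: the 0.5 threshold, the use of the supporting fraction among annotated genes as a stand-in for the probability value, the gene identifiers and the function name `consensus_go_terms` are all assumptions.

```python
from collections import Counter


def consensus_go_terms(gene_go, min_fraction=0.5):
    """Return {GO term: fraction of annotated genes carrying it} for majority terms.

    gene_go maps a gene id to its set of GO terms (empty if unannotated).
    """
    annotated = [terms for terms in gene_go.values() if terms]
    if not annotated:
        return {}
    counts = Counter(term for terms in annotated for term in terms)
    return {
        term: count / len(annotated)
        for term, count in counts.items()
        if count / len(annotated) > min_fraction
    }


# Hypothetical cluster: three annotated genes and one unannotated gene.
cluster = {
    "gene1": {"GO:0000155", "GO:0005524"},
    "gene2": {"GO:0000155", "GO:0005524"},
    "gene3": {"GO:0000155", "GO:0003677"},
    "gene4": set(),
}
consensus = consensus_go_terms(cluster)
```

In this toy cluster GO:0000155 is carried by all three annotated genes and GO:0005524 by two of them, so both become consensus terms, while GO:0003677 (one gene out of three) is dropped. The unannotated gene would then inherit the consensus terms of its cluster.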

2.4 HCG Data Statistics

The HCG database consists of 658,174 genes from 224 genomes, including 376 DNA

chains (both chromosomes and plasmids) from NCBI (release of 03/05/2005). Among the

658,174 genes, 609,887 genes are assigned HCG codes. 139,495 genes have COG

numbers extracted from the NCBI database, and 459,955 genes are assigned COG

numbers by running COGNITOR [21]. When comparing the COGNITOR-calculated

COG numbers with the NCBI-assigned COG numbers, we noticed that only

108,620 genes have the same COG numbers, and the other 30,875 genes have different COG

numbers. This inconsistency most likely comes from the multiple COG numbers returned

by COGNITOR. 318,326 genes have been assigned GO terms in [10].
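The counts in this paragraph can be cross-checked with a few lines of Python. The numbers are taken directly from the text; the variable names and the derived percentages are ours and are given only as a sanity check.

```python
total_genes = 658_174
with_hcg_code = 609_887
ncbi_cog = 139_495        # genes with COG numbers from NCBI
same_cog = 108_620        # COGNITOR agrees with NCBI
different_cog = 30_875    # COGNITOR disagrees with NCBI

# The agreeing and disagreeing genes account for every NCBI-assigned gene.
assert same_cog + different_cog == ncbi_cog

hcg_coverage = with_hcg_code / total_genes
cog_agreement = same_cog / ncbi_cog
print(f"HCG coverage: {hcg_coverage:.1%}, COG agreement: {cog_agreement:.1%}")
```

In other words, roughly 93% of all genes receive an HCG code, and COGNITOR reproduces the NCBI COG number for roughly 78% of the genes that have one.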

HCG has 51,205 clusters of genes (numbered consecutively in an arbitrary

manner, as are the sub-clusters, sub-sub-clusters, etc.), organized into 21,109 HCG

trees. Among these trees, 2,092 trees have more than 50 genes, totaling 518,703 genes.

10,716 trees are annotated with text descriptions, covering 568,717 genes. 4,877 trees are

annotated with cluster GO terms, covering 500,996 genes. 4,330 trees have both cluster

GO terms and text descriptions, covering 497,350 genes. 182,670 genes that are not

annotated in Integr8 [10] are successfully annotated by HCG; and for those genes that are

annotated by both HCG and Integr8, most of them are annotated with more specific GO

terms in HCG than in Integr8. By combining both the text description and cluster GO

annotation, a clear function description of each gene can be inferred.

The HCG database is implemented using MySQL 4.0.18, running on a SuSE 9.0

Linux computer with 4GB memory and two 2.8GHz XEON processors. A web interface,

hosted by an Apache 2.0.40 web server, has been developed to facilitate access to the

database through the Internet. The PHP server-side scripting language is used to create dynamic

web pages. The response time for browsing most pages of the HCG database server is

less than one second, while the response time of the “query” page depends on the

complexity of the query, which is typically within a couple of seconds.

3. Utility and Discussion

3.1 Web Access

The HCG database can be accessed at http://csbl.bmb.uga.edu/HCG. A user can retrieve

data using one of the following four methods. The first one is to browse HCG in a

hierarchical way. The user can start from the virtual root of the “forest” to list all the trees.

From this list, the user can select a tree that he/she may want to browse, and then go to its

descendant clusters. The second method is to browse the gene annotation for each species. The

user can select a specific species and a chain, and browse the HCG annotation page by

page. The third method is to search the HCG database for genes using keywords selected

from a pre-prepared list of fields. The user can specify the value of any gene attribute,

such as the words in the product description, the HCG number of the genes, or a species

name, etc. The user can also create a combination of these conditions by using “AND”

and “OR”. In the fourth method, the user can submit his/her own protein sequence to the

server to find the related HCG IDs, and then annotate the sequence using the GO terms and

text descriptions associated with the returned HCG IDs. Figure 2 shows a workflow for

page browsing and a few screenshots of HCG in use.
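The third search method, combining field conditions with “AND” and “OR”, can be sketched as a small translator from conditions to a parameterized SQL WHERE clause. This is not the server's PHP code: the column names, the `FIELD_TO_COLUMN` mapping and the function name `build_where` are hypothetical. The operator labels (“==”, “include”, “Begin_With”) and the sample query are the ones that appear in the application examples of Section 3.3.

```python
# Hypothetical mapping from Query Builder field labels to column names.
FIELD_TO_COLUMN = {
    "GI": "gi",
    "Gene": "gene_name",
    "Product": "product",
    "HCG": "hcg_code",
    "Species_Name": "species_name",
}


def build_where(conditions):
    """Translate (connector, field, operator, value) tuples into SQL plus parameters.

    The connector of the first condition is ignored.
    """
    clauses, params = [], []
    for index, (connector, field, operator, value) in enumerate(conditions):
        column = FIELD_TO_COLUMN[field]  # unknown fields raise KeyError
        if operator == "==":
            sql, param = f"{column} = ?", value
        elif operator == "include":
            sql, param = f"{column} LIKE ?", f"%{value}%"
        elif operator == "Begin_With":
            sql, param = f"{column} LIKE ?", f"{value}%"
        else:
            raise ValueError(f"unsupported operator: {operator}")
        if index > 0:
            if connector not in ("AND", "OR"):
                raise ValueError(f"unsupported connector: {connector}")
            clauses.append(connector)
        clauses.append(f"({sql})")
        params.append(param)
    return " ".join(clauses), params


where, params = build_where([
    (None, "Gene", "==", "bioA"),
    ("OR", "Product", "include", "7,8-diaminopelargonic"),
])
```

Passing user input as parameters rather than pasting it into the SQL string avoids injection problems. The sketch ignores the explicit parenthesis checkboxes of the real Query Builder and does not escape LIKE wildcards inside the user's value, so it should be read as an outline only.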

3.2 Gene Annotation at Multiple Resolutions by HCG

As discussed in [14], the multi-level classification scheme provides substantially more

information than the one- or two-level classification schemes such as COG [1] and

Pfam [2].

Figure 3 shows the structure of the HCG tree rooted at cluster “HCG-21” and its

descendent clusters. Among the 1,294 genes included in cluster HCG-21, 1,089 genes are

assigned with GO terms; and 98.3% and 97.6% of the 1,089 genes are annotated as

GO:0000155 (two-component sensor activity) and GO:0005524 (ATP binding activity),

respectively. Hence the biological functions of the HCG-21 genes can be summarized

using GO:0000155 and GO:0005524; and those HCG-21 genes without an identified

biological function are predicted to have the biological functions defined by the cluster,

i.e., GO:0000155 and GO:0005524.

Compared with these GO annotations assigned to the root-level cluster, the hierarchical

structure of HCG-21 provides much richer functional information to genes in the lower-

level sub-clusters of this cluster. For example, a large portion of genes in HCG-21 are

further partitioned into 38 child-level clusters labeled as “HCG-21.0” to “HCG-21.37”.

The numbers of genes in these clusters range from 3 to 91. Almost all of these child-level

sub-clusters are annotated with more specific functions than their parent cluster “HCG-21”,

using GO terms and NCBI-based text descriptions.

As we demonstrate using the following examples, genes in the same child cluster

have a stronger functional relationship than genes in the parent

cluster. Cluster “HCG-21.0” contains 91 kdpD genes, all of which are the sensor genes

for the high-affinity potassium transport system; and cluster “HCG-21.4” contains 46 phoR

genes, which are all the sensor genes in the phosphate regulons. Some of the other child-

level clusters each contain genes of similar but distinct biological functions, which are

then further divided into a group of grandchild-level sub-clusters containing genes with

equivalent functions with higher resolution. For example, the cluster “HCG-21.3”

contains 49 genes annotated as either “cpxA” (the envelope stress sensor genes) or “envZ”

(the osmolarity sensor genes). At its child level, the genes of “HCG-21.3” are further

grouped into two smaller sub-clusters, “HCG-21.3.0” and “HCG-21.3.1”, which contain

“cpxA” and “envZ” genes, respectively. The fact that these “cpxA” and “envZ” genes are

grouped in the same cluster “HCG-21.3” suggests that the cpxA and envZ genes are more

equivalent to each other than they are to other genes, which is supported by their NCBI

annotation, where both “cpxA” and “envZ” genes are annotated as sensing extracellular

pressure, with “envZ” genes sensing the pressure from water (i.e., osmolarity). The same

can be said about another child-level cluster “HCG-21.2”, which contains 52 genes

annotated as either “vanS” or “resE”. At the grandchild level, these “vanS” and “resE”

genes are further grouped into two smaller clusters, “HCG-21.2.0” and “HCG-21.2.1”,

which contain “vanS” and “resE” genes, respectively. Among the 1,294 HCG-21 genes,

689 cannot be further grouped into lower-level clusters, suggesting that these genes can

only be annotated at low resolution, i.e., “two-component sensor activity” and “ATP

binding activity”, because of the high functional diversity of these genes.
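The cluster labels used above suggest that a dotted HCG code spells out the path from the root cluster down to the cluster itself (“HCG-21” to “HCG-21.3” to “HCG-21.3.0”). Assuming that reading of the codes is correct, the following Python sketch recovers every resolution level at which a gene can be annotated. The function names `lineage` and `is_within` are hypothetical.

```python
def lineage(hcg_code):
    """Return cluster codes from the root down to the given cluster."""
    code = hcg_code[4:] if hcg_code.startswith("HCG-") else hcg_code
    parts = code.strip(".").split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def is_within(hcg_code, ancestor):
    """True if the cluster is the given ancestor or one of its descendants."""
    return lineage(ancestor)[-1] in lineage(hcg_code)


levels = lineage("HCG-21.3.0")
```

For “HCG-21.3.0” the levels are “21” (two-component sensors in general), “21.3” (cpxA or envZ) and “21.3.0” (cpxA). Comparing whole path components rather than raw string prefixes keeps a code such as “21.30” from being mistaken for a descendant of “21.3”.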

Interestingly, while the annotations derived from NCBI descriptions match well with

our gene clusters, the GO annotations we derived from the GOA database are not as

specific. For example, most genes in cluster “HCG-21” are assigned two GO terms:

GO:0000155 (two-component sensor activity) and GO:0005524 (ATP binding activity),

so we cannot make any specific GO assignment for any of the offspring clusters of

“HCG-21”. However, since we have used different information sources in our

gene/cluster annotation, we have achieved annotations with higher specificity. This also

indicates that to get more specific gene function annotation, one should look at more

information sources. It should be noted that although our GO-based and NCBI-based

annotations do not conflict, in general the GOA-based annotations are not as specific

as the NCBI-based ones.

3.3 Application Examples

We now illustrate how to use the HCG database and demonstrate the power of the HCG

system for functional prediction of genes, using the following examples.

Example 1: find the function of a gene. Suppose we want to find out the function of

gene “GI-16801886” of Listeria innocua Clip11262. The gene product is labeled as a

“hypothetical protein” in the NCBI database. The COG number of this gene is COG0745,

which represents the gene class of “response regulators consisting of a CheY-like

receiver domain and a winged-helix DNA-binding domain”. Clearly, this annotation is

not particularly useful as there are 3,119 genes assigned this COG number across the

224 genomes covered by HCG. The GO annotations of this gene are GO:0000156

(two-component response regulator activity) and GO:0003677 (DNA binding), which are not

very specific either, as 3,866 genes in HCG are annotated with both GO terms. To use

the HCG system to derive more specific functional information about this gene, a user can

use the following steps.

1) Go to the HCG main page at http://csbl.bmb.uga.edu/HCG, and then click the link

“Search” to bring up the “Query Builder” page.

2) Fill the query information with “GI == 16801886”, and leave the other entries

blank. Then click “Submit” to query the database.

3) The search will return the HCG code of gene 16801886 as “10.3.0” in the result

page. Then click the link “10.3.0” to bring up the annotation page for this HCG

cluster.

4) In the annotation page of “10.3.0”, the gene name for “10.3.0” is “kdpE”, and

there are also four descriptions of the function of “10.3.0”: i) “kdp

operon transcriptional regulatory protein kdpE”; ii) “two-component regulatory

protein response regulator kdpE”; iii) “putative turgor pressure regulator”; iv)

“probable transcriptional regulator”. The first two annotations indicate a more specific

function while the other two indicate a general function. HCG has extracted all

four descriptions because the scores of all of them are above our threshold.

Clearly HCG provides much more specific functional information about this gene than

the other functional classification databases.

Example 2: find a gene which carries a specific function. Suppose we want to find out

which gene in Vibrio fischeri ES114 encodes the protein “bioA”, which is important in

biotin synthesis. Another name for bioA is “7,8-diaminopelargonic acid

synthetase”. To find “bioA” in “Vibrio fischeri ES114”, the user needs to do the following.

1) First we need to find out which HCG cluster represents “bioA” genes. To do this,

go to the HCG main page at http://csbl.bmb.uga.edu/HCG, and then click the link

“Search” to bring up the “Query Builder” page.

2) Set the query information with “(Gene == bioA) OR (Product include 7,8-diaminopelargonic)”,

and leave the other entries blank. Then click “Submit” to

query the database. To construct the query, the user needs to click the checkboxes

corresponding to “(” and “)” in condition1 and condition2. The user also needs to select the

radio button corresponding to “Or” in condition2.

3) The user should now see the returned gene cluster labeled as cluster “69”, and

some of its genes further clustered into “69.1”, “69.4”, “69.5” and “69.8”, etc.

Many of these genes are annotated as “adenosylmethionine-8-amino-7-oxononanoate

(KAPA) aminotransferase”. It should be noted that the reactant of the

“7,8-diaminopelargonic acid” (DAPA) synthesis reaction is

“7-keto-8-aminopelargonic acid” (another name for KAPA). Since these genes are from

several different bacterial genomes, one needs to find the gene in the right

genome. The user should click the link “69” to bring up the annotation page of this

HCG cluster.

4) By checking the annotation pages of HCG cluster “69” and some annotation

pages of its children like “69.1”, “69.4”, “69.5” and “69.8”, the user can see that

the child clusters are annotated as “bioA”. By checking the genes in the

child clusters, the user should be able to see why they are further clustered:

the genes in the same cluster belong to more closely related species.

5) Therefore one can determine that some child clusters of “69” are related to

“bioA”, and their parent cluster “69” might include “bioA” homologs. Now the

user should go back to the “Query Builder” page at

http://192.168.0.3/HCG/query_builder.php, and enter the query “(HCG

Begin_With 69.) AND (Species_Name include Vibrio fischeri ES114)”, and

submit.

6) In the result page, the user should be able to see three genes NCBI:59712891,

NCBI:59713931 and NCBI:59714306 in cluster “69”. Their HCG codes are

“69.2”, “69.1.0.0” and “69.6”, respectively. Checking the annotations of these

three HCG clusters shows that only “69.1.0.0” is annotated as “bioA”, so the user should be

able to confidently conclude that gene NCBI:59713931 encodes “bioA” in

Vibrio fischeri ES114, and its enzyme name is either “7,8-diaminopelargonic

acid (DAPA) synthetase” or

“adenosylmethionine-8-amino-7-oxononanoate (KAPA) aminotransferase”.

Example 3: annotate the function of new genes from a newly sequenced genome. Two new

cyanobacterial genomes have been recently sequenced by Grossman’s lab (personal

communication), and these genomes are not included in the current release of HCG. Here we

use gene NCBI:86604767 as an example to illustrate how to use HCG to annotate the

function of a new gene.

1) Go to the HCG main page at http://csbl.bmb.uga.edu/HCG, and then click the link

“MyHCG” to bring up the sequence input page;

2) Enter the sequence of gene NCBI:86604767, and click “submit”;

3) HCG returns 10 genes in the database as hits with cluster “6514.” ranked as the

No. 1 hit.

4) Click the link “6514.” to open the annotation page for this cluster, where we find

that the description is “photosystem I subunit XI” and the gene name is “psaL”;

5) The user can also click the link “Display Hit Genes” to display all the hit genes;

and the descriptions for these genes are “photosystem I subunit XI” or

“photosystem I reaction center subunit XI”;

6) The function information obtained from both 4) and 5) can be used to annotate

gene NCBI:86604767.
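
Steps 4) to 6) amount to transferring the description that the hit genes agree on. A minimal sketch of such a tally is given below. The hit list is illustrative only: the labels `hit_1` to `hit_3` and the split between the two descriptions are assumptions, not HCG output, and `consensus_description` is a hypothetical helper.

```python
from collections import Counter

# Illustrative hit list, NOT real HCG output: labels and counts are placeholders.
hits = [
    ("hit_1", "photosystem I subunit XI"),
    ("hit_2", "photosystem I reaction center subunit XI"),
    ("hit_3", "photosystem I subunit XI"),
]


def consensus_description(hits):
    """Return the most frequent description and the fraction of hits carrying it."""
    counts = Counter(description for _, description in hits)
    description, n = counts.most_common(1)[0]
    return description, n / len(hits)


description, support = consensus_description(hits)
print(f"{description} (carried by {support:.0%} of the hits)")
```

A plain majority vote treats near-synonymous descriptions as different strings, so in practice the user would still read the competing descriptions, as in step 5).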

A user can also send the sequence to COGNITOR. For this example, it returned “NO

related COG”, suggesting that COG does not have an annotation for this gene. We also sent the

sequence to the Pfam server, which returned “PF02605”, representing “Photosystem I

reaction centre subunit XI”, which is consistent with the HCG annotation. We noted that

KEGG does not allow such data retrieval.

4. Conclusion

We have developed a database, HCG, for hierarchical classification of functionally

equivalent genes, which can be used to annotate genes at multiple resolutions, depending

on the availability of related data. The HCG system is based on a new method for

prediction of functional relationship through combining information of sequence

similarity and genomic context. The hierarchical organization of genes, grouped together

with other functionally equivalent genes, facilitates functional annotations of new genes

with higher accuracy compared to other functional classification schemes. We plan to

extend this system to include all complete prokaryotic genomes in the very near future

and to update it on a regular basis (monthly). We expect that this new system for gene

annotation will provide a powerful tool for genome analysis and annotation to the

biological community.

Availability and requirements

The database can be accessed at http://csbl.bmb.uga.edu/HCG. Users who want to

analyze the whole database can download the classification data at

http://csbl.bmb.uga.edu/HCG/HCG.tar.gz. The database is freely available for academic

users; non-academic users should contact the corresponding author to obtain a license.

Any modern web browser should be capable of using the online database server.
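
For bulk analysis, the downloaded archive can be inspected with standard tools. The sketch below assumes that the archive has already been saved locally under the name `HCG.tar.gz`. The internal layout of the archive is not described here, so the code only lists member names and makes no assumption about file formats; `list_archive` is a hypothetical helper.

```python
import os
import tarfile


def list_archive(path):
    """Return the names of all members of a gzip-compressed tar archive."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()


# Assumption: the archive was downloaded beforehand into the working directory.
if os.path.exists("HCG.tar.gz"):
    for name in list_archive("HCG.tar.gz"):
        print(name)
```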

Authors' contributions

Fenglou Mao designed the database and implemented the online server; Fenglou Mao and

Hongwei Wu worked together to generate the data of HCG; Victor Olman designed the

hierarchical clustering program; Ying Xu coordinated the whole procedure and provided

the financial support.

Acknowledgement

This work was supported in part by the National Science Foundation (NSF/DBI-0354771,

NSF/ITR-IIS-0407204, NSF/DBI-0542119) and by a “Distinguished Scholar” grant from

the Georgia Cancer Coalition.

References

1. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631-637.

2. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R et al: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue):D247-251.

3. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L et al: InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33(Database issue):D201-205.

4. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32(Database issue):D277-280.

5. Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralta-Gil M, Karp PD: EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res 2005, 33(Database issue):D334-337.

6. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33(17):5691-5702.

7. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C et al: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32(Database issue):D258-261.

8. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32(Database issue):D262-266.

9. Markowitz VM, Korzeniewski F, Palaniappan K, Szeto E, Werner G, Padki A, Zhao X, Dubchak I, Hugenholtz P, Anderson I et al: The integrated microbial genomes (IMG) system. Nucleic Acids Res 2006, 34(Database issue):D344-348.

10. Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I et al: Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res 2005, 33(Database issue):D297-302.

11. Fulton DL, Li YY, Laird MR, Horsman BG, Roche FM, Brinkman FS: Improving the specificity of high-throughput ortholog prediction. BMC Bioinformatics 2006, 7:270.

12. Wall DP, Fraser HB, Hirsh AE: Detecting putative orthologs. Bioinformatics 2003, 19(13):1710-1711.

13. Mao F, Su Z, Olman V, Dam P, Liu Z, Xu Y: Mapping of orthologous genes in the context of biological pathways: An application of integer programming. Proc Natl Acad Sci U S A 2006, 103(1):129-134.

14. Wu H, Mao F, Olman V, Xu Y: Hierarchical Classification of Functionally Equivalent Genes of Prokaryotes. Nucleic Acids Res 2007 (accepted).

15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.

16. Smith TF, Waterman MS: Comparison of biosequences. Advances in Applied Mathematics 1981, 2(4):482-489.

17. Price MN, Huang KH, Alm EJ, Arkin AP: A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res 2005, 33(3):880-892.

18. Chen X, Su Z, Dam P, Palenik B, Xu Y, Jiang T: Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome. Nucleic Acids Res 2004, 32(7):2147-2157.

19. Chen X, Su Z, Xu Y, Jiang T: Computational Prediction of Operons in Synechococcus sp WH8102. Proceedings of 15th International Conference on Genome Informatics 2004:211-222.

20. Zheng Y, Anton BP, Roberts RJ, Kasif S: Phylogenetic detection of conserved gene clusters in microbial genomes. BMC Bioinformatics 2005, 6:243.

21. Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000, 28(1):33-36.

Figure 1: HCG database table relationship

Figure 2: A screenshot of the HCG browser

Figure 3: The tree structure of cluster HCG-21, consisting of a group of two-component sensors. A circle represents a cluster which cannot be further divided; a rectangle represents a cluster containing only genes from the same genome; a triangle represents a cluster that does not have genes from the same genome. Colors do not have any particular meaning here.
