Putative homeodomain proteins identified in prokaryotes based on pattern and sequence similarity



Shashi Kant,a Ashima Bagaria,b and S. Ramakumar b,c,*

a School of Biotechnology, Madurai Kamaraj University, Madurai 625021, India
b Department of Physics, Indian Institute of Science, Bangalore 560012, India
c Bioinformatics Center, Indian Institute of Science, Bangalore 560012, India

Received 13 October 2002

Abstract

A putative homeodomain has been identified in eubacterial genomes, including those of several pathogens. The domain is related in sequence to the homeodomain, a component specific to transcription factors that plays a very important role in eukaryotes, for example in controlling the developmental processes of the organism. The putative homeodomain was characterized using the eukaryotic homeodomain protein sequence signature in PROSITE together with sequence similarity searches of different eubacterial genomes using the BLAST suite. These findings provide evidence for the occurrence in prokarya of a DNA-binding motif similar to that in eukarya.

© 2002 Elsevier Science (USA). All rights reserved.

Keywords: Transcription factors; Homeodomain; Sequence pattern; Homeobox; DNA-binding domain; Prokaryote; Eukaryote; Pro-homeodomain

Homeotic genes, the master control genes that regulate developmental processes [1] in a precise spatial and temporal fashion [2] in higher organisms, share a common sequence element known as the homeobox. The homeobox is an important component of homeotic genes, and a number of these genes appear to have retained both their precisely ordered tandem arrangement in the genome and their developmental roles in axial patterning across vast evolutionary time [2]. The homeobox encodes a self-folding, stable protein domain of about 60 amino acids, the homeodomain, which is composed of three helical regions and represents the sequence-specific DNA-binding domain of much larger transcription factor proteins [2,3].

The homeodomain was first identified in Drosophila homeotic loci, where the proteins play a role in the determination of the body plan [4]. Sequences related to the homeodomain are found in several other organisms, but in moving from the original Drosophila homeodomains to mammalian homeodomains, the similarity between conserved regions drops significantly [5]. However, some residues in the sequence are almost invariant, leading to a similar structure and function for homeodomains from different organisms [6]. The homeodomain is a DNA-binding domain with three helices, in which the first and second helices are almost antiparallel to each other and the third helix is almost perpendicular to the other two [7]. The second and third helices form a helix-turn-helix motif [8].

Over the past few years, a large number of homeobox genes have been isolated from taxonomic groups ranging from yeast to human. A vast amount of sequence data on homeodomains has accumulated, providing useful and important information about the evolution of the homeobox gene family and the phylogeny of eukaryotic organisms [9,10]. In addition, a large number of gene sequences from fully or partially sequenced genomes, including those of several pathogens, are available in the public domain. However, not much is known about whether proteins similar in sequence and structure to eukaryotic homeodomains also occur in prokaryotes. An attempt to answer this question using currently available bioinformatics tools such as the BLAST series of software [11], FASTA [12], CLUSTALW [13], SCAN PROSITE [14], and PHD [15] forms the basis of the present investigation.

Biochemical and Biophysical Research Communications 299 (2002) 229–232

* Corresponding author. Fax: +91-80-360-2602/91-80-334-1683.
E-mail address: [email protected] (S. Ramakumar).

0006-291X/02/$ - see front matter © 2002 Elsevier Science (USA). All rights reserved.
PII: S0006-291X(02)02607-4

Materials and methods

The PROSITE [14] database has two homeobox domain signatures, PS00027 and PS50071. The signature PS00027 covers the second and third helices and has the pattern

[LIVMFYG]-[ASLVR]-X(2)-[LIVMSTACN]-X-[LIVM]-X(4)-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W-[FYVC]-X-[NDQTAH]-X(5)-[RKNAIMW].
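A pattern of this kind can be turned into a regular expression mechanically. The following is a minimal sketch, assuming only the pattern elements that appear in the signature above (residue classes in square brackets, the wildcard X, and repeat counts in parentheses); the function name `prosite_to_regex` is hypothetical and is not part of PROSITE or of any tool used in this work.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Only the elements used in this paper are handled: [ABC] classes,
    single residues, the wildcard X, and counts such as X(2) or X(3,8).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue part from an optional repeat count.
        m = re.fullmatch(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        token = "." if residue in ("X", "x") else residue
        if low is not None:
            token += f"{{{low},{high}}}" if high else f"{{{low}}}"
        parts.append(token)
    return "".join(parts)

# PS00027 as quoted in the text.
PS00027 = ("[LIVMFYG]-[ASLVR]-X(2)-[LIVMSTACN]-X-[LIVM]-X(4)-[LIV]-"
           "[RKNQESTAIY]-[LIVFSTNKH]-W-[FYVC]-X-[NDQTAH]-X(5)-[RKNAIMW]")
PS00027_REGEX = prosite_to_regex(PS00027)

# Synthetic 24-residue string built to satisfy every position of the pattern;
# it is an illustration only, not a real protein sequence.
SYNTHETIC_MATCH = "LAAALALAAAALRLWFANAAAAAR"
```

Searching with `re.search(PS00027_REGEX, sequence)` then reports whether a sequence carries the signature; replacing the invariant W in the synthetic string removes the match.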

Analysis of more than 1100 homeodomain sequences (http://genome.nhgri.gov/homeodomain/fasta/?domain=1) by multiple alignment methods such as CLUSTALW revealed conserved residues in the first helix as well (e.g., Leu16 and Phe/Leu20). On the basis of this finding we extended the homeodomain signature to include the first helix:

[LMFAC]-X(3)-[FYILW]-X(3,8)-[PLVIMSQKRDE]-X(4)-[KRMILWQA]-X(2)-[LVIYMFHK]-[ASDTVFLRH]-X(2)-[LTIVMFAC]-X-[LMIAVNY]-X(4)-[LVCI]-X-[LIVSKARNTHG]-W-[FYTVELI]-X-[NIAHDGKQ]-X-[RAPLSNK]-X(3)-[RGYKLWNAIV]

We then performed a pattern search in SWISSPROT and TrEMBL [16] through www.expasy.ch. The extended signature gave almost the same number of hits as the PROSITE signature PS00027 (869 hits in 865 sequences from 783 entries with the extended signature, and 869 hits in 867 sequences from 780 entries with the PROSITE signature). Some hypothetical proteins in the TrEMBL databank, such as Q09546 from Caenorhabditis elegans, also had sequences compatible with the extended signature.
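The hit counts quoted above distinguish hits from matched sequences, since one sequence can carry the signature more than once. The sketch below illustrates that bookkeeping with the extended signature translated by hand into regular-expression syntax. It assumes that one hit is counted per start position, which may differ from how the SCAN PROSITE server counts; the sequences are synthetic strings built for illustration, not database entries.

```python
import re

# Extended signature from the text, hand-translated to regex syntax.
# The lookahead lets overlapping occurrences be found; one hit is counted
# per start position (an assumption of this sketch).
EXTENDED_SIGNATURE = re.compile(
    "(?=("
    "[LMFAC].{3}[FYILW].{3,8}[PLVIMSQKRDE].{4}[KRMILWQA].{2}"
    "[LVIYMFHK][ASDTVFLRH].{2}[LTIVMFAC].[LMIAVNY].{4}[LVCI]."
    "[LIVSKARNTHG]W[FYTVELI].[NIAHDGKQ].[RAPLSNK].{3}[RGYKLWNAIV]"
    "))"
)

def count_hits(sequences):
    """Return (total hits, number of sequences with at least one hit)."""
    total_hits = 0
    matched_sequences = 0
    for seq in sequences.values():
        n = sum(1 for _ in EXTENDED_SIGNATURE.finditer(seq))
        total_hits += n
        if n:
            matched_sequences += 1
    return total_hits, matched_sequences

# Synthetic 40-residue string satisfying the signature (glycine as filler).
TOY_MOTIF = "LGGGFGGGPGGGGKGGLAGGLGLGGGGLGRWFGNGRGGGR"
toy_sequences = {
    "one_copy": TOY_MOTIF,
    "two_copies": TOY_MOTIF + TOY_MOTIF,
    "no_copy": "G" * 60,
}
hits, matched = count_hits(toy_sequences)
```

With these toy inputs the number of hits exceeds the number of matched sequences, which is the same kind of difference seen between 869 hits and 865 sequences in the database search.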

To further explore the sequence space, we used the BLAST series of programs, which are widely used for searching protein and DNA databases for sequence similarities. In particular, the Position Specific Iterated BLAST (PSI-BLAST) program is useful for identifying distant relationships by finding protein families. In the present context we carried out a genome-specific BLAST [11] search using one of these hypothetical proteins, Q09546, which gave a number of hits in human as well as in other organisms such as Candida albicans and Aspergillus sp. Surprisingly, FASTA3 [12] and PSI-BLAST searches with the Homo sapiens gene sequence NM_002586 (pre-B-cell leukemia transcription factor 2) showed a few hits in the eubacterium Staphylococcus aureus (Q932B9, SWALL code) that were compatible with the extended homeodomain sequence pattern. The sequence Q932B9 is annotated in the TrEMBL [16] database as a hypothetical protein containing the helix-turn-helix motif.

A sequence search was then carried out with the BLAST suite using Q932B9 (ref|NP_371375.1|NC_002758.1) as the query. This search detected a number of hits in different species of bacteria (Table 1), such as Staphylococcus aureus, Clostridium difficile, and Neisseria gonorrhoeae (Fig. 1), all of which are prokaryotes. Multiple sequence alignment (Fig. 1) of these sequences using the CLUSTALW program [13] revealed a sequence pattern similar to the extended homeodomain signature.
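Table 1 reports each bacterial hit as a coding region in a genome together with a reading frame. As an illustration of what a frame label means, the sketch below translates a DNA string in one of the six frames using the standard genetic code. The function name `translate_frame` is hypothetical, the input is a toy string rather than any of the genomic regions in Table 1, and the convention used here (frames +1 to +3 read the given strand from offsets 0 to 2, frames −1 to −3 read its reverse complement) is an assumption of the sketch.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate_frame(dna: str, frame: int) -> str:
    """Translate dna in reading frame +1..+3 or -1..-3.

    Negative frames are read from the reverse complement. Codons that
    contain characters other than A, C, G, T are translated as 'X'.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of +1, +2, +3, -1, -2, -3")
    dna = dna.upper()
    if frame < 0:
        dna = dna.translate(COMPLEMENT)[::-1]
    offset = abs(frame) - 1
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

# Codons that encode tryptophan in the standard code.
TRP_CODONS = [codon for codon, aa in CODON_TABLE.items() if aa == "W"]
```

A hit listed with frame −2, for example, would be read from the reverse complement of the extracted region starting at its second base under this convention.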

To further characterize the prokaryotic sequences, their secondary structure was predicted using the PHD [15] server, which indicated three consecutive helical segments interrupted by non-helical regions containing an N/G sequence pattern at the 22nd and 33rd positions (Fig. 1). Hydropathy analysis of the helical segments using the HELICALWHEEL module of the Wisconsin GCG package [17] identified the helices as amphipathic (Fig. 2). Attempts were made to predict the 3D structure of Q932B9 and the other prokaryotic sequences (Fig. 1) using various 3D model prediction servers such as ModBase (http://alto.rockfeller.edu/modbasecgi/index.cgi), 3DPSSM (http://www.sbg.bio.ic.ac.uk/servers/3dpssm), and 3DJIGSAW (http://www.bmm.icnet.uk/servers/3djigsaw). However, it was not possible to generate a reliable 3D model because no known protein 3D structure is significantly similar in sequence to the prokaryotic sequences.
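Amphipathicity of a helix, judged here from helical wheel plots, can also be expressed as a number. The sketch below computes a mean hydrophobic moment on an ideal α-helix (100° rotation per residue) with the Kyte-Doolittle hydropathy scale. This is a swapped-in illustration and not the HELICALWHEEL procedure of the GCG package; the function name and the choice of scale are assumptions of the sketch.

```python
import math

# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(helix: str, angle_deg: float = 100.0) -> float:
    """Mean hydrophobic moment per residue of a helical segment.

    Each residue contributes a vector whose length is its hydropathy and
    whose direction is its angular position on the helical wheel. A large
    value means hydrophobic residues cluster on one face of the helix.
    """
    sum_x = sum_y = 0.0
    for i, residue in enumerate(helix):
        hydropathy = KYTE_DOOLITTLE[residue]
        theta = math.radians(i * angle_deg)
        sum_x += hydropathy * math.cos(theta)
        sum_y += hydropathy * math.sin(theta)
    return math.hypot(sum_x, sum_y) / len(helix)
```

A segment with leucine on one face and lysine on the other gives a large moment, whereas a uniformly hydrophobic segment gives a moment near zero, so the value separates amphipathic helices from merely hydrophobic ones.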

Results and discussions

It is already known that the eukaryotic homeodomain has amphipathic helices and that these helices interact with each other in three dimensions, forming a hydrophobic core. It is interesting to note that a typical prokaryotic putative homeodomain sequence (Q932B9) also contains amphipathic helices (Fig. 2), which may be expected to interact with each other to form a hydrophobic core, as depicted in Fig. 2.

The CLUSTALW results (Fig. 1) showed that many of these sequences have conserved Leu and Phe/Leu residues separated by three residues in the first helix, as in the homeodomain (www.bioinfo.de/isb/1999/01/main.htm). Tryptophan (W), which has only one codon, is not a very common residue in proteins, and is the least mutable amino acid, was present at the same position in almost all putative homeodomain sequences (Fig. 1). The functional significance of the conserved tryptophan is further supported by the fact that Trp48 interacts with Leu16 and Phe/Leu20 of the first helix and Leu31 of the second helix, and that these interactions play a very important role in the eukaryotic homeodomain 3D structure [9,18]. The residues conserved at equivalent positions in the bacterial sequences might play the same role as in the eukaryotic homeodomain 3D fold.

Table 1
References to the prokaryotic sequences (Fig. 1) considered for multiple alignment

Sequence        Organism                                Protein identification                            Coding region in the genome  Frame
Seq 1           Staphylococcus aureus aureus N315 (Sa)  ref|NC003140.1                                    22,523–22,684                +2
Seq 2           Bacillus anthracis strain AMES (Ba)     gnl|TIGR_198094|contig:6615:b_anthracis           1,013,325–1,013,504          +3
Seq 3           Staphylococcus epidermidis (Se)         gnl|TIGR_1282|407                                 672,424–672,603              +1
Seq 4           Geobacillus stearothermophilus (Bs)     gnl|OUACGT_1422|bstearo.fasta.screen.Contig 375   5391–5230                    −2
Seq 5           Enterococcus faecalis (Ef)              gnl|TIGR_1351|glf_11370                           2,234,369–2,234,196          −1
Seq 6           Clostridium difficile (Cd)              gnl|SANGER_1496|Contig 30                         68,031–68,204                +3
Seq 7           Listeria monocytogenes (Lm)             gnl|TIGR_1639|contig:761:1_monocytogenes-4b       905,763–905,584              −2
Seq 8           Streptococcus mutans (Sm)               gnl|OUACGT_1309|smutans.fasta.screen.Contig 2     741,652–741,831              +1
Query sequence  Staphylococcus aureus strain Mu50 (Sa)  ref|NC_002758.1|NP_371375.1                       -                            -
Seq 9           Neisseria gonorrhoeae (Ng)              gnl|OUACGT_485|Ngon_contig 1                      464,684–464,529              −3

The possibility that the three helices of the prokaryotic sequences form a hydrophobic core, taken together with the presence of a conserved tryptophan (W) residue that can stabilize the characteristic 3D fold as in eukaryotic homeodomains, suggests that the prokaryotic sequences are likely to assume a fold similar to that of eukaryotic homeodomains.

Based on these findings, we propose the presence of a putative homeodomain in eubacteria. It will be of interest to find out its exact role in prokaryotic cells. Structural similarity with the eukaryotic homeodomain suggests a probable role as a transcription regulator in cell division processes, an important prokaryotic event that has a counterpart in the eukaryotic developmental program. The BLAST search also revealed many hits in Staphylococcus species, which are mostly pathogenic. The genome context of the gene for Q932B9 from Staphylococcus aureus aureus was investigated (www.ebi.ac.uk), and the region where it is located was found to contain genes that code for a single-strand DNA-binding protein (BAB57026), an excisionase (BAB57010), an integrase (BAB57009), and some hypothetical proteins. Thus, based on gene context, a role requiring interaction with DNA may be expected for the pro-homeodomain.

Conclusions

Our work, which combines motif-based assignments with BLAST database searches in a semi-automatic protocol with a view to identifying distant relationships among organisms, has revealed pro-homeodomain or homeodomain-like sequences in prokaryotes, which may be expected to have a 3D structure similar to that of eukaryotic homeodomains.

A complex regulatory switch in eukaryotes requires varied forms of protein–DNA interactions. In eukaryotic homeodomain proteins, the N-terminal region participates in the recognition of DNA to augment specificity [19,20]. Because regulation in prokaryotes is comparatively simpler, this N-terminal specificity might not be required in prokaryotic homeodomain-like proteins. Future studies of the role of these pro-homeodomains may unravel many interesting features in prokaryotes. Most interestingly, since these proteins are also seen in pathogenic bacteria, they may be suitable targets for drug design, provided they play some crucial role in cellular events. Finally, our work also outlines a generally applicable method that combines pattern and sequence similarity searches for the identification of families and the detection of distant relationships in proteins.

Fig. 1. Multiple sequence alignment of the homeodomain-like proteins of prokaryotes showing the region of alignment. The representative sequences (Table 1) were aligned using the CLUSTALW [13] program and the alignment was refined manually. The shading of conserved residues is according to the consensus and includes residues conserved in at least 78% of the aligned sequences. In the consensus line, hydrophobic residues (L, I, V, M, A, F, W, S, and T) are represented by h and polar residues (K, R, D, E, Q, S, T, H, and N) are represented by p. The secondary structure prediction was performed using the PHD program [15]. On the multiple alignment, H indicates α-helix. Abbreviations: Ba, Bacillus anthracis; Bs, Geobacillus stearothermophilus; Cd, Clostridium difficile; Ef, Enterococcus faecalis; Lm, Listeria monocytogenes; Ng, Neisseria gonorrhoeae; Sa, Staphylococcus aureus; Se, Staphylococcus epidermidis; Sm, Streptococcus mutans.

Fig. 2. Helical wheel [17] representation of the three helices of Staphylococcus aureus Q932B9 (SWALL code). The three helical segments were as predicted by the PHD server [15]. A schematic representation of the possible association of the three helices constituting the hydrophobic core is shown. Hydrophobic residues are enclosed in a rectangle.
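The consensus line described in the Fig. 1 caption can be sketched in a few lines. The example below applies the 78% threshold and the hydrophobic (h) and polar (p) residue classes given in the caption to a small toy alignment. The order of the checks (identical residue first, then h, then p) and the use of "." for unconserved columns are assumptions of this sketch; the caption lists S and T in both classes, so here they are counted as hydrophobic first.

```python
from collections import Counter

# Residue classes as listed in the Fig. 1 caption.
HYDROPHOBIC = set("LIVMAFWST")
POLAR = set("KRDEQSTHN")

def consensus(alignment, threshold=0.78):
    """Build a consensus line for equal-length aligned sequences.

    A column yields the residue itself if one residue reaches the
    threshold, 'h' or 'p' if a residue class reaches it, and '.' otherwise.
    """
    n = len(alignment)
    line = []
    for column in zip(*alignment):
        residue, top = Counter(column).most_common(1)[0]
        if residue != "-" and top / n >= threshold:
            line.append(residue)
        elif sum(aa in HYDROPHOBIC for aa in column) / n >= threshold:
            line.append("h")
        elif sum(aa in POLAR for aa in column) / n >= threshold:
            line.append("p")
        else:
            line.append(".")
    return "".join(line)

# Toy alignment for illustration only; these are not the sequences of Fig. 1.
toy_alignment = ["LWKA", "LWRG", "IWDP", "LWE-", "VWQC"]
toy_consensus = consensus(toy_alignment)
```

With five toy sequences, a residue or class must appear in at least four of them to reach the 78% threshold.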

Acknowledgments

We thank Mr. Raju Mukherjee, Mr. Kalyan Kumar Sinha, and Mr. Rudresh for useful discussions. Access to the Bioinformatics Centre and the Interactive Graphics Facility, both funded by the Department of Biotechnology (DBT), is gratefully acknowledged. A.B. thanks the Council of Scientific and Industrial Research (India) for a fellowship. Some of the sequence data for this project were obtained from The Institute for Genomic Research website at http://www.tigr.org. Our sincere thanks to all the funding agencies that have supported the sequencing projects at TIGR.

References

[1] M. Billeter, Homeodomain-type DNA recognition, Prog. Biophys. Mol. Biol. 66 (1996) 211–225.
[2] W.J. Gehring, M. Affolter, T. Bürglin, Homeodomain proteins, Annu. Rev. Biochem. 63 (1994) 487–526.
[3] Y.Q. Qian, M. Billeter, G. Otting, M. Müller, W.J. Gehring, K. Wüthrich, The structure of the Antennapedia homeodomain determined by NMR spectroscopy in solution: comparison with prokaryotic repressors, Cell 59 (1989) 573–580.
[4] B. Lewin, Genes VII, seventh ed., Oxford University Press, Oxford, 2000.
[5] C. Sander, R. Schneider, Database of homology-derived structures and the structural meaning of sequence alignment, Proteins 9 (1991) 56–68.
[6] S. Banerjee-Basu, E.S. Ferlanti, J.F. Ryan, A.D. Baxevanis, The homeodomain resource: sequences, structures and genomic information, Nucleic Acids Res. 27 (1999) 336–337.
[7] D.S. Wilson, B. Guenther, C. Desplan, J. Kuriyan, High-resolution crystal structure of a paired class cooperative homeodomain dimer on DNA, Cell 82 (1995) 709–719.
[8] S.C. Harrison, A.K. Aggarwal, DNA recognition by proteins with the helix-turn-helix motif, Annu. Rev. Biochem. 59 (1990) 933–969.
[9] S. Banerjee-Basu, A.D. Baxevanis, Molecular evolution of the homeodomain family of transcription factors, Nucleic Acids Res. 29 (2001) 3258–3269.
[10] C. Kappen, Analysis of a complete homeobox gene repertoire: implications for the evolution of diversity, Proc. Natl. Acad. Sci. USA 97 (2000) 4481–4486.
[11] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389–3402.
[12] W.R. Pearson, D.J. Lipman, Improved tools for biological sequence analysis, Proc. Natl. Acad. Sci. USA 85 (1988) 2444–2448.
[13] J.D. Thompson, D.G. Higgins, T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. 22 (1994) 4673–4680.
[14] K. Hofmann, P. Bucher, L. Falquet, A. Bairoch, The PROSITE database, its status in 1999, Nucleic Acids Res. 27 (1999) 215–219.
[15] B. Rost, C. Sander, Combining evolutionary information and neural networks to predict protein secondary structure, Proteins 19 (1994) 55–72.
[16] A. Bairoch, R. Apweiler, The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000, Nucleic Acids Res. 28 (2000) 45–48.
[17] D.D. Womble, GCG: the Wisconsin package of sequence analysis programs, Methods Mol. Biol. 132 (2000) 3–22.
[18] C.R. Kissinger, Crystal structure of an engrailed homeodomain–DNA complex at 2.8 Å resolution: a framework for understanding homeodomain–DNA interactions, Cell 63 (1990) 579–590.
[19] S.E. Ades, R.T. Sauer, Specificity of minor-groove and major-groove interactions in a homeodomain–DNA complex, Biochemistry 34 (1995) 14601–14608.
[20] M. Sharkey, Y. Graba, M.P. Scott, Hox genes in evolution: protein surfaces and paralog groups, Trends Genet. 13 (1997) 145–151.