PROTEINS: Structure, Function, and Bioinformatics
Protein structure mining using a structural alphabet
M. Tyagi,1 A. G. de Brevern,2 N. Srinivasan,1,3 and B. Offmann1*
1 Laboratoire de Biochimie et Genetique Moleculaire, Bioinformatics Team, Universite de La Reunion, BP 7151,
15 avenue Rene Cassin, 97715 Saint Denis Messag Cedex 09, La Reunion, France
2 INSERM UMR-S 726, Equipe de Bioinformatique et Genomique Moleculaire (EBGM), Universite Paris Diderot - Paris 7,
case 7113, 2, place Jussieu, 75251 Paris Cedex 05, France
3 Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
INTRODUCTION
The Protein Data Bank (PDB)1 currently offers more than 43,000 protein structures in the
public domain. These data hold a wealth of information that is critical to our understanding
of protein function, evolution, and sequence-to-structure relationships, which in turn leads
to improved solutions for structure prediction and validation. Mining information from this
large volume of data is crucial, and most of the time it involves measuring the structural
similarity between two or more proteins. The steady increase in the number of known
structures has made manual inspection of every protein in the PDB impossible, with the rare
exception of the SCOP2 database, which is based on manual classification of protein
domains. To overcome this limitation, many groups have developed structure comparison
methods.3,4
Structure comparison between two proteins has been of major interest since Perutz5 used
structure alignment to highlight the structural similarities between myoglobin and
hemoglobin despite their low sequence similarity. The common functionality of these two
proteins is not accidental; it is explained by their evolutionary relationship and shared
structural features. Structure comparison methods aim to find such structural similarities
in order to predict a protein's function, especially at low sequence
M. Tyagi’s current address is Computational Biology Branch, National Center for Bio-
technology Information (NCBI), National Library of Medicine (NLM), 8600 Rockville
Pike, Bethesda, Maryland 20894, USA
Abbreviations: PBs, protein blocks; SSEs, secondary structure elements; SA, structural
alphabet; LA, local alignment; GA, global alignment; PBE, protein block expert; rmsd,
root mean square deviation; EVD, extreme value distribution.
Grant sponsors: Conseil Regional de La Reunion (PhD grant to M.T.), French Institute
for Health and Medical Care (INSERM), University de la Reunion, University Paris
Diderot.
*Correspondence to: B. Offmann, Laboratoire de Biochimie et Genetique Moleculaire,
Bioinformatics Team, Universite de La Reunion, BP 7151, 15 avenue Rene Cassin, 97715
Saint Denis Messag Cedex 09, La Reunion, France.
E-mail: [email protected]
Received 16 August 2006; Revised 20 July 2007; Accepted 13 August 2007
Published online 14 November 2007 in Wiley InterScience (www.interscience.wiley.com).
DOI: 10.1002/prot.21776
ABSTRACT
We present a comprehensive evaluation of a new structure mining method called PB-ALIGN. It
is based on encoding a protein structure as a 1D sequence drawn from 16 short structural
motifs, or protein blocks (PBs). PBs are short motifs capable of representing most of the
local structural features of a protein backbone. Using a derived PB substitution matrix and
a simple dynamic programming algorithm, PB sequences are aligned in the same way as amino
acid sequences to yield a structure alignment. Aligning these local features as a sequence
of symbols enables fast detection of structural similarities between two proteins. The
ability of the method to characterize and align regions beyond regular secondary structures,
for example, the N and C caps of helices and the loops connecting regular structures, puts
it a step ahead of existing methods, which rely strongly on secondary structure elements.
PB-ALIGN achieved an efficiency of 85% in extracting the true fold from a large database of
7259 SCOP domains and was successful in 82% of cases in identifying true super-family
members. In a comparison with 13 existing structure comparison/mining methods, PB-ALIGN
emerged as the best on the general ability test dataset and was on par with methods like
YAKUSA and CE on the nontrivial test dataset. Furthermore, the proposed method performed
well when compared to a flexible structure alignment method, FATCAT, and outperforms it in
processing speed (less than 45 s per database scan). This work also establishes a reliable
cut-off value for the demarcation of similar folds. Finally, it shows that global alignment
scores of unrelated structures using PBs follow an extreme value distribution. PB-ALIGN
is freely available on a web server called Protein Block Expert (PBE) at
http://bioinformatics.univ-reunion.fr/PBE/.
Proteins 2008; 71:920–937. © 2007 Wiley-Liss, Inc.
Key words: substitution matrix; protein blocks; local pro-
tein structure; structure mining; local alignment; global
alignment; structure comparison.
similarity, to study evolutionary relationships, and to gain a basic understanding of the
protein-folding problem.
Furthermore, comparison and alignment help in the organization and classification of known
proteins,6,7 the mining of similar known proteins for a newly solved structure,8–11 and the
identification of functionally important sequence patterns in homologous proteins,12,13 and
they provide a point of reference for sequence alignment methods.14–16 Structure comparison
and alignment is a challenging problem for a number of reasons. The first question is what
to compare. Once this decision is made, it is still very difficult to obtain the optimal
alignment, or to identify it among many alternatives, as there are many ways to align two
structures. Also, in the presence of high structural similarity, it is difficult to
determine whether it arises from evolutionary constraints or is merely an analogy due to
physical constraints on fold space.
Structure comparison methods differ in the level at which the structure is represented.
Some methods use an all-atom model but are limited to small substructures17; a more common
approach compares the backbones of two proteins using backbone atoms, for example, Cα
atoms, or internal distance matrices9,18,19 or internal angles10,20,21 estimated from
backbone atoms. Aligning structures from an initial alignment of secondary structure
elements (SSEs), followed by iterative refinement, is also a commonly used
approach.11,22,23 Graph theory-based SSE alignments provide another alternative.24,25
Methods that use backbone coordinates rely on the root mean square deviation between two
proteins, and the objective function has to minimize this value to identify structural
similarities. Such methods26–28 are useful for comparing two proteins or substructures, but
they are computationally expensive, making them too slow for mining similar structures from
a large database. Recently developed methods like FlexProt29 and FATCAT30 use a flexible
structure alignment approach, introducing twists between aligned fragment pairs to improve
the overall superposition and to overcome the limitations of rigid-body structure alignment
techniques.
Popular methods like DALI,9 SSAP,19 and CE18 use a reduced representation of the protein
backbone in terms of distance matrices. DALI uses hexapeptide distance matrices combined
with dynamic programming and a Monte Carlo optimization technique to obtain a global
alignment (GA). The combinatorial extension (CE) method combines aligned short structural
fragments into larger alignment paths and applies dynamic programming to generate a GA.
These two methods are the most commonly used for structure comparison and fast structure
mining, although the absence of a homolog in the database can sometimes increase the search
time considerably.
Most of these methods perform structure alignment based on SSEs or use them to obtain an
initial starting point. Protein structures can also be approximately described using
structural alphabets (SAs), which are recurring short structural motifs found across
protein 3D space (for a review, see Offmann et al.31). Many groups have identified such
recurring short motifs capable of describing the protein backbone,32–34 and these are
believed to be more informative in protein structure analysis.35
Use of SAs for structure comparison has been
attempted only in the last decade. 3D-Blast is an example
of one such recently developed approach, which uses a
23-state SA to describe the backbone.36 This method
uses BLAST as a search method using a SA substitution
matrix to find the longest common substructures with
high-scoring segment pairs. It uses the E-value of BLAST
as measure of statistical significance of an alignment and
generates results with performance comparable to known
methods.
Using a set of 16 pentapeptide structural motifs known as protein blocks (PBs),37,38 we
have introduced a new methodology for analyzing protein structures.39 Each of these 16
motifs is denoted by a letter (a, b, c, ..., p) and is described by a vector of eight
dihedral angles (φ, ψ), making it possible to represent a 3D protein structure as a 1D
sequence of PBs. Taking advantage of this reduced representation of protein structure as a
mere sequence of symbols, we recently derived a PB substitution matrix and investigated its
potential utility in protein structure analysis31,39 and in discovering functional local
structural motifs.40
We have developed a new structure comparison method (PB-ALIGN) useful for mining protein
structural databases. The approach is based on PB sequence alignment using the recently
derived PB substitution matrix.39 The basic premise is very simple: the protein backbone is
encoded as a sequence of characters representing PBs, and these PB sequences are then
aligned just like amino acid sequences using dynamic programming combined with a
substitution matrix. Because PBs capture local structural variations, aligning them gives a
more intuitive picture of the structurally similar regions in two proteins than an SSE
representation does. Structure alignment based on PB sequences is able to align not only
regular substructures but also the N and C cap regions. It also highlights structural
variations in the loops that connect regular secondary structures. Obtaining a structure
alignment from a PB sequence alignment is a very fast procedure and allows large databases
to be mined in real time.41
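As an illustration of the alignment step described above, the following is a minimal sketch of global (Needleman-Wunsch) alignment of two PB strings with a linear gap penalty. It is not the PB-ALIGN implementation: the function names, the `toy_score` scoring function (+2 match, −1 mismatch), and the example PB strings are assumptions made for illustration, and a real run would use the published PB substitution matrix instead.

```python
def toy_score(x, y):
    # Hypothetical stand-in for the PB substitution matrix: +2 match, -1 mismatch.
    return 2.0 if x == y else -1.0


def global_align(seq_a, seq_b, score, gap=-3.0):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (alignment score, aligned seq_a, aligned seq_b).
    """
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # (mis)match
                dp[i - 1][j] + gap,                                    # gap in seq_b
                dp[i][j - 1] + gap,                                    # gap in seq_a
            )
    # Traceback from the bottom-right corner, preferring the diagonal on ties.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, aligned_a, aligned_b = global_align("mmmmnopa", "mmmnopa", toy_score, gap=-3.0)
```

Because the two toy strings differ by a single PB, the best alignment pairs seven identical letters and opens one gap.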
In the present study, we provide a comprehensive evaluation of the methodology in
comparison with existing techniques. We also provide a more thorough analysis of the
efficiency of mining proteins from a large database using the PBE server. In addition, we
present optimal gap penalties for both local42 and global43 alignment techniques. Our
results show that PB-ALIGN provides
equivalent or better efficiency in mining structures from a large database when compared to
methods like DALI,9 CE,18 and FATCAT30 and is much faster than all of them. Our method
achieved an 85% success rate in extracting the true fold from pairwise alignment of 7259
against 7259 SCOP domains. In nontrivial cases, PB-ALIGN provides results comparable to
those of more robust and complex methods and also gave satisfactory results when handling
multidomain proteins. We addressed the question of the alignment score threshold for
deciding that two aligned structures correspond to the same fold. Finally, the statistical
characteristics of the distribution of GA scores were examined.
MATERIALS AND METHODS
Dataset used for evaluation of PB-ALIGN
In the present study, we used a battery of test datasets to assess the performance of
PB-ALIGN under different experimental conditions. A database of 7259 SCOP (v1.65) domains
filtered at 95% identity, as implemented in the PBE web server,8 was used to assess the
mining efficiency of the method. The distribution over the seven SCOP classes is as
follows: 1337 (18.5%) α domains, 2077 (28.6%) β domains, 1387 (19.0%) α/β domains,
1529 (21.0%) α + β domains, 700 (9.6%) small domains, 89 (1.2%) multidomains, and
140 (1.9%) membrane domains. PB-ALIGN was compared with 13 existing structure
mining/comparison methods on three different datasets. The general ability of the methods
to extract structurally similar proteins was tested on 61 query proteins belonging to ten
protein families, representing the four CATH main classes (mainly α, mainly β, mixed αβ,
and few secondary structures). The same dataset was used in two independent studies by
Novotny et al.3 and Carpentier et al.10 The ability of the methods to handle multidomain
proteins was evaluated with two multidomain queries selected by the same groups. Fourteen
nontrivial query-target pairs were taken from the study by Carpentier et al.10 to test the
robustness of the method in detecting difficult structural similarities. Furthermore, we
compared PB-ALIGN with the flexible structure alignment program FATCAT, based on pairwise
alignment of 10 difficult pairs used by Ye et al.30
Encoding 3D structures into PB sequence
The local backbone features of a protein can be represented by 16 prototypes of
five-residue-long motifs called PBs.38 Each PB is characterized by a vector of eight
dihedral (φ, ψ) angles associated with five consecutive Cα atoms, and the 16 PBs are
denoted by the letters a to p. Encoding a protein backbone into a PB sequence is a two-step
process: (i) the coordinates of the backbone atoms are used to calculate the sequence of
(φ, ψ) angles; (ii) an overlapping window of eight (φ, ψ) angles (corresponding to five
Cα residues) is moved along the backbone. The PB for each window is assigned on the basis
of the smallest dissimilarity measure, the root mean square deviation on angular values or
rmsda,44 calculated between the observed (φ, ψ) values in the window and the standard
dihedral angles of each PB. By following this simple procedure, the 3D structure of a
protein can be encoded into a 1D sequence of PBs representing local structural information
as a sequence of SA letters.
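The two-step encoding above can be sketched as follows. This is an illustrative sketch only: the prototype table `TOY_PROTOTYPES` holds two placeholder vectors (an idealized helix-like and strand-like conformation), not the 16 published PB prototypes, and the window ordering ψ(i−2), φ(i−1), ψ(i−1), φ(i), ψ(i), φ(i+1), ψ(i+1), φ(i+2) is stated here as an assumption about how the eight angles are laid out. Angle differences are wrapped to [−180°, 180°) before squaring.

```python
import math

# Placeholder prototypes (degrees); NOT the published PB angle values.
TOY_PROTOTYPES = {
    "m": [-45.0, -60.0, -45.0, -60.0, -45.0, -60.0, -45.0, -60.0],      # helix-like
    "d": [135.0, -120.0, 135.0, -120.0, 135.0, -120.0, 135.0, -120.0],  # strand-like
}


def angle_diff(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def rmsda(window, prototype):
    """Root mean square deviation on angular values between two 8-angle vectors."""
    total = sum(angle_diff(x, y) ** 2 for x, y in zip(window, prototype))
    return math.sqrt(total / len(prototype))


def encode_pbs(phi, psi, prototypes):
    """Assign one PB letter to each residue that has a full five-residue window.

    phi, psi: per-residue dihedral angles in degrees (same length).
    Returns a string of length len(phi) - 4.
    """
    letters = []
    for i in range(2, len(phi) - 2):
        window = [psi[i - 2], phi[i - 1], psi[i - 1], phi[i],
                  psi[i], phi[i + 1], psi[i + 1], phi[i + 2]]
        # Pick the prototype with the smallest rmsda to the observed window.
        best = min(prototypes, key=lambda pb: rmsda(window, prototypes[pb]))
        letters.append(best)
    return "".join(letters)


helix = encode_pbs([-60.0] * 6, [-45.0] * 6, TOY_PROTOTYPES)
strand = encode_pbs([-120.0] * 6, [135.0] * 6, TOY_PROTOTYPES)
```

With six residues, only the two central positions have a complete window, so each toy backbone yields a two-letter PB string.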
PB substitution matrix
A 16 × 16 PB substitution matrix has recently been derived by our group.45 The
substitution scores between PBs were evaluated by counting the number of substitutions
occurring in conserved regions of structurally aligned homologous proteins. These proteins
were selected from a large database, PALI,46,47 which contains structure-based pairwise and
multiple alignments of homologous proteins of known three-dimensional structure. The
database uses a rigid-body superposition program, STAMP,48 to generate structure-based
alignments. In total, 21,503 pairwise alignments from 1197 SCOP families were analyzed,
which accounted for more than 2,000,000 PB substitutions. The raw frequencies were
normalized and expressed as log-odds scores. The resulting scores indicate how strongly a
PB in a protein tends to be retained or substituted and allow equivalence between
homologous structures to be evaluated. The matrix has been validated in our previous
studies and has been shown to be useful in identifying structurally equivalent regions in
two proteins. In addition, the matrix has potential applications in differentiating between
conformational differences and rigid-body shifts among homologous protein structures.45
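To make the log-odds step concrete, the sketch below turns a list of aligned PB pairs into scores of the form log2(observed pair frequency / frequency expected by chance). The exact normalization used for the published matrix is not given in this section, so the symmetrization, the base-2 logarithm, and the function name `log_odds_matrix` are assumptions; the toy pair list is invented for illustration.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Compute symmetric log-odds scores from aligned (PB, PB) pairs.

    Pairs never observed get no entry (their log-odds would be undefined).
    """
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Count each substitution in both directions so the matrix is symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())

    # Background frequency of each PB, taken from the same aligned positions.
    single = Counter()
    for (a, _), count in pair_counts.items():
        single[a] += count

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total
        expected = (single[a] / total) * (single[b] / total)
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Toy data: PBs 'm' and 'd' are mostly conserved and rarely substituted.
toy_pairs = [("m", "m")] * 8 + [("d", "d")] * 8 + [("m", "d")] * 2
toy_scores = log_odds_matrix(toy_pairs)
```

In this toy example the conserved pairs receive positive scores and the rare m/d substitution a negative one, which is the behaviour a substitution matrix needs for dynamic programming.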
Gap penalty optimization
In our previous study, we selected an arbitrary gap penalty of −0.5 after manual inspection
of PB alignments. Here, we follow a more extensive procedure to suggest optimal gap
penalties. The optimization is based on two criteria: the effect of the gap penalty on the
overall efficiency of mining structurally similar proteins, and the quality of the
alignments generated. Structure mining efficiency is measured by counting the number of
times a true hit at the class, fold, super-family, and family level is obtained when the
Top 10, Top 5, and first-ranking alignments are considered for a given query. Alignment
quality is measured by the rmsd value obtained from superimposition of protein pairs based
on the PB alignment. Superimposition is performed using the ProFit49 software, with
equivalent zones specified by the PB alignment.
We performed a comprehensive study to suggest opti-
mal gap penalty for both local alignment (LA) and GA
algorithms using 2000 randomly sampled domains. A database of 2000 × 2000 pairwise PB
alignments was generated to perform the two analyses above. Care was taken to keep the
relative proportions of the seven major classes similar to those in the original databank.
A jackknife approach was used to measure mining efficiency, and alignment quality was
measured by considering only pairs belonging to the same family. For GA of PB sequences, we
used the following set of gap penalties: −0.5, −2.0, −2.5, −3.0, and −5.0; the optimal gap
penalty for the LA algorithm was selected from the following set: −0.5, −2.0, −3.0, −5.0,
and −7.0.
RESULTS AND DISCUSSION
Effect of gap penalty on mining similar protein structures
Using 2000 randomly sampled domains, we assessed
efficiency of both local and GA techniques to extract
structurally similar proteins at class, fold, super-family,
and family levels for a given gap penalty. For a given
query, hits are calculated by considering Top 10, 5, and
first ranking alignments. In the following analysis, we
present results from Top 10 ranking alignments.
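The Top-k hit counting used throughout this section can be sketched as below. The data layout (a nested dictionary of pairwise scores and a dictionary of SCOP labels) and the function name `top_k_hit_rate` are assumptions made for illustration; the query itself is excluded from its own ranking, in the spirit of the leave-out evaluation.

```python
def top_k_hit_rate(scores, labels, k=10):
    """Percentage of queries with at least one same-label target in the top k.

    scores[q][t]: alignment score of query q against target t (higher is better).
    labels[x]:    label of domain x at the level being tested (class, fold, ...).
    """
    hits = 0
    for query, row in scores.items():
        # Leave the query out, then rank the remaining targets by score.
        ranked = sorted((t for t in row if t != query),
                        key=lambda t: row[t], reverse=True)
        if any(labels[t] == labels[query] for t in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy example: d1 and d2 share a fold, d3 is alone in its fold.
toy_scores = {
    "d1": {"d1": 9.0, "d2": 5.0, "d3": 1.0},
    "d2": {"d1": 5.0, "d2": 9.0, "d3": 2.0},
    "d3": {"d1": 1.0, "d2": 2.0, "d3": 9.0},
}
toy_folds = {"d1": "a.1", "d2": "a.1", "d3": "b.1"}
rate_top1 = top_k_hit_rate(toy_scores, toy_folds, k=1)
```

The single-member fold of `d3` can never produce a true hit, which mirrors the failure mode discussed later for single-member folds and families.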
Table I reports the efficiency rate for mining proteins at the class, fold, super-family,
and family level considering the Top 10 ranking alignments from the GA algorithm. Bold
values indicate the best efficiency rate achieved at each level. Varying the gap penalty
from −0.5 to −3.0 had a negligible effect on the efficiency of protein extraction: no more
than 0.6% change in efficiency rate was seen in this range of gap penalties. A further
increase reduces the efficiency of the method by almost 3%, as illustrated by the low
success rate achieved at a penalty of −5.0. Among the penalties used in this analysis, a
gap penalty of −2.0 seems to give the best results, although its performance was not much
higher than that of the other penalties.
Table II shows the success rate of mining similar proteins using LA at the class, fold,
super-family, and family level when the Top 10 ranking alignments are taken into account.
Because of the basic nature of the algorithm, we suspected that a higher gap penalty would
yield better results and that the efficiency rate could be inferior to that of GA. The two
tables confirm both assumptions. The efficiency rate increased by almost 10% at the fold
level when the penalty was changed from −0.5 to −2.0, but the overall success rate is
slightly lower than with GA. One reason for the lower efficiency compared to GA may be the
dataset used in our analysis. Since we used well-defined domains as queries against a
database of well-defined domains, GA has an advantage here due to the basic nature of the
algorithm. This advantage can become a limiting factor for GA when complete protein chains
are used as queries without any knowledge of domain boundaries. Indeed, this is further
documented in the following sections, where LA outperforms GA in a real-case scenario and
is able to extract true domains with high scores from the database, whereas GA fails
because of the larger number of gaps introduced in the alignment. On varying the penalty
from −2.0 to −7.0, the variation in efficiency rate is very moderate, although the best
results are obtained at a gap penalty of −3.0 or −5.0.
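The contrast between local and global alignment discussed above can be illustrated with a score-only Smith-Waterman sketch. As before, `toy_score` (+2 match, −1 mismatch) is a hypothetical stand-in for the PB substitution matrix, and the PB strings are invented. Local alignment resets negative cells to zero, so a domain embedded in a longer chain can score highly without paying gap penalties for the flanking regions.

```python
def toy_score(x, y):
    # Hypothetical stand-in for the PB substitution matrix: +2 match, -1 mismatch.
    return 2.0 if x == y else -1.0


def local_align_score(seq_a, seq_b, score, gap=-5.0):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    best = 0.0
    prev = [0.0] * (len(seq_b) + 1)  # previous row of the DP matrix
    for ca in seq_a:
        curr = [0.0]
        for j, cb in enumerate(seq_b, start=1):
            cell = max(
                0.0,                           # restart the alignment here
                prev[j - 1] + score(ca, cb),   # extend with a (mis)match
                prev[j] + gap,                 # gap in seq_b
                curr[j - 1] + gap,             # gap in seq_a
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# A seven-PB "domain" embedded in a longer chain is recovered in full.
embedded = local_align_score("ddddmmmmnopa", "mmmmnop", toy_score, gap=-5.0)
unrelated = local_align_score("aaaa", "pppp", toy_score, gap=-5.0)
```

Only the score is tracked here to keep the sketch short; recovering the aligned region would require storing the full matrix and tracing back from the best cell.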
The similar efficiency rates achieved with neighboring gap penalties, in both the LA and GA
techniques, indicate that variation in gap penalty increases or decreases the success rate
only to a certain extent. No single gap penalty can hence be considered clearly optimal.
Manual inspection of the alignments obtained with various penalties indicated that, even
though the mining rate is roughly constant, the gap penalty can have a greater impact on
the quality of the alignment produced. Based on this assumption, we further studied the
relationship between gap penalty and alignment quality in the following section.
Effect of gap penalty on structural alignment quality
In this analysis, we performed a very simple exercise whereby, for each gap penalty, we
generated both local and global alignments between pairs of homologous structures belonging
to the same family. To assess the quality of PB-based alignments, we used rmsd values from
the superimposition of aligned residues. Each PB alignment was
Table I. Optimization of Global Gap Penalty

Level/gap penalty   −0.5    −1.0    −2.0    −2.5    −3.0    −5.0
Class               98.1    97.9    97.9    97.95   98.05   97.9
Fold                66.35   66.55   66.9    66.5    66.25   63.85
Super-family        61.5    61.65   61.65   61.35   61.05   58.7
Family              55.5    55.95   56.6    56.6    56.2    54.45

Effect of gap penalty on mining rate at class, fold, super-family, and family level
for global alignment. The results are from top 10 ranking alignments. Analysis
was performed on 2000 randomly selected SCOP domains.
Table II. Optimization of Local Gap Penalty

Level/gap penalty   −0.5    −2.0    −3.0    −5.0    −7.0
Class               89.9    93.3    93.9    94.2    94.25
Fold                50.55   60.55   62.9    62.75   61.35
Super-family        49.15   58.15   60      60      59.05
Family              44.45   52.95   54.1    53.95   52.8

Effect of penalty on mining rate at class, fold, super-family, and family level.
Results are from top 10 ranking alignments. Analysis was performed on 2000
randomly selected SCOP domains.
converted into the corresponding amino acid (AA) alignment and presented to the ProFit
software, which then performed a least-squares fit of the backbones based on the AA
alignment. A list of rmsd values for every pair was compiled at the different gap
penalties. To compare the overall effect of the gap penalty on rmsd values, we plotted the
average improvement in rmsd at each gap penalty with respect to the rmsd values obtained at
a penalty of −0.5. In other words, we highlight the change (decrease) in the rmsd values of
the various pairs at different gap penalties relative to the values obtained at a gap
penalty of −0.5. Figure 1(a) shows the improvement in average rmsd values at different gap
penalties (−2.0, −2.5, −3.0, and −5.0) for the GA algorithm. The figure makes it clear that
a more negative gap penalty shifts more pairs toward lower rmsd values. For example, on
varying the gap penalty from −0.5 to −3.0, almost 18% of alignment pairs have rmsd values
that are lower by 0.5 to 1 Å, and gap penalties of −3.0 and −5.0 brought an improvement of
1 Å or more to almost 44% and 47% of homologous pairs, respectively (data not shown).
Figure 1(b) shows a similar improvement in LA quality on varying the gap penalty from −0.5
to −7.0. Once again, an overall improvement in rmsd values is seen at the various gap
penalties. The most fruitful penalties were −5.0 and −7.0, where an average improvement of
more than 2 Å is observed.
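The quantity plotted in Figure 1, the mean decrease in rmsd relative to the baseline penalty of −0.5, can be computed as in the sketch below. The dictionary layout, the function name `mean_rmsd_improvement`, and the toy rmsd values are assumptions for illustration; the actual rmsd values come from ProFit superimpositions.

```python
def mean_rmsd_improvement(rmsd_by_penalty, baseline=-0.5):
    """Mean decrease in rmsd (in Å) at each gap penalty relative to the baseline.

    rmsd_by_penalty[penalty][pair] = rmsd of that homologous pair at that penalty.
    Positive values mean the alignment quality improved (rmsd went down).
    """
    base = rmsd_by_penalty[baseline]
    improvement = {}
    for penalty, values in rmsd_by_penalty.items():
        if penalty == baseline:
            continue
        # Compare only pairs superimposed at both penalties.
        diffs = [base[pair] - values[pair] for pair in base if pair in values]
        improvement[penalty] = sum(diffs) / len(diffs)
    return improvement


# Invented rmsd values for two homologous pairs at two gap penalties.
toy_rmsd = {
    -0.5: {"pair1": 4.0, "pair2": 6.0},
    -3.0: {"pair1": 3.0, "pair2": 4.0},
}
toy_improvement = mean_rmsd_improvement(toy_rmsd)
```

Here the two toy pairs improve by 1 Å and 2 Å, giving a mean improvement of 1.5 Å at a penalty of −3.0.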
Based on the efficiency of mining similar proteins and the improvement in alignment quality
at the various penalty levels, −3.0 and −5.0 were found to be the optimal gap penalties for
the GA and LA algorithms, respectively. For GA, a penalty of −2.0 yielded the best results
for extraction of similar proteins, but a gap penalty of −3.0 was the optimal value in
terms of overall alignment quality and mining rate. For the LA algorithm, even though a gap
penalty of −7.0 gave better rmsd values, its extraction rate was inferior to that of −5.0
by almost 1%, and hence −5.0 was chosen as the optimal penalty, giving balanced results for
both alignment quality and mining of proteins.
Mining of protein structures
The efficiency of the PB-ALIGN method in extracting structurally similar proteins at
different SCOP classification levels was tested in the following study. We analyzed 7259
domains selected from the SCOP databank filtered at 95% identity and performed
all-against-all pairwise GA of PB sequences with an optimized gap penalty of −3.0. A class
confusion matrix was generated from the 7259 × 7259 pairwise alignments to assess the
discriminatory power of simple PB alignments in assigning the correct SCOP class. The
method was evaluated by counting whether a true class, fold, super-family, or family member
is present within the Top 10 hits, ranked by normalized score. It is noteworthy that the
performance of PB-ALIGN was evaluated in a jack-knife/leave-out approach, where each query
domain was removed from the database prior to testing. This corresponds
Figure 1. Effect of gap penalty on (a) global alignment and (b) local alignment. The figure
gives the mean improvement (decrease) in rmsd value (Y axis) at different negative gap
penalties (X axis) with respect to the rmsd values at a gap penalty of −0.5. As shown, with
an increase in negative penalty, there is an improvement in superimposed rmsd values
compared to the values obtained at a penalty of −0.5. In the case of local alignment (b),
there is a large improvement in alignment quality as the negative gap penalty is increased.
Even though −7.0 gives a better mean improvement in rmsd, −5.0 was chosen as the desired
penalty as a balance between alignment quality and mining efficiency.
to real-life situations in which one has to query structural databases to mine similar
folds. This is also the standard most widely used for evaluating the performance of
structure mining methods.3,4,9,10,50 However, to evaluate the efficiency of PB-ALIGN in
assigning higher classification levels and capturing remote homology, we also evaluated the
situation where all members of the same family as the query were removed. The dataset size
therefore changed dynamically depending on the query protein.
Table III summarizes the results obtained for each level, namely class, fold, super-family,
and family, at three different rank cut-offs: Top 1, Top 5, and Top 10. The true class of a
query protein is found with an efficiency rate between 92.5% and 99.1% when the first 10
ranked alignments are considered. The probability of finding the true fold among the Top 10
hits depended strongly on the setting, with a hit rate of 62.6% when the whole family is
jack-knifed and 87.4% when only the query is left out; similar performance was observed at
the super-family level. Taking into account only the first hit (Top 1 column in Table III),
we found between 76.1% and 93.1% success in finding the true class. Among these
first-ranked hits, 81.3% were from the same fold when only the query was jack-knifed, but
only 47.4% when the whole family was removed. Similar performance was obtained at the
super-family level (53%–79%). Finally, the true family of a protein is found in 80% of
cases among the Top 10 hits.
Our results further show that, when a higher level is properly identified, the chance of
identifying the subsequent lower level is still very good. The biggest decrease in
prediction efficiency is observed between the class and fold levels. However, it is
noteworthy that, because we look only at the Top 10 or Top 1 for performance evaluation, in
some instances the correct hit is found only at a lower rank. For example, for the query
d1g73a_ from SCOP family a.7.4.1, it is found at the 69th rank because the top scores were
populated with redundant hits from homologous members of other folds (e.g., 23 hits from
the a.1.1.2 family, 14 hits from the a.1.1.3 family, etc.).
These overall results indicate that mining similar structures using the simple PB alignment
methodology performs reasonably well in identifying class and super-family relationships
despite the confusion across fold alignments. Indeed, when the whole family was left out,
the structural relationship at the fold level was more difficult to capture using PB-ALIGN
than at other structural levels. This highlights the importance of the quality of
information and its coverage in the databases used for mining structures. However, in
real-life situations, users of structure mining tools such as PB-ALIGN generally expect
their structures to be compared to the whole PDB or to representatives from all families.
Hence, the overall result presented in Table III is useful for assessing how the method
performs when a query has a counterpart from the same family in the PB-ALIGN database. It
also shows that the method is able to capture higher-level relationships (super-family or
fold) to some extent when the query has no homologs in the database. These encouraging
results demonstrate the potential usefulness of the method in projects like structural
genomics.
What Table III does not tell us is how much confusion exists between SCOP classes due to
the reduced representation of 3D structure in PB alignment. To address this question, a
class confusion matrix was generated from the 7259 × 7259 pairwise alignments, as shown in
Table IV. It is important to analyze this confusion because, with a 1D PB representation,
some topological information may be lost, which becomes crucial for proteins sharing a
similar succession of SSEs. Again, both situations, where only the query or the whole
family related to the query is removed from the database, were analyzed.
As shown in Table IV, the β class was the most efficient (89.3%–96.5%) at identifying
itself, closely followed by the α and α/β classes, which have efficiency rates of
83.7%–95% and 83.8%–95.7%, respectively. The α + β class was more often confused with
other classes, with an 88.6% success rate when only the query was left out and a 56.4% rate
when the family was jack-knifed. Almost half of the false hits from α + β are confused
with the α/β class. Overlap between α/β
Table III. Efficiency Rate of Mining Proteins at Various SCOP Classification Levels

               Only query domain is removed from database    Whole family related to query is removed from database
SCOP level     Top 10        Top 5         Top 1             Top 10        Top 5         Top 1
Class          99.1 (7194)   96.8 (7028)   93.1 (6758)       92.5 (6716)   88 (6394)     76.1 (5529)
Fold           87.4 (6343)   85.6 (6217)   81.3 (5906)       62.6 (4548)   57.5 (4178)   47.4 (3438)
Super-family   84.3 (6122)   82.8 (6011)   79.0 (5739)       65.1 (4727)   60.8 (4412)   53.0 (3846)
Family         80.0 (5809)   78.7 (5714)   75.1 (5453)       n/a           n/a           n/a

Results are reported for top 10, 5, and 1st ranking alignments. Two situations were
distinguished: one where only the query is removed from the database and another where the
whole family was removed from the database. Values are given as percentage.
n/a: not applicable.
and α + β is understandable given that both have successions of helical and sheet regions.
Performance in identifying small proteins was equivalent to that for the α + β class, with
rates of 65.3%–89.8%. The other two classes show very contrasting results depending on the
jack-knife procedure; when the whole family is removed, the probability of getting the true
class as the Top 1 hit drops from 72.8%–78.6% to 20.2%–37.8% for membrane and multidomain
proteins. Multidomain proteins are mostly confused with the α/β class, while membrane
proteins, as expected, are mostly confused with the α class.
Computation of the class recognition matrix, that is, the confusion matrix, using a large
number of domains highlights the efficiency of PB alignment. The decent efficiency rate
indicates that the reduced complexity of 3D space and the absence of topological
information in the PB representation have not affected the discriminatory power of PB
alignment. This efficiency level can be attributed to the combinations of PBs connecting
similar SSEs in different topologies (see below).
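A class confusion matrix of the kind discussed above can be tabulated from the top-ranked hit of each query, as in the following sketch. The input layout (one top hit per query plus a class lookup), the function name `class_confusion`, and the toy identifiers are assumptions for illustration; each row is normalized to percentages so that the diagonal gives the rate at which a class identifies itself.

```python
from collections import defaultdict


def class_confusion(top_hits, scop_class):
    """Row-normalized confusion matrix (in %) from top-ranked hits.

    top_hits[query] = identifier of the best-scoring target for that query.
    scop_class[x]   = SCOP class label of domain x.
    Returns matrix[true_class][assigned_class] = percentage of queries.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for query, hit in top_hits.items():
        counts[scop_class[query]][scop_class[hit]] += 1

    matrix = {}
    for true_cls, row in counts.items():
        total = sum(row.values())
        matrix[true_cls] = {assigned: 100.0 * n / total for assigned, n in row.items()}
    return matrix


# Toy example with hypothetical domain identifiers and class labels.
toy_hits = {"q1": "t1", "q2": "t2", "q3": "t3", "q4": "t1"}
toy_classes = {
    "q1": "alpha", "q2": "alpha", "q3": "beta", "q4": "beta",
    "t1": "alpha", "t2": "alpha/beta", "t3": "beta",
}
toy_matrix = class_confusion(toy_hits, toy_classes)
```

Off-diagonal entries such as `toy_matrix["alpha"]["alpha/beta"]` show where one class is mistaken for another.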
Efficiency rate within SCOP classes
Each class was studied separately to quantify how the success rate at the fold,
super-family, and family levels was distributed within each class. Table V gives the
success rates for the seven major SCOP classes, namely α, β, α/β, α + β, multidomain,
membrane, and small proteins, when the whole family related to the query was removed or
when only the query was jack-knifed.
Success of finding true fold among Top 10 hits was
best for b and ab class with an efficiency of 71.1%–93%
and 64.9%–92.2%, respectively, followed by membrane
(87.8%–90.7%), small (66.7%–89.4%), a 1 b (66.5%–
87.5%), and a (62.3%–86.7%) class proteins. Similar
trend was followed at super-family and at family level.
Looking at hits that ranked first, b and ab classes
achieved 53.8%–88.7% and 48.9%–87.8% of success,
respectively, in finding true fold compared to a class
where 45.2%–77.1% efficiency was reached. Presence of
long helical regions in a proteins can be one of the rea-
sons for more confusion among various folds in a class
(as illustrated in the following discussion). Interestingly,
distribution of success rates at three SCOP levels very
well indicate that true hits at fold and super-family levels
are not only populated by family members. These results
are on same line as observed in Table III. Analysis of
above results indicate that PB alignment is able to locate
structure similarities even at very low sequence similar-
ities (super-family relationships), and thus our method
can be used to detect remote homologs for a given
protein.
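The Top 10, Top 5, and Top 1 success rates discussed above can be computed along the following lines. This is a sketch under stated assumptions: the domain names, fold labels, and ranked hit lists are hypothetical, and `top_k_success` is an illustrative helper rather than a function of PB-ALIGN.

```python
def top_k_success(ranked_hits, is_true_hit, k):
    """Percentage of queries with at least one true hit among their first k hits."""
    found = sum(
        1
        for query, hits in ranked_hits.items()
        if any(is_true_hit(query, hit) for hit in hits[:k])
    )
    return 100.0 * found / len(ranked_hits)


# Hypothetical fold labels and ranked hit lists (best hit first).
FOLD = {"q1": "b.40", "q2": "a.144", "h1": "b.40", "h2": "a.144", "h3": "c.1"}
RANKED = {"q1": ["h3", "h1", "h2"], "q2": ["h3", "h1", "h2"]}


def same_fold(query, hit):
    """A hit is a true hit at fold level when both domains share the fold label."""
    return FOLD[query] == FOLD[hit]
```

Swapping `same_fold` for a comparison on super-family or family labels gives the corresponding rows of Table V; removing a whole family from the database amounts to filtering the ranked lists before calling the function.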
A closer look at the failures, where the query protein did not find
its true fold, super-family, or family within the Top 10 ranks, gave
insight into the current limitations of structure mining. The
presence of single-member folds, super-families, or families in the
database was the most common cause of a missing true hit. In some
instances, diversity within a family, both in length and in
structural features, makes it difficult for the GA algorithm to
extract a true member. For example, the ABC transporter protein from
Sulfolobus solfataricus (SCOP domain d1oxsc1) from the MOP-like
super-family (SCOP code b.40.6.3) is almost 40 residues longer than
the rest of the members. In this case, the GA algorithm has to
introduce a large number of gaps to accommodate the shorter proteins
from the same family, resulting in low alignment scores and hence a
low rank. Application of LA provides an alternative solution in this
Table IV. Class Confusion Matrix

True class  α             β             α/β           α+β           MultiDom^a    Membrane      Small         Total
α           1271 (95.0%)  1             12            12            0             35            6             1337
            1120 (83.7%)  5             47            53            4             88            20
β           2             2005 (96.5%)  10            36            0             3             21            2077
            3             1855 (89.3%)  20            120           2             18            59
α/β         7             8             1328 (95.7%)  39            3             0             2             1387
            31            20            1163 (83.8%)  145           20            3             5
α+β         34            40            77            1356 (88.6%)  4             3             15            1529
            99            189           301           863 (56.4%)   22            4             50
MultiDom    3             2             11            2             70 (78.6%)    0             1             89
            6             5             50            9             18 (20.2%)    0             1
Membrane    29            6             0             2             0             102 (72.8%)   1             140
            64            17            0             4             0             53 (37.8%)    2
Small       23            24            3             20            0             1             629 (89.8%)   700
            53            115           5             69            0             1             457 (65.3%)
Total                                                                                                         7259

^a MultiDom corresponds to multidomain protein class.
Matrix gives the efficiency of the method to find true class at first rank and the confusion rate between SCOP classes. Results were generated from 7259 × 7259 pairwise
PB alignments. True classes are featured horizontally and predicted classes vertically. Two situations were distinguished within each class; one where only the query is
removed from the database (top line) and another where the whole family was removed from the database (bottom shadowed line).
case and allowed at least one member protein to be extracted among
the top hits. The human hyperplastic discs protein (SCOP domain
d1i2ta_) from the PABC (PABP) domain family (SCOP code a.144.1.1) is
another such example, where LA was able to extract a true hit among
the top ranks and GA was unsuccessful. Note that the LA approach does
not succeed in all such cases. For example, for the Haloarcula
marismortui protein (SCOP domain d1jj2s) from the Ribosomal proteins
L24p and L21e family (SCOP code b.34.5.1), diversity in both
structural features and protein length contributes to the
unsuccessful results. Such examples are challenging and provide an
opportunity to refine and improve our approach.
Furthermore, the PB alignment technique had some problems with pairs
of proteins sharing long stretches of regular secondary structure,
for example, long helices in α proteins. In a PB sequence, these
regions are represented as long stretches of PB m; alignment of such
regions artificially contributes to the global score and places such
pairs at a high rank. This is well illustrated by the pairwise
alignment of ribosomal protein L12 from Thermotoga maritima (SCOP
domain d1dd3a1) and the ROP protein from E. coli (SCOP domain
d1b6q__).
Figure 2 shows a good alignment of the helical regions (sequences of
PB m), which is the major reason for the high alignment score. A
closer look at the alignment indicates the presence of an extra loop
in domain d1dd3a1, which is absent in the other protein. Detection of
this extra loop in one protein hints at a difference in the relative
orientations of the helices in the two proteins, even though the
alignment score is high, and this difference is evident from Figure
2. This example highlights that the alignment score can sometimes be
misleading because of the high content of regular structures in the
two proteins, but a closer look at the PB alignment can give clues to
the structural differences. In such cases, the disadvantage of the 1D
representation can be overcome by manual inspection to detect local
variations present between regular structures.
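One simple way to flag pairs that deserve such manual inspection is to measure how much of each PB sequence is taken up by PB m. The sketch below is only an illustration: the PB string is hypothetical (it is not the encoding of d1dd3a1 or d1b6q__), and the helper names are assumptions.

```python
from itertools import groupby


def longest_run(pb_sequence, letter="m"):
    """Length of the longest uninterrupted stretch of `letter` in a PB string."""
    runs = (len(list(group)) for key, group in groupby(pb_sequence) if key == letter)
    return max(runs, default=0)


def fraction_of(pb_sequence, letter="m"):
    """Share of positions occupied by `letter` (0.0 for an empty sequence)."""
    return pb_sequence.count(letter) / len(pb_sequence) if pb_sequence else 0.0


# Hypothetical PB string: two helices (runs of PB m) joined by short loops.
helical = "fkl" + "m" * 12 + "nop" + "m" * 8
```

A pair in which both sequences show a high `fraction_of` value and long runs is a candidate for the kind of score inflation described above; any numerical threshold for "high" would have to be chosen empirically.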
Comparison with existing methods
Performance of PB-ALIGN has been tested against 13
structure comparison methods (Table VI). We applied
Table V. Efficiency Rate of Mining Similar Structure Proteins Within Each SCOP Class

                Only query domain was                    Whole family related to query
                removed from database                    was removed from database
                Top 10       Top 5        Top 1          Top 10       Top 5        Top 1
α (1337)
  Fold          86.7 (1160)  84.3 (1128)  77.1 (1031)    62.3 (833)   56.0 (749)   45.2 (604)
  Super-family  83.2 (1113)  81.4 (1089)  74.4 (995)     69.8 (933)   64.4 (862)   56.6 (757)
  Family        77.4 (1035)  75.0 (1002)  67.0 (922)     n/a          n/a          n/a
β (2077)
  Fold          93.0 (1931)  91.8 (1907)  88.7 (1842)    71.1 (1478)  66.3 (1378)  53.8 (1118)
  Super-family  90.0 (1869)  88.8 (1844)  85.8 (1782)    70.5 (1464)  65.1 (1353)  53.3 (1108)
  Family        87.2 (1812)  86.0 (1786)  83.1 (1726)    n/a          n/a          n/a
α/β (1387)
  Fold          92.2 (1279)  91.2 (1265)  87.8 (1218)    64.9 (901)   58.5 (812)   48.9 (679)
  Super-family  90.0 (1248)  89.0 (1234)  86.1 (1195)    63.4 (880)   57.0 (791)   48.8 (678)
  Family        83.3 (1156)  82.6 (1146)  80.0 (1110)    n/a          n/a          n/a
α+β (1529)
  Fold          87.5 (1338)  85.7 (1310)  81.9 (1253)    66.5 (1017)  59.9 (916)   51.5 (788)
  Super-family  84.3 (1290)  82.6 (1264)  79.2 (1211)    69.8 (1068)  65.0 (995)   59.7 (913)
  Family        79.8 (1220)  78.3 (1198)  74.9 (1145)    n/a          n/a          n/a
Small (700)
  Fold          89.4 (626)   85.8 (601)   73.8 (517)     66.7 (467)   61.7 (432)   47.1 (330)
  Super-family  84.7 (593)   80.8 (566)   70.6 (494)     61.5 (431)   58.2 (408)   49.0 (343)
  Family        79.4 (556)   76.8 (538)   66.4 (465)     n/a          n/a          n/a
MultiDom (89)
  Fold          86.5 (77)    86.5 (77)    85.3 (76)      66.3 (59)    66.3 (59)    64.0 (57)
  Super-family  86.5 (77)    86.5 (77)    86.5 (76)      66.3 (59)    66.3 (59)    64.0 (57)
  Family        76.4 (68)    76.4 (68)    76.4 (68)      n/a          n/a          n/a
Membrane (140)
  Fold          90.7 (127)   87.8 (123)   80 (112)       87.8 (123)   85.0 (119)   67.8 (95)
  Super-family  75.7 (106)   73.5 (103)   69.3 (97)      88.5 (124)   88.5 (124)   85.7 (120)
  Family        73.5 (103)   71.4 (100)   67.8 (95)      n/a          n/a          n/a

Efficiency is calculated at three different ranks: top 10 hits, top 5 hits, and 1st hit. Figures within brackets show the total number of queries taken into account or the
number of true hits, e.g., Small (700) means that in total we did this exercise for 700 protein domains. Efficiency presented is in percentage, i.e., true hits/(true hits + false hits).
n/a: not applicable.
a battery of tests to PB-ALIGN to obtain a comprehensive comparison
with existing methods, covering the general efficiency of the method
in extracting related proteins, its ability to identify difficult
structural similarities in database searches, its handling of
multidomain proteins, and a comparison with a flexible structure
alignment method, FATCAT. Evaluation results for the existing methods
are taken from recent studies by Novotny et al.3 and Carpentier et
al.,10 where 12 structure comparison methods were evaluated. The
comparison with flexible structure alignment is based on results
produced by Ye et al.,30 where 10 difficult pair alignments are
compared between VAST, DALI, CE, and FATCAT.
The dataset of 61 queries used to compare the general ability to
extract related proteins, and the two multidomain protein queries,
are taken from the study by Novotny et al. We followed the same
evaluation procedure as Novotny et al. and Carpentier et al., with
one basic difference: both studies followed the CATH6 classification
for the test dataset, whereas we used the SCOP2 classification for
the evaluation procedure. This was done because PB-ALIGN uses a
database of SCOP domains filtered at 95% identity, and the two
classification schemes often differ. A hit is counted as a true hit
if it belongs to the same SCOP super-family or family. It should be
noted that some, but not all, of the methods involved in this
comparative study use a significance threshold on the scores to
discriminate the same fold from different folds. Our method, however,
is not initially based on the definition of a threshold measure and
considers the top hits in every mining exercise to evaluate the
relative performance of methods. Nonetheless, measures have been
developed to assess the significance of the alignments (see the
section on the cut-off threshold below for details). Thus, we applied
protocols similar to those of Novotny et al., who used a few hits to
compare those methods that did not return the significance of hits.3
From our initial tests, we found that when the complete protein chain
is used as the query, the LA algorithm gives far better results than
the GA technique. This follows from the fact that protein chains are
often longer than the actual domain boundaries, and using GA in such
cases increases the number of gaps in the alignment. To avoid these
pitfalls, we used LA for our analysis.
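The effect of chain overhang on the two alignment modes can be illustrated with a small dynamic programming sketch. It is not the PB-ALIGN implementation: the substitution score below is a toy stand-in for the PB substitution matrix, the linear gap penalty of 3.0 per gapped position is only illustrative, and both PB strings are hypothetical.

```python
def alignment_score(query, target, score, gap, local):
    """Best score of a global (Needleman-Wunsch) or local (Smith-Waterman) alignment.

    `score(a, b)` returns the substitution score of two PBs; `gap` is a linear,
    negative penalty charged for every gapped position.
    """
    cols = len(target) + 1
    # First row: leading gaps are free in local mode, cost j * gap in global mode.
    prev = [0.0 if local else j * gap for j in range(cols)]
    best = 0.0
    for i, a in enumerate(query, start=1):
        curr = [0.0 if local else i * gap]
        for j, b in enumerate(target, start=1):
            cell = max(prev[j - 1] + score(a, b), prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0.0)  # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


def toy_score(a, b):
    """Toy substitution score (assumption): +1 for identical PBs, -1 otherwise."""
    return 1.0 if a == b else -1.0


chain = "ccmmmmdd"  # hypothetical query chain extending beyond the domain
domain = "mmmm"     # hypothetical database domain

la = alignment_score(chain, domain, toy_score, gap=-3.0, local=True)
ga = alignment_score(chain, domain, toy_score, gap=-3.0, local=False)
```

With these toy inputs the local score reflects only the four matching positions, while the global score is pulled down by the four gapped positions needed to absorb the overhang, which is the behaviour described above for chains longer than the domain.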
Figure 2. PB alignment-based superimposition of SCOP domains d1dd3a1 (green) and
d1b6q__ (blue). The PB alignment illustrates how long successions of PB m can
contribute to the alignment score. The presence of an extra loop in protein 1DD3 is
indicated in red. It shows how a small variation at the local level can change
the orientation of regular structures. Structural alphabet notation was
explained by de Brevern et al.37,38
Table VI. Structure Mining/Comparison Methods Tested in the Present Study

Program   URL                                                         Methodology used
CE        http://cl.sdsc.edu/ce.html                                  Inter residue distances
DALI      http://www.ebi.ac.uk/dali                                   Cα distance matrices
DEJAVU    http://xray.bmc.uu.se/usf/dejavu.html                       SSEs comparison
FATCAT    http://fatcat.ljcrf.edu/                                    RMSD and introduction of twists
LOCK      http://brutlag.stanford.edu/lock2                           RMSD minimization
MATRAS    http://biunit.aist-nara.ac.jp/matras                        Markov transition model
PB-ALIGN  http://bioinformatics.univ-reunion.fr/PBE/PBE-ALIGN.htm     PBs substitution matrix and alignment
PRIDE     http://hydra.icgeb.trieste.it/pride                         Cα distance distribution
SSM       http://www.ebi.ac.uk/msd-srv/ssm                            SSEs vector comparison
TOP^a     http://bioinfol.mbfys.lu.se/top                             SSEs alignments
TOPS      http://balabio.dcs.gla.ac.uk/tops/compare.html              SSEs symbolic representation and comparison
TOPSCAN   http://www.bioinf.org.uk/topscan                            SSE representation in topology strings, aligned through a global dynamic alignment algorithm
VAST      http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html  SSEs graph representation
YAKUSA    http://www.rpbs.jussieu.fr/yakusa                           Internal coordinates matching

^a Web link unreachable.
Using LA, we queried each protein chain against the databank and
compiled the results simply by recording whether a true hit is found
within the Top 10 alignments, the number of members found in the Top
10, and the rank of the first false positive. Alignments are ranked
by the score generated by the LA algorithm; normalized scores and Z
scores are also reported for each hit. Out of the 61 queries, two
(1rlr and 1vmo) had no super-family or family members in our databank
other than themselves. In total, we therefore tested 59 cases for the
general efficiency of the method and compared our results with those
reported by Carpentier et al. (Table VII). Overall, PB-ALIGN achieved
a success rate of 96.6% when the Top 10 ranking alignments are
considered.
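The three quantities recorded for each query, and the overall success rate, can be sketched as follows. The representation of a hit list as booleans and the helper names are assumptions made for illustration; only the 57-out-of-59 tally is taken from Table VII.

```python
def summarize_query(hit_is_true, top=10):
    """Summarize one ranked hit list given as booleans (True = true hit), best first.

    Returns (true hit found within the first `top` ranks, number of true hits in
    those ranks, 1-based rank of the first false positive or None if there is none).
    """
    head = hit_is_true[:top]
    first_false = next(
        (rank for rank, ok in enumerate(hit_is_true, start=1) if not ok), None
    )
    return any(head), sum(head), first_false


def success_rate(summaries):
    """Percentage of queries with at least one true hit in the top ranks."""
    return 100.0 * sum(1 for found, _, _ in summaries if found) / len(summaries)


# Tally matching Table VII: 57 of the 59 tested queries find a true hit.
tally = [(True, 1, None)] * 57 + [(False, 0, 1)] * 2
```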
Our method performs correctly in all but two cases. The toxin protein
1ciy (PDB id) from Bacillus thuringiensis has three domains, namely
the δ-Endotoxin C-terminal, middle, and N-terminal domains. PB-ALIGN
extracts the C-terminal and N-terminal domains very easily among the
top hits but misses the target middle domain (SCOP code b.77.2.1) of
1ciy among the initial hits; it is found only at the 48th rank.
Because of this low rank, we counted this hit as negative in our
final results. The second protein, 1gj (PDB id) from E. coli, has two
domains, and the target domain (GreA transcript cleavage protein,
C-terminal domain, SCOP code d.26.1.1) is present as the lone member
of its family in our database. PB-ALIGN is not able to identify
target family (single-member family) or super-family members among
the top hits. Target super-family members are obtained at the 12th
and 16th ranks if mining is performed with a gap penalty of −3.0. Our
assessment also found that for only five of the 59 queries tested was
the rank of the first false positive above the rank of a true hit,
and there was not a single case where the first true positive was
ranked after a false hit. Moreover, the time taken to query each
protein was less than a minute, making it one of the fastest and most
efficient structure comparison methods among the 13 methods (see
Table VII) evaluated in the present study.
Performance of PB-ALIGN on nontrivial dataset
We further tested our method’s performance on a nontrivial dataset
used by Carpentier et al.,10 which consists of 14 difficult cases. We
queried each protein against our database using both the LA and GA
algorithms and considered the first 100 alignments as well as a
cut-off score value of −0.25 for GA (see below). The results of this
exercise are comparable with the data previously published by
Carpentier et al.10 On this limited dataset, PB-ALIGN performed on
par with YAKUSA and CE and gave about a 50% success rate with the
combined effort of the LA and GA algorithms. However, the efficiency
of PB-ALIGN as a pairwise comparison tool still needs to be assessed
in an in-depth benchmark against extensively tested, robust, and
standard methods such as CE, DALI, and VAST. Nevertheless, testing
both the LA and GA approaches on the nontrivial dataset gave
interesting insight into our method. The LA algorithm found only four
targets out of the 14 tested, whereas GA found a total of six
targets, gaining three hits not found by LA and missing one case
found by LA. Table VIII summarizes the results obtained using both
approaches. Application of GA not only identified more difficult
targets but also improved the rank of the target in two cases (Table
VIII). When the cut-off value (see next section) is applied as a
decision rule, GA captures seven targets above the threshold value.
Like YAKUSA, PB-ALIGN uses local structural features to describe and
encode the protein backbone, and identifying distantly related
proteins can be difficult because such pairs share structural
similarities at the global rather than the local level. The ability
of PB-ALIGN to use the GA algorithm to capture such remote
similarities at the global level using local descriptors (PBs) sets
it apart from methods like YAKUSA.
Test cases 1rcb and 1bge from the mainly-α class demonstrate this
very clearly, as shown by the GA of the query (1rcb) and target
(2gmf) PB sequences in Figure 3(a). Using GA, we were able to find
the target for query protein 1rcb among the top hits, with a low
alignment score. Looking at the PB alignment, it becomes obvious why
the LA method was not able to capture these global similarities. Both
proteins share four helical regions (populated by PBs m) separated by
loops and small strands (populated by PBs c and d), and the two
middle helical regions are of different lengths. Such global similarity
Table VII. Comparison of PB-ALIGN With Existing Structure Mining/Comparison Methods

Program    Mainly α (19)  Mainly β (19)  Mixed αβ (15)  Few SSEs (8)  Total (%)
PB-ALIGN   18^a           17^b           14             8             96.6
YAKUSA     17             19             14             8             95
CE         17             19             13             8             93
DALI       14             19             14             8             90
MATRAS     11             19             14             8             85
VAST       12             17             15             7             84
TOP        14             18             12             7             84
DEJAVU     14             19             9              4             75
TOPSCAN    15             12             9              7             70
TOPS       2              15             14             7             62
PRIDE      14             14             7              3             62
LOCK       0              14             11             8             54
SSM        5              13             10             5             54

Comparison of PB-ALIGN with 12 structure mining/comparison methods based
on results from Carpentier et al.10 The numbers along with the header give total
number of queries belonging to each class. All the hits are counted based on first
10 ranking alignments compared to 100 hits taken by Carpentier et al., only for
those methods that did not return the significance of hits.
^a One query has no target in our database.
^b For mainly β class, query protein 1vmo has no target in our database and query
1ciy misses target in top ten ranks.
can only be captured by the GA algorithm, which has a lower gap
penalty than the high penalty imposed by the LA approach. Use of a
simple GA algorithm combined with the PB substitution table
highlights the subtle similarities identified by the PB-ALIGN method
based on local backbone descriptors (PBs). Figure 3(b) shows a query
protein (1bge) for which the target protein (2gmf) was not found even
with GA. Close inspection of the PB alignment reveals that the four
helical regions are well aligned despite the differences in helix
length. The many gaps needed to obtain this alignment result in a
very low alignment score, which pushed the pair below the top hits.
Another example where the target (2fox) was found by using GA is
query protein 3chy from the mainly-αβ class; Figure 3(c) shows the
superimposition of 3chy and 2fox from ProFit based on the global PB
alignment. Once again, identification of such similarities is not
possible with LA techniques, because the variable regions have to be
accommodated by gaps in the alignment. The above results show that
the possibility of aligning PB sequences using either LA or GA
techniques offers the flexibility to recognize both strong local
similarities and distant (variable) global similarities shared by
proteins.
Figure 3. Global alignment of PB sequences. (a) PB sequence alignment and superimposed structures for protein pair 1rcb and 2gmfA. Target protein 2gmfA is found after using GA.
(b) PB sequence alignment and superimposed structures for protein pair 1bgeB and 2gmfA. GA fails to find target protein 2gmfA. (c) Superimposed structures of 3chy and
2fox based on GA of PB sequences. Structural alphabet notation is explained in de Brevern et al.37,38
Table VIII. Performance of Local and Global Alignment Algorithm on Nontrivial Dataset

Query     Target    Local       Global      Cut-off value
protein   protein   alignment   alignment   on GA alignment
1aep      256b:A    0           0           0 (−0.29)
2mta:C    1ycc      0           0           0 (−0.74)
1rcb      2gmf:A    0           1 (48)^a    1 (+0.02)
1bge:B    2gmf:A    0           0           0 (−0.34)
2afn:A    1aoz:A    1 (10)^b    0           0 (−0.48)
3hla:B    2rhe      0           0           1 (+0.23)
2aza:a    1paz      0           0           0 (−0.29)
1cew:I    1mol:A    0           1 (59)^a    1 (−0.06)
1dsb      2trx:A    0           0           0 (−0.68)
1fxi:A    1ubq      1 (42)      1 (28)^c    1 (+0.26)
3chy      2fox      0           1 (33)^a    1 (+0.18)
1gpl      2trx:A    0           0           0 (−1.94)
1hip      2hip:A    1 (6)       1 (7)       1 (+0.56)
1isu:A    2hip:A    1 (59)      1 (15)^c    1 (+0.14)

In total, both methods were able to find seven target proteins. Success and failure
of GA and LA is indicated by 0 (failure) and 1 (success). Here, target rank is indicated
within parentheses. Success of GA based on the application of a cut-off value
of −0.25 is also indicated by 0 if score < −0.25 (failure) and 1 if score is > −0.25
(success). Values within parentheses are the normalized alignment score (see text
for details).
^a Target exclusively found by GA.
^b Target protein missed by GA.
^c Improvement in rank using GA.
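The decision rule in the last column of Table VIII can be reproduced with a few lines of code. This is only a sketch: the normalized scores are transcribed from the table with the signs implied by its 0/1 flags, and the names `CUTOFF`, `PAIRS`, and `passes_cutoff` are illustrative assumptions.

```python
CUTOFF = -0.25  # normalized-score threshold applied to GA alignments

# (query, target, normalized GA score) as listed in Table VIII.
PAIRS = [
    ("1aep", "256b:A", -0.29), ("2mta:C", "1ycc", -0.74),
    ("1rcb", "2gmf:A", 0.02), ("1bge:B", "2gmf:A", -0.34),
    ("2afn:A", "1aoz:A", -0.48), ("3hla:B", "2rhe", 0.23),
    ("2aza:a", "1paz", -0.29), ("1cew:I", "1mol:A", -0.06),
    ("1dsb", "2trx:A", -0.68), ("1fxi:A", "1ubq", 0.26),
    ("3chy", "2fox", 0.18), ("1gpl", "2trx:A", -1.94),
    ("1hip", "2hip:A", 0.56), ("1isu:A", "2hip:A", 0.14),
]


def passes_cutoff(score, cutoff=CUTOFF):
    """True when a normalized alignment score lies above the cut-off."""
    return score > cutoff


accepted = [query for query, _, score in PAIRS if passes_cutoff(score)]
```

Seven of the 14 queries pass the threshold, in agreement with the seven targets reported for GA with the cut-off applied.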
Protein 2afn was the only query where LA outperformed GA. The main
reason for the GA failure is that the query protein chain 2afnA
contains multiple domains, so GA against our database needs to
introduce a large number of gaps, resulting in a low alignment score,
whereas with the LA algorithm only the probe and target domains are
aligned, with a high alignment score. These results indicate that
using the GA algorithm on sequences of PBs (encoding local structure
variations) is better suited to situations where structural
similarities are shared at the global level and are difficult to
obtain with LA techniques. In situations where protein chains are
suspected to contain multiple domains, or where one protein structure
is completely or partially contained in another, the LA approach is
more advantageous.
GA of a few difficult pairs showed that, at the global level, the
method was able to align equivalent regions in two proteins but that,
because of the low scores, such pairs were missed altogether by the
database search approach. We extended the above analysis by studying
pairwise alignments of tough cases and compared the alignment results
with a flexible alignment method called FATCAT. We selected the 10
difficult pairs used by Ye et al. and compared PB-ALIGN with VAST,
DALI, CE, and FATCAT based on the number of residues aligned and the
superposition rmsd obtained. Table IX shows the number of residues
aligned along with the rmsd values (within brackets) based on GA of
PB sequences. The other methods are compared based on the results
obtained by Ye et al. In our case, superposition of the two proteins
was done using ProFit based on the alignment provided by PB-ALIGN,
and further iterations were performed by ProFit to obtain the final
results. In other words, the results presented in Table IX are the
combined effort of PB-ALIGN and ProFit. To measure the contribution
of ProFit, we also performed another exercise in which superposition
was based on the sequence alignment generated by ProFit to define the
initial equivalent zones, with iteration from there on to get the
final values. The results are very interesting, since ProFit by
itself gave results comparable to FATCAT and PB-ALIGN in 7 out of 10
cases. The remaining three pairs (1cewI, 1molA; 1cid_, 2rhe_; 1crl_,
1ede_) gave much improved results when combined with PB-ALIGN. This
outcome has two implications: first, ProFit by itself is a good
enough method to superimpose protein pairs sharing low sequence
similarity and can achieve results comparable to more complex and
robust methods; second, in the cases where ProFit fails to find
optimal results by simple amino acid sequence alignment, the PB
alignment provides good starting points for ProFit that are
unidentifiable by sequence alignment alone. In all 10 cases, PB-ALIGN
coupled with ProFit gave satisfactory results compared to other, more
complex methods. Compared to the flexible alignment and
superimposition method FATCAT, PB-ALIGN gave a lower rmsd in most
cases, with slightly fewer residues superposed. It is noteworthy that
methods like FATCAT have a real advantage in this study, as they
introduce twists in structures to superimpose more residues with low
rmsd; despite this advantage, our simple methodology gave decent
results. The only test case (pair 1crl_, 1ede_) where PB-ALIGN aligns
significantly fewer residues than FATCAT (179 vs. 269; Table IX) can
be understood from the fact that FATCAT introduced five twists in the
protein structure to superimpose 269 residues with an rmsd of 3.55 Å.
In its present form, although PB-ALIGN aligns PB sequences in a
flexible manner, it cannot produce such results because it relies on
a rigid-body superposition method. The objective of this analysis was
not to compete with methods like FATCAT (which we believe will in
principle give better results, especially in cases where twists are
needed to superpose structures) but to highlight that (i) despite
using a very simple approach, the PB alignment technique gives
comparable results in most situations with minimal computation time,
making it practical for large-scale analysis in real-life situations,
and (ii) premises
Table IX. Comparison of PB-ALIGN and FATCAT

Protein 1  Protein 2  VAST       DALI        CE          FATCAT        PBALIGN    ProFit
1fxiA      1ubq_      48 (2.1)   60 (2.6)^a  64 (3.8)^a  63 (3.01)     59 (2.6)   55 (2.0)
1ten_      3hhrB      78 (1.6)   86 (1.9)    87 (1.9)    87 (1.9)      82 (4.1)   84 (4.0)
3hlaB      2rhe_      -          63 (2.5)    85 (3.5)    79 (2.81)^b   67 (2.4)   73 (2.6)
2azaA      1paz_      74 (2.2)   81 (2.5)^a  85 (2.9)    87 (3.01)     79 (2.3)   79 (1.9)
1cewI      1molA      71 (1.9)   81 (2.3)    69 (1.9)    83 (2.44)     74 (2.5)   16 (2.5)
1cid_      2rhe_      85 (2.2)   95 (3.3)    94 (2.7)    100 (3.11)    87 (2.2)   25 (2.9)
1crl_      1ede_      -          211 (3.4)   187 (3.2)   269 (3.55)^b  179 (2.3)  75 (2.9)
2sim_      1nsbA      284 (3.8)  286 (3.8)   264 (3.0)   286 (3.07)    262 (2.4)  264 (2.4)
1bgeB      2gmfA      74 (2.5)   98 (3.5)    94 (4.1)    100 (3.19)    90 (2.4)   88 (2.4)
1tie_      4fgf_      82 (1.7)   108 (2.0)   116 (2.9)   117 (3.05)    105 (2.2)  104 (2.1)

Comparison of various methods based on number of residues aligned and rmsd (within brackets) obtained for 10 difficult examples based on global alignment. Results
for other methods are taken from Ye et al.30
^a No results obtained in previous study done by Ye et al.30
^b FATCAT introduced twists in structures to perform superposition.
for flexible structural superimposition as performed by
FATCAT are featured in the method of PB alignments
owing to the nature of the algorithm used.
Handling of multidomain proteins
PB-ALIGN was also tested on two multidomain proteins, 2src_ and 2hckA
(human Src and Hck kinase proteins, respectively), to assess how
efficiently the method handles protein chains containing multiple
domains during database search. The database used in our case is a
collection of domains based on the SCOP classification; hence, we
assessed whether the method is able to extract the target domains
among the top hits. As seen above, the LA technique has advantages
over the GA algorithm in such cases. Hence, in the present analysis,
we used the LA algorithm to extract the different domains from the
database. Based on the SCOP classification, the query proteins are
composed of three different domains, namely the SH3 domain, the SH2
domain, and the protein kinases catalytic subunit. Our evaluation on
multidomain proteins is slightly different from the earlier
studies,3,10 where success was measured by whether hits contained all
four domains (based on the CATH classification), followed by proteins
having two or one domain. In our study, we assessed whether all the
domains (SH3 domain, SH2 domain, and protein kinases catalytic
subunit, based on the SCOP definition) are present among the Top 100
hits. Previous studies reported that YAKUSA, DALI, VAST, MATRAS, and
CE gave the best results in handling multidomain cases. SSM found
proteins having all four domains, and TOP and DEJAVU found structures
sharing more than one domain while turning a blind eye to
single-domain structures. LOCK managed to find a representative of
each domain but failed to find proteins having all domains in a
single chain. TOPSCAN, PRIDE, and TOPS were among the least efficient
methods. PB-ALIGN was able to find all three target domains among the
top hits. The SH2 and kinases catalytic subunit domains were most
easily found and populated the top hits. The SH3 domain was always
found at lower ranks (61st and 37th rank), and
Figure 4. Distribution of normalized scores after PBs alignment between pairs of proteins from the same fold or super-family (top) and from different folds or super-families
(bottom).
this can be attributed to its smaller length and to the high
population of the other two domains in our database.
Cut-off threshold for PB-ALIGN scores to recognize common folds
On the basis of the assessment exercises described above, we worked
out a recommended threshold for PB-ALIGN scores that allows one to
designate a hit in a structural database as having the same fold as
the query. Figure 4 shows the distribution of PB-ALIGN scores for the
cases of common fold and different fold (according to SCOP). This
figure shows that a range of scores is occupied by proteins with the
same fold as well as by proteins with different folds. As fold space
is a continuum, it is difficult to find a precise score that
completely segregates same folds from different folds. We hence
analyzed the variation of sensitivity and specificity for different
normalized score thresholds (see Fig. 5). Because we typically want
to minimize false positives while keeping an acceptable level of
sensitivity (true positives), we propose to select the cut-off at a
stringent specificity value of 0.95. At this specificity, the
normalized score cut-off was −0.250 and the sensitivity for proteins
from the same fold was 0.75. Similarly, for proteins from the same
super-family, the score cut-off was −0.252 with a sensitivity of
0.87. Hence, we suggest a threshold value of −0.250 to discriminate
proteins from the same fold or the same super-family. This cut-off
correctly demarcates same folds or super-families from different
folds or super-families in most, although not all, cases. With this
cut-off, 75% and 87% of the proteins with the same fold and
super-family as the query, respectively, are correctly picked up by
PB-ALIGN. Importantly, the rate of false hits is only 5% in both
situations.
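The selection of a cut-off at a fixed specificity can be sketched as below. This is an illustration of the general procedure, not the authors' code: the score lists are hypothetical, and the convention that a pair is accepted when its score is strictly greater than the cut-off is an assumption.

```python
def cutoff_for_specificity(negative_scores, target_specificity=0.95):
    """Smallest observed negative score t such that the rule `score > t` rejects
    at least `target_specificity` of the negatives (pairs with different folds)."""
    for t in sorted(negative_scores):
        rejected = sum(1 for s in negative_scores if s <= t)
        if rejected / len(negative_scores) >= target_specificity:
            return t
    raise ValueError("negative_scores must not be empty")


def sensitivity(positive_scores, cutoff):
    """Fraction of same-fold pairs scoring above the cut-off."""
    return sum(1 for s in positive_scores if s > cutoff) / len(positive_scores)


def specificity(negative_scores, cutoff):
    """Fraction of different-fold pairs scoring at or below the cut-off."""
    return sum(1 for s in negative_scores if s <= cutoff) / len(negative_scores)


# Hypothetical normalized scores for different-fold and same-fold pairs.
negatives = [i / 10 - 2 for i in range(20)]
positives = [-0.5, -0.1, 0.0, 0.3]
cutoff = cutoff_for_specificity(negatives)
```

Fixing the specificity first and reading off the sensitivity afterwards mirrors the choice made above, where false positives are considered more costly than missed hits.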
Understanding whether an observed PB sequence similarity is just a
chance event is the central problem in evaluating the statistical
significance of alignment scores. The basic question is: what is the
probability that a similarity score as great as that actually
observed between real sequences could have arisen by chance, when
sampling from suitably defined populations of unrelated sequences? To
address this question, the distribution of GA scores from real but
unrelated sequences
Figure 5. Analysis of variation of sensitivity and specificity according to different cut-off scores.
Table X. Estimates of Extreme Value Distribution Parameters for Alignment Scores
Between Real but Unrelated Sequences of Different Sequence Length Subsets
(See Also Fig. 7)^a

Length (number   Shape          Scale          Location
of residues)     parameter (ξ)  parameter (σ)  parameter (μ)
40               0.376          0.673          0.993
70               0.359          0.539          0.831
100              0.368          0.506          0.774
130              0.349          0.418          0.685
160              0.397          0.353          0.621
190              0.406          0.384          0.644
220              0.416          0.354          0.599
250              0.424          0.358          0.535
280              0.444          0.337          0.480
310              0.402          0.261          0.519
340              0.435          0.278          0.540
370              0.456          0.262          0.520
400              0.477          0.242          0.499

^a These parameters were derived using gev function in evir package implemented
in R statistical software.51
for different length subsets (40-aa through 400-aa in 30-aa
increments; see also Table X) needs to be analyzed. As shown for
three examples in Figure 6, alignment scores were distributed
according to an extreme value distribution (EVD). We hence derived
all three corresponding EVD parameters (Table X), which could be used
to measure the confidence with which alignment scores classify two
proteins as having the same fold or belonging to the same
super-family. The scale parameter (σ) decreased linearly with
alignment length, which further supports the inferred EVD model (see
Fig. 7).
Figure 6. Distribution of scores from global alignments of real but unrelated sequences (RUSs) datasets of 40-aa, 190-aa, and 400-aa long. The distribution of the scores was
estimated with an extreme value distribution curve, indicated by a solid line, using the evir package from the R statistical software.51 The corresponding quantile plots are
displayed on the right.
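As a hedged illustration of how fitted EVD parameters can be turned into a confidence measure, the sketch below evaluates the survival function of a generalized extreme value distribution in its standard three-parameter form. It assumes that this form matches the parametrization of the fitted values in Table X; how a given PB-ALIGN score maps onto the fitted variable is not specified in this section, so the function is shown on its own rather than applied to a real alignment score.

```python
import math


def gev_survival(x, shape, scale, location):
    """P(X > x) for a generalized extreme value distribution with
    F(x) = exp(-(1 + shape * z) ** (-1 / shape)), z = (x - location) / scale,
    defined where 1 + shape * z > 0."""
    z = (x - location) / scale
    if shape == 0.0:  # Gumbel limit of the family
        return 1.0 - math.exp(-math.exp(-z))
    t = 1.0 + shape * z
    if t <= 0.0:
        # Outside the support: below the lower bound when shape > 0
        # (survival = 1), above the upper bound when shape < 0 (survival = 0).
        return 1.0 if shape > 0 else 0.0
    return 1.0 - math.exp(-(t ** (-1.0 / shape)))


# Parameters listed in Table X for the 100-residue subset.
SHAPE, SCALE, LOCATION = 0.368, 0.506, 0.774
```

The survival value is the probability that an alignment of unrelated sequences of comparable length reaches at least the given value of the fitted variable, so smaller values indicate a less likely chance similarity.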
CONCLUSIONS
In the era of structural genomics, protein structure comparison and mining play an important role in computational biology. Identification of new phylogenetic relationships, functional annotation, and the study of sequence-to-structure relationships are some of their most common targets. In this study, we have presented a new structure mining method called PB-ALIGN, based on the encoding of the protein backbone as a sequence of short local motifs (PBs) and their alignment using a newly derived PB substitution matrix and simple dynamic programming. The method is simple, scales to large-scale analysis, and provides an alternative tool for structural genomics. Describing the protein backbone with local structural features (PBs) and aligning such features provides an alternative to established methods like DALI and CE for mining protein domains. Existing methods rely on SSE information for structure alignment and miss a large amount of structural information outside the regular structures in proteins. The PB representation of the protein backbone highlights subtle variations and structural conservation present beyond local regular structures, for example, the N and C caps of helices and strands. The capability of PB-ALIGN to align these regions is a step ahead of existing methods.
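To make the idea of aligning PB strings with simple dynamic programming and a linear gap penalty concrete, here is a minimal sketch of a global (Needleman-Wunsch style) aligner. It is not the PB-ALIGN implementation: toy_pb_score is a placeholder standing in for the published PB substitution matrix, and the gap value and the example PB strings are arbitrary assumptions.

```python
def toy_pb_score(a: str, b: str) -> float:
    """Placeholder substitution score; NOT the published PB substitution matrix."""
    return 2.0 if a == b else -1.0


def global_align(s: str, t: str, gap: float = -2.0, score=toy_pb_score):
    """Global alignment of two PB strings with a linear gap penalty.

    Returns (alignment_score, aligned_s, aligned_t).
    """
    n, m = len(s), len(t)
    # dp[i][j] holds the best score for aligning s[:i] with t[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align two PBs
                dp[i - 1][j] + gap,                            # gap in t
                dp[i][j - 1] + gap,                            # gap in s
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))


# Hypothetical PB strings over the 16-letter alphabet (a to p).
ga_score, row_s, row_t = global_align("mmkopacd", "mmkpacd")
print(ga_score)
print(row_s)
print(row_t)
```

Swapping toy_pb_score for a lookup into a real 16 x 16 PB substitution matrix would leave the recursion unchanged; only the per-pair scores differ.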
The method performs reasonably well in mining structurally similar proteins from a large database at both the fold and super-family levels. In comparison with its peers, PB-ALIGN was on par with other methods in mining structures and was the best in terms of speed. The ability to obtain good mining capacity at high speed highlights the simplicity and effectiveness of the method. Generally speaking, the efficiency of PB-ALIGN as a pairwise comparison tool still needs to be assessed in an in-depth benchmark against extensively tested, robust, and standard methods such as CE, DALI, and VAST. Nevertheless, compared to methods like YAKUSA that also rely on a representation of local structural features, our method performed better on general test data and on par on the difficult (nontrivial) dataset. The main difference between these two methods is the final objective. YAKUSA aims to locate strong gap-free local structural similarities or “blocks”10 between two proteins and is not concerned with global similarities spread over the protein length, whereas PB-ALIGN, despite using local structural features, aims to address both local and global similarity between proteins. The availability of a PB substitution matrix and the use of local or global sequence alignment techniques help to address both local and global structural similarities. Both alignment techniques are shown to be useful in different conditions: LA is beneficial if one wants to identify strong local similarities, if the protein chain is multidomain, or if one protein is completely or partially included in the other structure, whereas GA is useful if the two proteins share structural similarities spread across the protein length. Our experience suggests that users should run both LA and GA; manual inspection of the PB alignments will clearly highlight the best approach. Another advantage we found in PB-ALIGN is the intuitive nature of a PB alignment as a representation of the structure alignment; many times, simple inspection of the alignment gives a hint about structural differences between proteins prior to 3D visualization.
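The case where one protein is fully contained in a longer chain illustrates why local alignment is preferred there. The sketch below is a generic Smith-Waterman style local aligner over PB strings, not the PB-ALIGN code; toy_pb_score is a placeholder for the PB substitution matrix, and the PB strings and gap penalty are invented for illustration.

```python
def toy_pb_score(a: str, b: str) -> float:
    """Placeholder substitution score; NOT the published PB substitution matrix."""
    return 2.0 if a == b else -1.0


def local_align(s: str, t: str, gap: float = -2.0, score=toy_pb_score):
    """Local alignment of two PB strings with a linear gap penalty.

    Returns (best_score, aligned_segment_of_s, aligned_segment_of_t).
    """
    n, m = len(s), len(t)
    # dp[i][j] is the best score of any local alignment ending at s[i-1], t[j-1];
    # the floor at zero lets an alignment start anywhere.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0.0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                0.0,
                dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
            if dp[i][j] > best:
                best, best_i, best_j = dp[i][j], i, j
    # Trace back from the best cell until the score drops to zero.
    out_s, out_t = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and dp[i][j] > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))


# Hypothetical example: a short domain embedded in a longer chain.
domain = "klmmmmmmmmnop"
long_chain = "acddf" + domain + "fkopa"
la_score, seg_chain, seg_domain = local_align(long_chain, domain)
print(la_score, seg_chain, seg_domain)
```

Here the local score depends only on the shared region, whereas a global alignment of the same pair would also pay gap penalties for the unmatched flanking PBs of the longer chain.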
Importantly, this study derived a score cut-off for inferring structural similarity, from their PB representations, between structural domains whose relationship is unknown. It further characterized the extreme value distribution of GA scores of real but unrelated PB sequences. This information is currently being used to implement a statistical significance threshold in PB-ALIGN.
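One way such a significance threshold could work is to convert a GA score into the probability of seeing a score at least that high among unrelated sequences of similar length. The sketch below is an assumption-laden illustration, not the PB-ALIGN implementation: it assumes the Table X estimates follow the standard generalized extreme value parameterization (shape, scale, location) used by the evir gev function, and that the query score is on the same scale as the scores used for fitting. The names EVD_PARAMS, gev_cdf, and gev_pvalue are hypothetical.

```python
import math

# (shape, scale, location) for three length subsets, transcribed from Table X.
EVD_PARAMS = {
    40: (0.376, 0.673, 0.993),
    190: (0.406, 0.384, 0.644),
    400: (0.477, 0.242, 0.499),
}


def gev_cdf(x: float, shape: float, scale: float, location: float) -> float:
    """CDF of the generalized extreme value distribution.

    F(x) = exp(-(1 + shape * (x - location) / scale) ** (-1 / shape)),
    with the Gumbel limit exp(-exp(-z)) when shape == 0.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    z = (x - location) / scale
    if shape == 0:
        return math.exp(-math.exp(-z))
    t = 1.0 + shape * z
    if t <= 0:
        # Outside the support: below the lower bound (shape > 0)
        # or above the upper bound (shape < 0).
        return 0.0 if shape > 0 else 1.0
    return math.exp(-(t ** (-1.0 / shape)))


def gev_pvalue(x: float, shape: float, scale: float, location: float) -> float:
    """Probability that an unrelated pair scores at least x under the fitted model."""
    return 1.0 - gev_cdf(x, shape, scale, location)


# Example: P-values for two illustrative scores using the 190-residue subset.
shape, scale, location = EVD_PARAMS[190]
for query_score in (1.0, 3.0):
    print(query_score, gev_pvalue(query_score, shape, scale, location))
```

In practice the parameters would be chosen, or interpolated, for the length subset closest to the proteins being compared, and a pair would be reported as significant when its P-value falls below a chosen level.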
Comprehensive assessment of our methodology also highlighted some shortcomings and the need for further fine-tuning of PB-ALIGN. At present, we have used simple dynamic programming and a linear gap penalty. We believe that a more robust flavor of dynamic programming and a change in gap penalty will further improve the alignment quality. The artificial increase in alignment score due to long stretches of regular structure, especially in α class proteins, is also being looked into, and a change in the scoring function is anticipated. Furthermore, we would like to introduce a combined scoring function taking into account the number of aligned residues, the rmsd value, and the alignment score. We believe that using the PB alignment methodology to perform multiple alignment of family members would enable us to define “core” structures having boundaries beyond SSEs and would help in finding distant homologues. PB-ALIGN is also expected to be useful in homology modeling and loop modeling.

Figure 7. Variation of the estimates of the EVD scale parameter (σ) with the length of protein sequences, calculated from global alignments of real but unrelated sequences (RUSs) datasets (see Table X). The fitted regression line (solid line), with R² = 0.85 (P < 0.0001), was calculated using the lm function from R statistical software.51
ACKNOWLEDGMENTS
We thank the informatics department for their support during access to the supercomputer facility. We thank the anonymous reviewers for their fruitful comments. N.S. is a senior fellow of the Wellcome Trust, London, and is a visiting Professor at the Reunion University.
REFERENCES
1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic Acids Res 2000;28:235–242.
2. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247:536–540.
3. Novotny M, Madsen D, Kleywegt GJ. Evaluation of protein fold comparison servers. Proteins 2004;54:260–270.
4. Kolodny R, Koehl P, Levitt M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J Mol Biol 2005;346:1173–1188.
5. Perutz MF. Structure of hemoglobin. Brookhaven Symp Biol 1960;13:165–183.
6. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH—a hierarchic classification of protein domain structures. Structure 1997;5:1093–1108.
7. Shindyalov IN, Bourne PE. An alternative view of protein fold space. Proteins 2000;38:247–260.
8. Tyagi M, Sharma P, Swamy CS, Cadet F, Srinivasan N, de Brevern AG, Offmann B. Protein block expert (PBE): a web-based protein structure analysis server using a structural alphabet. Nucleic Acids Res 2006;34:W119–W123.
9. Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J Mol Biol 1993;233:123–138.
10. Carpentier M, Brouillet S, Pothier J. YAKUSA: a fast structural database scanning method. Proteins 2005;61:137–151.
11. Krissinel E, Henrick K. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 2004;60(Pt 12 Pt 1):2256–2268.
12. Blades MJ, Ison JC, Ranasinghe R, Findlay JBC. Automatic generation and evaluation of sparse protein signatures for families of protein structural domains. Protein Sci 2005;14:13–23.
13. Miguel RN. Sequence patterns derived from the automated prediction of functional residues in structurally-aligned homologous protein families. Bioinformatics 2004;20:2380–2389.
14. Domingues FS, Lackner P, Andreeva A, Sippl MJ. Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. J Mol Biol 2000;297:1003–1013.
15. Sauder JM, Arthur JW, Dunbrack RL, Jr. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 2000;40:6–22.
16. Friedberg I, Kaplan T, Margalit H. Evaluation of PSI-BLAST alignment accuracy in comparison to structural alignments. Protein Sci 2000;9:2278–2284.
17. Escalier V, Pothier J, Soldano H, Viari A. Pairwise and multiple identification of three-dimensional common substructures in proteins. J Comput Biol 1998;5:41–56.
18. Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998;11:739–747.
19. Taylor WR, Orengo CA. Protein structure alignment. J Mol Biol 1989;208:1–22.
20. Levine M, Stuart D, Williams J. A method for the systematic comparison of the three-dimensional structures of proteins and some results. Acta Crystallogr 1984;40:600–610.
21. Usha R, Murthy MR. Protein structural homology: a metric approach. Int J Pept Protein Res 1986;28:364–369.
22. Singh AP, Brutlag DL. Hierarchical protein structure superposition using both secondary structure and atomic representations. Proc Int Conf Intell Syst Mol Biol 1997;5:284–293.
23. Lu G. TOP: a new method for protein structure comparisons and similarity searches. J Appl Crystallogr 2000;33:176–183.
24. Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Curr Opin Struct Biol 1996;6:377–385.
25. Harrison A, Pearl F, Mott R, Thornton J, Orengo C. Quantifying the similarities within fold space. J Mol Biol 2002;323:909–926.
26. Zuker M, Somorjai RL. The alignment of protein structures in three dimensions. Bull Math Biol 1989;51:55–78.
27. Subbiah S, Laurents DV, Levitt M. Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol 1993;3:141–148.
28. Vriend G. WHAT IF: a molecular modeling and drug design program. J Mol Graph 1990;8:52–56, 29.
29. Shatsky M, Nussinov R, Wolfson HJ. Flexible protein alignment and hinge detection. Proteins 2002;48:242–256.
30. Ye Y, Godzik A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 2003;19 (Suppl 2):II246–II255.
31. Offmann B, Tyagi M, de Brevern AG. Local protein structures. Curr Bioinf 2007;2:1–38.
32. Jones TA, Thirup S. Using known substructures in protein model building and crystallography. EMBO J 1986;5:819–822.
33. Unger R, Harel D, Wherland S, Sussman JL. A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins 1989;5:355–373.
34. Levitt M. Accurate modeling of protein conformation by automatic segment matching. J Mol Biol 1992;226:507–533.
35. Unger R, Sussman JL. The importance of short structural motifs in protein structure analysis. J Comput Aided Mol Des 1993;7:457–472.
36. Yang JM, Tung CH. Protein structure database search and evolutionary classification. Nucleic Acids Res 2006;34:3646–3659.
37. de Brevern AG. New assessment of a structural alphabet. In Silico Biol 2005;5:283–289.
38. de Brevern AG, Etchebest C, Hazout S. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 2000;41:271–287.
39. Tyagi M, Gowri VS, Srinivasan N, de Brevern AG, Offmann B. A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications. Proteins 2006;65:32–39.
40. Dudev T, Lim C. Effect of carboxylate-binding mode on metal binding/selectivity and function in proteins. Acc Chem Res 2007;40:85–93.
41. Tyagi M, Sharma P, Swamy CS, Cadet F, Srinivasan N, de Brevern AG, Offmann B. Protein Block Expert (PBE): a web-based protein structure analysis server using a structural alphabet. Nucleic Acids Res 2006;34 (Web Server issue):W119–W123.
42. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 1981;147:195–197.
43. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48:443–453.
44. Schuchhardt J, Schneider G, Reichelt J, Schomburg D, Wrede P. Local structural motifs of protein backbones are classified by self-organizing neural networks. Protein Eng 1996;9:833–842.
45. Tyagi M, Venkataraman SG, Srinivasan N, de Brevern AG, Offmann B. A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications. Proteins 2006;65:32–39.
46. Balaji S, Sujatha S, Kumar SS, Srinivasan N. PALI—a database of Phylogeny and ALIgnment of homologous protein structures. Nucleic Acids Res 2001;29:61–65.
47. Gowri VS, Pandit SB, Karthik PS, Srinivasan N, Balaji S. Integration of related sequences with protein three-dimensional structural families in an updated version of PALI database. Nucleic Acids Res 2003;31:486–488.
48. Russell RB, Barton GJ. Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins 1992;14:309–323.
49. McLachlan AD. Rapid comparison of protein structures. Acta Crystallogr A Fundam Crystallogr 1982;38:871–873.
50. Lu G. TOP: a new method for protein structure comparisons and similarity search. J Appl Crystallogr 2000;33:176–183.
51. Ihaka R, Gentleman R. A language for data analysis and graphics. J Comput Graph Stat 1996;5:299–314.