Motif Analyzer for protein 3D structures


Journal of Structural Biology 186 (2014) 62–67

journal homepage: www.elsevier.com/locate/yjsbi

Evgeniy Aksianov

Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Leninskye Gory, 1/40, 119992 Moscow, Russia

E-mail address: [email protected]

Abbreviation: SSE, secondary structural element.

http://dx.doi.org/10.1016/j.jsb.2014.02.017
1047-8477/© 2014 Elsevier Inc. All rights reserved.

Article info

Article history: Received 4 December 2013; Accepted 26 February 2014; Available online 4 March 2014

Keywords: Protein structure; Structural motif; All-β- and α/β-structures; Structural motif detector

Abstract

The topology of a protein structure of the all-β or α/β class is a special arrangement of β-strands within β-sheets (and of α-helices surrounding β-sheets) together with their order along the polypeptide chain. Structural motifs are subsets of strands and/or helices with a widely spread topology. Structural motifs are used for the classification of protein structures. Because of the increasing variety of known structures, an automatic tool for motif detection is needed. MotAn is an algorithmic detector of structural motifs in a given 3D protein structure. It detects β-hairpins, β-meanders, β-helices, Greek keys, interlocks, jellyrolls, β-α-β-motifs and β-α-β-helices. MotAn was tested on selected SCOP families and shown to be a more sensitive detector than the PTGL and PROMOTIF programs. MotAn is available at http://mouse.belozersky.msu.ru/motan.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Polypeptide chains are folded into secondary structural elements (α-helices and β-strands) and loops between them. All secondary structural elements (SSEs) are ordered along the polypeptide chain. β-Strands form β-sheets, arrays of differently arranged strands. Generally, the order of strands along the polypeptide chain and their arrangement within sheets differ. A combination of strand order and arrangement is known as topology (see below for definition details) (Orengo et al., 1997).

Many topologies have been observed in the known structures (Gordeev et al., 2010; Zhang and Kim, 2000). It was noted that different topologies can share common parts, known as structural motifs or supersecondary structures (Chakrabarti et al., 2003; Guruprasad et al., 2000; Christine et al., 1993; Kister et al., 2002). For example, region 63–94 from chain A in PDB: 1PKJ and region 1–32 from chain A in PDB: 1J3A have the same topology, formed by two paired parallel strands and one helix in the following order: strand–helix–strand (a so-called right-handed β-α-β-motif).

Structural motifs are used for protein structure description and classification. For example, the Greek key structural motif is noted as a characteristic feature in 40 SCOP 1.75 (Murzin et al., 1995) folds. In many cases the definition of a motif is debatable. Because of the increasing number of known protein 3D structures, we are looking for an objective motif definition in the form of an automatic tool for detecting structural motifs.

Here the program called MotAn (Motif Analyzer) is introduced. Its input is a 3D structure in PDB format, and the output is a list of detected motifs. Currently, MotAn detects all-β and α/β-motifs. All-α motifs are not detected. The full list of detectable motifs is as follows: β-hairpins, β-meanders, β-helices, Greek keys, interlocks, jellyrolls, β-α-β-motifs and β-α-β-helices. MotAn is available at http://mouse.belozersky.msu.ru/motan as a web service, as source code and as downloadable versions for Windows and Linux platforms.

The program was tested on a subset of SCOP domains and compared with the PTGL database (May et al., 2004, 2010). It was shown that MotAn gives better results than PTGL. For the test we selected several SCOP families with clear motif annotations. For example, jellyrolls are noted for the SCOP fold b.121, containing 328 individual domains. Jellyrolls and similar motifs were detected correctly in 80 structures by PTGL and in 265 by MotAn. Similar results were obtained for the other motifs.

2. Materials and methods

2.1. Definitions

2.1.1. Helices and strands identifiers

α-Helices are denoted as h0, h1, ... Indexes indicate the order of helices along the polypeptide chain.

β-Sheets are denoted by letters A, B, etc. Two orthogonal directions, “top-to-bottom” and “left-to-right”, are defined on every sheet. See Fig. 1 for an explanation of these terms. There are two directions on the sheet that can be denoted as “top-to-bottom” (“left-to-right”); one of them is selected (see Section 2.1.2 for a description of which direction must be chosen).


Fig. 1. Map of the β-sheet. See the reference for the SheeP program (Aksianov and Alexeevski, 2012) for details. (A) Backbone of the β-sheet from PDB: 4IX9, chain A. Cα-atoms are indicated by spheres. Residue numbers are shown. Hydrogen bonds are indicated by dotted lines. Covalent and hydrogen bonds between backbone atoms form the cells of the table. (B) Simplified representation of the β-sheet, called a “sheet map”. The map is a table in which the rows correspond to strands and the cells correspond to amino acids. This scheme repeats the structure of covalent and hydrogen bonds between backbone atoms shown in panel A. Rows of the map are ordered from top to bottom. Columns are ordered from left to right. Note that the sheet orientation in panel A is arbitrary, so the order of rows and columns in the map can be flipped.


Every strand is characterized by its “N-to-C” direction, which is orthogonal to the “top-to-bottom” direction of the sheet. Strands can be directed “from left to right” or “from right to left”. Pairs of parallel strands have the same direction and pairs of antiparallel strands have different directions. Strands of the sheet are arranged from “top” to “bottom”. The arrangement of strands in closed β-barrels is circularly permutable.

Strands are denoted as A0+, B1−, etc. Letters A, B, ..., Z are the names of the corresponding sheets. Indexes indicate strand order from the “top” (index is 0) to the “bottom” (index is maximal) of the sheet. Signs indicate the direction of the strand in the sheet: (+) means “left-to-right”, (−) means “right-to-left”. Indexes 0, 1, 2, ..., n of the strands of any sheet can be changed to n, n−1, ..., 0 without changing the sense. Changing all the signs does not change the sense either. The notation “Ai–Bj” means that strands Ai and Bj are consecutive and Ai precedes Bj in the polypeptide chain. Fig. 2 illustrates this notation.

Fig. 2. Protein topologies and structural motifs. Strands are shown by arrows, helices by cylinders, loops by bold lines. (A) Topology of PDB: 4IX9, chain A. This structure is a β-α-sandwich, where a sheet of 3 strands is surrounded by 4 α-helices. Helix h0 is in contact with one side of the sheet, h1–h3 with the other one. Indexes of strands are ordered according to the arrangement of strands in the sheet (the strand drawn higher in the figure has the smaller index). Signs indicate the direction of the strands (in this structure all strands are directed from the left to the right side of the picture; they are marked by the sign +). Helices' indexes are ordered along the polypeptide chain. The order of SSEs in the polypeptide chain is A1+–h0–A0+–h1–A2+–h2–h3–A3+. The region A1+–h0–A0+ is a motif called “β-α-β-motif”. (B) The structure of PDB: 1UY0, chain A is a β-sandwich of two core sheets (A and B) with one small additional β-sheet named C. Strands of different sheets are indicated by colors. Sheets A and B are in contact by their surfaces. The order of strands in the polypeptide chain is A3+′–B4+–A4−–A3+″–B3−–C1+–A1+–B1−–B0+–A0−–C0−–B2+–A2−. Note that strands A3+′ and A3+″ are denoted by the same index. Strands B3−, A1+, B1−, B0+, A0−, B2+ and A2− form a motif called “jellyroll”; strands B4+, A4−, A3+″ and B3− form a “Greek key”. Strands C0+ and C1− are phantoms inserted between strands of the jellyroll motif (see Sections 2.1.4 and 3 for details).

Fig. 2A, SSEs of PDB: 4IX9, chain A:

| SSE | Group | Number in chain | SSE bounds |
|---|---|---|---|
| A0+ | β-sheet | 3 | Asn36 – Val39 |
| A1+ | β-sheet | 1 | Leu7 – Ala12 |
| A2+ | β-sheet | 5 | Ala65 – Leu68 |
| A3+ | β-sheet | 8 | Ala90 – Leu92 |
| h0 | Helix below the sheet | 2 | Glu14 – Leu22 |
| h1 | Helices over the sheet | 4 | Lys47 – Glu59 |
| h2 | Helices over the sheet | 6 | Gln71 – Asn76 |
| h3 | Helices over the sheet | 7 | Arg78 – Asp82 |

Fig. 2B, strands of PDB: 1UY0, chain A:

| Strand | Sheet | Number in chain | Strand bounds |
|---|---|---|---|
| A0− | A | 10 | Trp95 – Asn103 |
| A1+ | A | 7 | Ser54 – Ser63 |
| A2− | A | 13 | Trp120 – Lys129 |
| A3+′ | A | 1 | Met1 – Gln7 |
| A3+″ | A | 4 | Asn30 – Gly32 |
| A4− | A | 3 | Gln18 – Glu20 |
| B0+ | B | 9 | Pro79 – Ile86 |
| B1− | B | 8 | Gly68 – Glu74 |
| B2+ | B | 12 | Phe111 – Ala117 |
| B3− | B | 5 | Asp38 – Tyr42 |
| B4+ | B | 2 | Ser12 – Ser15 |
| C0− | C | 11 | Gly107 – His109 |
| C1+ | C | 6 | Val47 – Ile49 |

Fig. 3. Web interface of the MotAn program (http://mouse.belozersky.msu.ru/motan). Reports of MotAn on the structure PDB: 1LUC, chain A are shown. The upper links allow the user to run the Jmol viewer and to download the results in an XML-based format (XMAP), in HTML format (Reports) or as a script for visualization in RasMol/Jmol (Sayle and Milner-White, 1995; http://www.jmol.org/). The topologies of every motif and of the whole chain are given. Phantoms are indicated by grey color. All motif, strand and helix identifiers are clickable: they are used for visualizing the element in Jmol.

2.1.2. Sheet–sheet and sheet–helices contacts

In this work the “top-to-bottom” and “left-to-right” directions of contacting sheets are chosen correspondingly. For example, in Fig. 2B the “top” strands A0− and B0+ are on the same edge of the sandwich, as are the “bottom” strands. Also, the “left-to-right” directions of the two sheets are selected correspondingly (for example, strands A0+ and B0+ are parallel rather than antiparallel).

2.1.3. Topology and motifs

The topology of a protein (or of its subpart) is (1) the order of strands in sheets, denoted by the strands' names as described above, (2) the order of SSEs in the polypeptide chain and (3) the sides of all sheet–sheet and sheet–helices contacts. Two examples of topology are shown in Fig. 2A and B. The “top-to-bottom” and “left-to-right” directions of the β-sheet map can each be selected in two different ways, so there are several possible signatures for every topology. For example, for the structure PDB: 4IX9 (Fig. 2A) there are 4 possible signatures:

(1) A2+–h0–A3+–h1–A1+–h2–h3–A0+,
(2) A1+–h0–A0+–h1–A2+–h2–h3–A3+,
(3) A2−–h0–A3−–h1–A1−–h2–h3–A0− and
(4) A1−–h0–A0−–h1–A2−–h2–h3–A3−.

A motif is a subset (or several subsets) of consecutive strands and/or helices, and the loops between them, whose topology is widely spread in known protein structures. All motifs can be formally defined in terms of the strand and helix identifiers introduced above. Two motifs k (of SSEs k0, k1, etc.) and l (of SSEs l0, l1, etc.) are called concatenated if they share at least one SSE and every loop between SSEs ki and lj is a part of k or l.

MotAn detects the following motifs:

A. Simple motifs

a. β-hairpin: {$A_i^{+}$–$A_{i+1}^{-}$}.
b. β-α-β-motif: {$A_i^{+}$–$h_j$–$h_{j+1}$...–$A_{i+1}^{+}$}. Several helices are allowed. Can be either right- or left-handed.
c. Interlocks: motifs of two pairs of consecutive strands: {$A_{i+1}^{+}$–$B_{j+1}^{-}$ and $A_i^{-}$–$B_{j+2}^{+}$} (indexes j + 1 and j + 2 are used instead of j and j + 1, respectively, for unification with the jellyroll definition below). See (Kister et al., 2002) for the motif description. Can be either right- or left-handed (see the “Jellyroll” item below for an explanation).
d. Single sheet Greek key: {$A_{i-1}^{-}$–$A_{i+2}^{+}$–$A_{i+1}^{-}$–$A_i^{+}$} or {$A_i^{+}$–$A_{i+1}^{-}$–$A_{i+2}^{+}$–$A_{i-1}^{-}$}.
e. Greek key of two sheets: {$A_{i+1}^{+}$–$B_{j+1}^{-}$–$B_j^{+}$–$A_i^{-}$}.
f. β-helix: {$A_i^{+}$–$B_j^{+}$–$C_k^{+}$–$A_{i+1}^{+}$–$B_{j+1}^{+}$–$C_{k+1}^{+}$–...}. Helices of 2 or 3 sheets are currently allowed.

B. Concatenations of single motifs

a. β-meander: {$A_i^{+}$–$A_{i+1}^{-}$–$A_{i+2}^{+}$–...} (concatenation of hairpins {$A_i^{+}$–$A_{i+1}^{-}$}, {$A_{i+1}^{-}$–$A_{i+2}^{+}$}, ...).
b. β-α-β-helix: {$A_i^{+}$–$h_j$–$A_{i+1}^{+}$–$h_{j+1}$–$A_{i+2}^{+}$–...} (concatenation of β-α-β-motifs {$A_i^{+}$–$h_j$–$A_{i+1}^{+}$}, {$A_{i+1}^{+}$–$h_{j+1}$–$A_{i+2}^{+}$}, ...). Can be either right- or left-handed.
c. Jellyroll: two-stranded ribbon {...–$B_{j+3}^{-}$–$A_{i+1}^{+}$–$B_{j+1}^{-}$–$B_j^{+}$–$A_i^{-}$–$B_{j+2}^{+}$–$A_{i+2}^{-}$–...} rolled into a helix. Jellyrolls are concatenations of several interlocks and usually a Greek key (see Fig. S.2). In the current work at least one turn of the helix is required for a jellyroll. Can be either right- or left-handed. An interlock is called right-(left-)handed if its concatenation with its own copy gives a right-(left-)handed jellyroll.
d. Interlock with an additional strand: {$B_{j+3}^{-}$–$A_{i+1}^{+}$–$B_{j+1}^{-}$ and $A_i^{-}$–$B_{j+2}^{+}$}.
e. Interlock, concatenated with a Greek key: {$A_{i+1}^{+}$–$B_{j+1}^{-}$–$B_j^{+}$–$A_i^{-}$–$B_{j+2}^{+}$} (concatenation of the interlock {$A_{i+1}^{+}$–$B_{j+1}^{-}$; $A_i^{-}$–$B_{j+2}^{+}$} with a Greek key {$A_{i+1}^{+}$–$B_{j+1}^{-}$–$B_j^{+}$–$A_i^{-}$}).

Detailed schemes of all these motifs are available in the Supplementary materials (see Fig. S.1). Relations between interlocks and their concatenations with other interlocks and Greek keys are described in the Supplementary data and in Fig. S.2.
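Because the simple motifs are defined purely on strand and helix identifiers, the two simplest definitions can be checked directly on a signature. The following sketch is an illustration under stated assumptions, not the MotAn implementation: it uses a space-separated ASCII signature format and hypothetical function names, ignores handedness and phantoms, and treats index reversal and sign inversion as equivalent relabelings. Under these assumptions a β-hairpin is taken as two consecutive strands in neighboring rows of one sheet with opposite signs, and a β-α-β-motif as two strands in neighboring rows with equal signs separated in the chain by helices only.

```python
import re
from typing import List, Tuple

# One SSE: (name, index, sign); helices are named "h" and carry no sign.
SSE = Tuple[str, int, str]
TOKEN = re.compile(r"([A-Zh])(\d+)([+-]?)")


def parse(signature: str) -> List[SSE]:
    """Read a space-separated signature such as 'A1+ h0 A0+'."""
    sses = []
    for token in signature.split():
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"bad SSE token: {token!r}")
        sses.append((match[1], int(match[2]), match[3]))
    return sses


def fmt(sses) -> str:
    return " ".join(f"{name}{index}{sign}" for name, index, sign in sses)


def is_strand(sse: SSE) -> bool:
    return sse[0] != "h"


def neighbors(a: SSE, b: SSE) -> bool:
    """Same sheet and adjacent rows of the sheet map."""
    return a[0] == b[0] and abs(a[1] - b[1]) == 1


def hairpins(sses: List[SSE]) -> List[List[SSE]]:
    """Consecutive antiparallel strands in neighboring rows."""
    found = []
    for a, b in zip(sses, sses[1:]):
        if is_strand(a) and is_strand(b) and neighbors(a, b) and a[2] != b[2]:
            found.append([a, b])
    return found


def bab_motifs(sses: List[SSE]) -> List[List[SSE]]:
    """Parallel neighboring strands separated by one or more helices."""
    found = []
    for start, a in enumerate(sses):
        if not is_strand(a):
            continue
        end = start + 1
        while end < len(sses) and not is_strand(sses[end]):
            end += 1  # skip the helices that follow strand a
        if start + 1 < end < len(sses):  # at least one helix, then a strand
            b = sses[end]
            if neighbors(a, b) and a[2] == b[2]:
                found.append(sses[start:end + 1])
    return found


chain_4ix9 = parse("A1+ h0 A0+ h1 A2+ h2 h3 A3+")
for motif in bab_motifs(chain_4ix9):
    print(fmt(motif))
```

On the signature of PDB: 4IX9, chain A the sketch reports A1+ h0 A0+ and A2+ h2 h3 A3+; the pair A0+ and A2+ is rejected because rows 0 and 2 of the sheet are not adjacent.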

2.1.4. Core and phantoms

Many structures are composed of a core and several phantom elements. The core is an evolutionarily conserved part of the structure, shared between related proteins. Structural classifications such as SCOP, CATH and PCBOST are based on the structure of core elements. For example, the domain presented in Fig. 2B is annotated as a jellyroll-containing β-sandwich in both SCOP and CATH, so only sheets A and B were taken into account to create those annotations. We call such main structural elements a core. The small sheet C is not a part of the core; we call it a phantom. There is no formal definition of cores and phantoms. We use these terms to distinguish between those elements of the structure that can be a part of the structural motifs and those that can be small insertions in the loops of motifs. The algorithm of the motif detector must contain a detector for cores and phantoms, optimized to coincide with expert annotations (for example, from SCOP).

2.2. Algorithm

The MotAn algorithm proceeds through the following steps:

1. Secondary structural elements are detected by the SheeP program (Aksianov and Alexeevski, 2012). Its results contain the full information about strand boundaries, the arrangement of the strands in sheets and the order of strands in polypeptide chains. Sides of the sheets are also detected.

2. Contacts between sheets and/or helices are detected by a special program called ArchiP (see http://mouse.belozersky.msu.ru/archip), which will be described later. For every sheet–sheet or sheet–helix contact the contacting side(s) of the sheet(s) is (are) determined.

3. Some SSEs are marked as small. Small SSEs are expected to be phantoms.

4. All motifs are detected by enumeration. Small strands and helices are allowed in the loops between a motif's SSEs. They are marked as phantoms for this motif.

See Supplementary data for algorithm details.
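Step 4 can be illustrated on the strand order of PDB: 1UY0, chain A from Fig. 2B. The sketch below is a simplified example under explicit assumptions, not the MotAn enumeration itself: the phantom strands are supplied by hand (the criteria for “small” SSEs are given only in the Supplementary data), helices are omitted, the primes on A3 are dropped, and the names `Strand`, `find` and the two predicate functions are hypothetical. It searches for the motif “interlock, concatenated with a Greek key” among consecutive core strands.

```python
from typing import Callable, List, NamedTuple


class Strand(NamedTuple):
    sheet: str
    index: int
    sign: str


def parse(text: str) -> List[Strand]:
    """Read space-separated strand tokens such as 'A3+ B4+ A4-'."""
    return [Strand(t[0], int(t[1:-1]), t[-1]) for t in text.split()]


def greek_key_two_sheets(w: List[Strand]) -> bool:
    """{A(i+1)+ B(j+1)- B(j)+ A(i)-}, up to index reversal and sign inversion."""
    s1, s2, s3, s4 = w
    step = s1.index - s4.index
    return (
        s1.sheet == s4.sheet and s2.sheet == s3.sheet and s1.sheet != s2.sheet
        and abs(step) == 1 and s2.index - s3.index == step
        and s1.sign == s3.sign and s2.sign == s4.sign and s1.sign != s2.sign
    )


def interlock_with_greek_key(w: List[Strand]) -> bool:
    """{A(i+1)+ B(j+1)- B(j)+ A(i)- B(j+2)+}: a Greek key plus one more strand."""
    s1, s2, s3, s4, s5 = w
    step = s2.index - s3.index
    return (
        greek_key_two_sheets([s1, s2, s3, s4])
        and s5.sheet == s2.sheet
        and s5.index - s2.index == step
        and s5.sign == s3.sign
    )


def find(chain: List[Strand], size: int,
         motif: Callable[[List[Strand]], bool],
         is_phantom: Callable[[Strand], bool]) -> List[List[Strand]]:
    """Slide a window over the core strands; phantoms count as loop insertions."""
    core = [s for s in chain if not is_phantom(s)]
    windows = (core[k:k + size] for k in range(len(core) - size + 1))
    return [w for w in windows if motif(w)]


# Strand order of PDB: 1UY0, chain A, as listed in Fig. 2B (primes dropped).
CHAIN = parse("A3+ B4+ A4- A3+ B3- C1+ A1+ B1- B0+ A0- C0- B2+ A2-")

strict = find(CHAIN, 5, interlock_with_greek_key, lambda s: False)
relaxed = find(CHAIN, 5, interlock_with_greek_key, lambda s: s.sheet == "C")
print(strict)
print([f"{s.sheet}{s.index}{s.sign}" for w in relaxed for s in w])
```

With no phantoms allowed nothing is found, because strand C0− separates A0− from B2+ in the chain. Once the strands of sheet C are treated as phantoms, the five core strands A1+, B1−, B0+, A0− and B2+ become consecutive and match the definition.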

2.3. Material

All 3D structures were downloaded from the Protein Data Bank (Berman et al., 2003). The SCOP 1.75 database (Murzin et al., 1995) was used to determine domain boundaries. The PTGL database (May et al., 2010) was used for comparative testing; its content was downloaded from http://ptgl.uni-frankfurt.de/ptgl.html on June 07, 2013. Results of the PROMOTIF program (Hutchinson and Thornton, 1996) were obtained from the EBI website (http://www.ebi.ac.uk/pdbsum/) in October 2012.

Several SCOP divisions were used for testing (see Table 1). Structural motifs expected to be found in the domains of those families, superfamilies and folds were obtained from SCOP annotations. For comparison with the PTGL and ProMotif programs, subsets of those families containing only domains that span the whole chain were selected (this was done because in the PTGL and ProMotif databases motifs are assigned to a PDB chain, not to a SCOP domain).

3. Results

The MotAn program was written in the FreePascal language and is shared at http://mouse.belozersky.msu.ru/motan. The web service allows the user to upload a PDB file, to run the SheeP and ArchiP programs (to detect β-sheets and contacts between sheets and/or helices) on the uploaded structure and to detect structural motifs using MotAn. Results reported on the webpage can be visualized using the Jmol program or downloaded in XML format. A downloadable version of the program is also available. An output of the MotAn service is shown in Fig. 3.

3.1. Examples

In the structure PDB: 4IX9, chain A (see Fig. 2A) MotAn detects two β-α-β-motifs: {A1+–h0–A0+} and {A2+–h2–h3–A3+}. In the structure PDB: 1UY0, chain A (see Fig. 2B) MotAn detects a jellyroll {A3+″–B3−–A1+–B1−–B0+–A0−–B2+–A2−}. Strands C1+ and C0− are marked as phantoms.

3.2. Comparing with PTGL and ProMotif

Results of MotAn were compared with the PTGL database (May et al., 2010) for some SCOP families with defined structural motifs in their SCOP annotations. PTGL contains information about motifs found in PDB entries as of August 2009, so all SCOP domains (the last SCOP release was in June 2009) were analyzed by PTGL. See Table S.1 for the list of results for those families.

The PTGL database contains information about the PDB codes and chains where motifs were detected, without assignment of the detected motifs to SCOP domains. Because of this, a special subset of SCOP was selected for testing: the domains of this subset are the whole chain (whole-chain domains), not a region of it.

Results of testing on selected SCOP families are shown in Table 1. The lists of motifs that MotAn and PTGL detect differ. Because of this, the lists of motifs expected for a family in MotAn and PTGL results differ too. Several SCOP folds and families were selected for testing: these SCOP divisions contain hundreds of domains with exact annotations in SCOP, which allows an “expected” result to be assigned and used as a gold standard. For example, fold b.121 is described as “sandwich; 8 strands in 2 sheets; jelly-roll; some members can have additional 1–2 strands characteristic interaction between the domains of this fold allows the formation of fivefold and pseudo sixfold assemblies” in SCOP 1.75 and was used to test the sensitivity of jellyroll detection by MotAn and PTGL.

For every tested family the list of expected motifs was created from the SCOP annotation (or literature data). MotAn and PTGL results for the domains of those families were compared with this “gold standard”. Whenever expected results are defined for both MotAn and PTGL for a group of SCOP domains, MotAn is always more sensitive. (Note that motifs expected in MotAn and PTGL results for the same family can differ.)

The best MotAn and PTGL results were obtained for the Rossmann fold. MotAn detects approximately 100% of the motifs for this fold correctly, PTGL 86.6%. The worst MotAn results were obtained for the single sheet Greek key (20.4%, see Section 4 for an explanation of this effect), the worst PTGL results for up-and-down barrels (7.4%).

The only structural motifs (in terms of topology, as used in the current work) that ProMotif (Hutchinson and Thornton, 1996) detects are β-hairpins and β-α-β-motifs. Among 2311 whole-chain domains from the c.1 and c.2 folds, ProMotif detects β-α-β-motifs in 2257 domains (97.7%), while MotAn detects β-α-β-motifs or β-α-β-helices in 2299 (99.5%) domains. The other motifs that MotAn detects are not reported by the ProMotif algorithm.

MotAn results for some families are described in detail in the Supplementary data.

3.3. Motifs spread among SCOP domains

All SCOP domains were screened by MotAn. There are 4181 families in SCOP 1.75 that contain at least one domain from a non-obsolete PDB entry. 869 (20.8%) of those families are from the all-α class: they can contain minor β-structural elements, but most of their domains do not contain any structural motifs that MotAn detects. The most common structural motif was the β-hairpin: it was detected in 2932 (70.1%) of the families (including hairpins as a part of a meander). The rarest motifs are jellyrolls (found in 3.1% of families) and β-helices (found in only 2.1% of families; a lot of them are false positives). See Table S.3 for the results on all SCOP domains.

3.4. Searching for uncommon structural motifs

Some structural motifs that can be detected by MotAn were rarely observed in SCOP domains. Some of those motifs (for example, left-handed β-α-β-motifs) were previously known to be rare (Hutchinson and Thornton, 1996); the others (like parallel interlocks) were included in the MotAn algorithm because of their similarity to common ones. A search for uncommon motifs was performed. It was shown that those motifs are much rarer than common ones (see Table S.4). It must be noted that a lot of the detected uncommon motifs could be false hits.

Table 1. Comparing MotAn and PTGL results on selected SCOP families.

| SCOP fold or family | Expected MotAn report (a) | Expected PTGL report | No. of domains | No. of whole-chain domains | Correctly found by MotAn among all domains | Correctly found by MotAn among whole-chain domains | Correctly found by PTGL among the whole-chain domains |
|---|---|---|---|---|---|---|---|
| b.1 | Interlock (Kister et al., 2002) (Ac, Bc, Bd, Be) | Immunoglobulin fold | 7464 | 1141 | 6898 (92.4%) | 914 (80.1%) | 196 (17.2%) |
| b.121 | Jellyroll (Bc, open or close) | Jellyroll | 589 | 328 | 425 (72.2%) | 210 (64%) | 80 (24.4%) |
| b.43 | Single sheet Greek key (Ad) | – | 383 | 39 | 78 (20.4%) | 16 (41%) | – |
| b.3 | Greek key of two sheets (Ae, Bc (close), Be) | – | 688 | 368 | 425 (61.8%) | 191 (51.9%) | – |
| b.68 | β-Meander (Ba) | β-Propeller | 273 | 136 | 209 (76.6%) (b) | 90 (66.2%) (b) | 23 (16.9%) |
| f.4 | β-Meander (Ba) | Up-down-barrel | 137 | 121 | 136 (99.3%) | 120 (99.2%) | 9 (7.4%) |
| b.80, b.81 (c) | β-Helix (Af) | – | 288 | 189 | 287 (99.7%) | 188 (99.5%) | – |
| d.58 | – | α/β-plait | 2442 | 720 | – | – | 508 (70.6%) |
| c.1 | β-α-β-helix (Bb) | TIM-barrel | 4245 | 1805 | 4192 (98.8%) | 1789 (99.1%) | 962 (53.3%) |
| c.2 | β-α-β-motif (Ab, Bb) | Rossmann fold | 2365 | 506 | 2355 (99.6%) | 506 (100%) | 438 (86.6%) |
| d.15 | – | Ubiquitin | 967 | 317 | – | – | 172 (54.3%) |

(a) Some motifs can be detected as a single motif or as a part of a more complex one. Every motif is considered as detected by MotAn if at least one of the motifs listed in brackets (see the list of motifs in the “Topology and motifs” section) was reported by the program.
(b) Domains of the fold “b.68” are 6-bladed β-propellers. The topology of every blade is a β-meander. MotAn results were considered to be correct if 6 meanders were detected. There were no domains with more than 6 detected meanders.
(c) Only β-helices of 2 and 3 β-sheets were examined.
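The percentages quoted in Section 3.2 and in Table 1 are plain ratios of the number of domains in which the expected motif was found to the number of domains examined. The short check below recomputes a few of them from the counts given in the text; the function name `sensitivity` and the rounding to one decimal place are assumptions made for illustration.

```python
def sensitivity(found: int, total: int) -> float:
    """Percentage of domains in which the expected motif was found."""
    return round(100.0 * found / total, 1)


# (fold, whole-chain domains, found by MotAn, found by PTGL), counts from Table 1
ROWS = [
    ("b.121", 328, 210, 80),
    ("f.4", 121, 120, 9),
    ("c.2", 506, 506, 438),
]

for fold, total, motan, ptgl in ROWS:
    print(fold, sensitivity(motan, total), sensitivity(ptgl, total))

# ProMotif and MotAn on the 2311 whole-chain domains of folds c.1 and c.2
print(sensitivity(2257, 2311), sensitivity(2299, 2311))
```

The value 2311 is the sum of the whole-chain domain counts of folds c.1 (1805) and c.2 (506) in Table 1.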

4. Discussion

4.1. Comparing different algorithms for motif search

Several programs were developed previously for the task of motif search: PTGL (May et al., 2004, 2010), ProMotif (Hutchinson and Thornton, 1996), ProsMos (Shi et al., 2007), Tableaux (Konagurthu and Lesk, 2013) and SS3D-2P (Kato and Takahashi, 1997). The algorithms of those programs differ.

First, the task of motif search was formalized in different ways. The motif definitions implemented in MotAn are based on the arrangement of strands in β-sheets and their order along the polypeptide chain. These definitions reflect the human-friendly descriptions usually used in structural biology manuals, presentations and web sites.

In the other programs motif definitions are based on the approximation of strands and helices by vectors and the calculation of distances and angles between them. This approach does not use the information about β-sheets and the arrangement of strands in them. For example, in PTGL parallel and antiparallel pairing between strands is given by their spatial arrangement, and the task is to find a set of vectors with given constraints on the angles and distances between them. In these programs two (anti)parallel strands paired by hydrogen bonds cannot be distinguished from two neighboring non-paired strands (possibly from different sheets). For example, for the strand 84–89 (structure PDB: 1AD0, chain A) PTGL reports three antiparallel connections: with strands 4–7, 34–38 and 102–106. Two of these connections correspond to pairing between strands, while strand 4–7 is not really paired to 84–89; it is only located nearby. Those types of connections cannot be distinguished in the PTGL graph. In contrast, MotAn (like ProMotif) directly uses the information about pairing between strands given in the output of the secondary structure detector. Thus, the data used by these programs are less noisy than in the case of PTGL and similar programs.

Second, some programs, like ProsMos and Tableaux, take motif descriptions in a special format as their input and search for a similar structure in a given PDB file. The others, like PTGL and PROMOTIF, can reveal motifs from lists provided by the developers. MotAn uses the latter strategy. It allows the use of a special algorithm for searching for every motif instead of a generalized one, which increases sensitivity.

Third, unlike all other programs MotAn uses the SheeP program as a detector of β-sheets. It was shown (Aksianov and Alexeevski, 2012) that the results of SheeP are more adequate for structural annotation than the results of the commonly used programs DSSP (Kabsch and Sander, 1983) and STRIDE (Frishman and Argos, 1995). This allows MotAn to be more accurate in the detection of motifs than PTGL and other programs.

Alternatively, MotAn can use the DSSP, STRIDE (optionally, modified) and MakeSSP (Aksianov and Alexeevski, 2012) programs to detect secondary structures (β-sheet maps are created in any case). In some cases some of these algorithm variants demonstrate better sensitivity than the default one. For example, the Greek key in the structure PDB: 1EOB, chain A is detected only when STRIDE (modified) or MakeSSP is used to detect secondary structures. Such variability is not allowed by the other motif detectors.

The unique feature of SheeP is that every β-sheet is given as a holistic object (depicted by the sheet map) instead of a set of separate strands. Because of this, it is possible to introduce two directions on the sheet and interpret the input data in terms of strand arrangement in the sheet. This idea is formalized in the strand notation used in MotAn. For example, subsheets (see Supplementary data) were introduced according to this idea. They are used in a special procedure to detect interlocks and jellyrolls.

Last, the unique feature of the MotAn program is the core detector. All other programs use all strands of the structure as objects of the same type. MotAn classifies SSEs into two classes: core ones and phantoms. As shown in Fig. 2A, it is important to take into account only the core elements (see also Section 3.1). This feature strongly improves MotAn sensitivity.

4.2. Accuracy of motifs’ detection

MotAn's results show a strong correlation with SCOP annotations: for the different motifs sensitivity ranged from 20.4% to 100% (see Table 1). This means that MotAn's results correspond to manually determined structural motifs in proteins. As described in the “Results” section, MotAn is a more sensitive detector of motifs than PTGL.

The high sensitivity of MotAn rests on two unique features of the program. First, SheeP is used as the secondary structure detector instead of DSSP or STRIDE; SheeP was shown to be more accurate than either. Second, a special detector of core and phantom elements is implemented in MotAn. It allows the detection of structural motifs with small additional elements (called phantoms) in their loops.

Notably, failure to detect a motif in a structure can be attributed to at least two causes. In some cases it is a type II error of the program (a false-negative result), such as the missed Greek key in the structure PDB: 1EOB, chain A when the program is run with default parameters. In many other cases it is a consequence of uncertainty in the SCOP annotations. For example, we expect to detect a bab-helix in the structure PDB: 1Y8B, chain A because it is classified as a TIM-barrel in SCOP. In this particular structure, the regions that form the central barrel are not organized into a large rolled b-sheet as in the other structures of the family. The network of hydrogen bonds is highly disrupted. This is the reason why bab-motifs were not detected.

Based on such examples, we note that SCOP annotations are not an error-free “gold standard”. Nevertheless, they correlate with the presence of structural motifs in the structures, and MotAn results correlate with SCOP data, which supports our conclusion that MotAn is a sensitive detector.

Many SCOP folds do not contain any motif annotation in their description. MotAn detects many motifs in those folds. For example, domains of the all-a class can contain small b-strands in addition to their a-helical core. Those strands often form b-hairpins or bab-motifs. There is no way to check whether those detections are false positives or true positives except by careful manual inspection of all of them, which is infeasible because of the size of the task. For this reason, nothing is known about MotAn’s selectivity for the various motifs.

5. Conclusion

A detector of structural motifs in a given protein 3D structure, called MotAn, was developed and tested. It was shown that MotAn is a highly sensitive detector. It can be used to improve automatic procedures for structural classification.

Acknowledgments

Thanks to Andrei Alexeevski for discussion and Erik Hoogendoorn for help in text preparation. This work was partially supported by Russian Foundation for Basic Research Grants 13-07-00969 and 14-04-31709_mol_a.

E. Aksianov / Journal of Structural Biology 186 (2014) 62–67 67

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.jsb.2014.02.017.

References

Aksianov, E., Alexeevski, A., 2012. SheeP: a tool for description of beta-sheets in protein 3D structures. J. Bioinf. Comput. Biol. 10, 1241003.

Berman, H.M., Henrick, K., Nakamura, H., 2003. Announcing the worldwide Protein Data Bank. Nat. Struct. Biol. 10, 980.

Chakrabarti, S., Venkatramanan, K., Sowdhamini, R., 2003. SMoS: a database of structural motifs of protein superfamilies. Protein Eng. 16, 791–793.

Christine, A., Orengo, C.A., Thornton, J.M., 1993. Alpha plus beta folds revisited: some favoured motifs. Structure 1, 105–120.

Frishman, D., Argos, P., 1995. Knowledge-based protein secondary structure assignment. Proteins 23, 566–579.

Gordeev, A.B., Kargatov, A.M., Efimov, A.V., 2010. PCBOST: protein classification based on structural trees. Biochem. Biophys. Res. Commun. 397, 470–471.

Guruprasad, K., Prasad, M.S., Kumar, G.R., 2000. Database of structural motifs in proteins. Bioinformatics 16, 372–375.

Hutchinson, E.G., Thornton, J.M., 1996. PROMOTIF – a program to identify and analyze structural motifs in proteins. Protein Sci. 5, 212–220.

Kabsch, W., Sander, C., 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637.

Kato, H., Takahashi, Y., 1997. SS3D-P2: a three dimensional substructure search program for protein motifs based on secondary structure elements. Comput. Appl. Biosci. 13, 593–600.

Kister, A.E., Finkelstein, A.V., Gelfand, I.M., 2002. Common features in structures and sequences of sandwich-like proteins. Proc. Natl. Acad. Sci. USA 99, 14137–14141.

Konagurthu, A.S., Lesk, A.M., 2013. Structure description and identification using the tableau representation of protein folding patterns. Methods Mol. Biol. 932, 51–59.

May, P., Barthel, S., Koch, I., 2004. PTGL – a web-based database application for protein topologies. Bioinformatics 20, 3277–3279.

May, P., Kreuchwig, A., Steinke, T., Koch, I., 2010. PTGL: a database for secondary structure-based protein topologies. Nucleic Acids Res. 38, D326–D330.

Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., Thornton, J.M., 1997. CATH – a hierarchic classification of protein domain structures. Structure 5, 1093–1108.

Sayle, R.A., Milner-White, E.J., 1995. RASMOL: biomolecular graphics for all. Trends Biochem. Sci. 20, 374.

Shi, S., Zhong, Y., Majumdar, I., Sri Krishna, S., Grishin, N.V., 2007. Searching for three-dimensional secondary structural patterns in proteins with ProSMoS. Bioinformatics 23, 1331–1338.

Zhang, C., Kim, S.H., 2000. The anatomy of protein beta-sheet topology. J. Mol. Biol. 299, 1075–1089.