Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid...

12
“JET2DNA” — 2019/8/22 — 11:28 — page 1 — #1 Published online 0000 bioRxiv, 0000, .... 00, .... 00 1–12 doi:.../.../gkn000 Multiple protein-DNA interfaces unravelled by evolutionary information, physico-chemical and geometrical properties. F. Corsi 1,2 , R. Lavery 3 , E. Laine 1 * , A. Carbone 1,4 * 1 Sorbonne Universit ´ e, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), 75005 Paris, France. 2 Sorbonne Universit ´ e, CNRS, Institut des Sciences, du Calcul et des Donn´ ees (ISCD), 75005 Paris, France. 3 Lyon University, CNRS, IBCP, UMR 5086, Molecular Microbiology and Structural Biochemistry, 69367 Lyon, France. 4 Institut Universitaire de France, 75005 Paris, France. ABSTRACT The usage made of protein surfaces by nucleic acids still remains largely unknown, due to the lack of available structural data and the inherent complexity associated to protein surface deformability and evolution. In this work, we present a method that contributes to decipher such complexity by predicting protein-DNA interfaces and characterizing their properties. It relies on three biologically and physically meaningful descriptors, namely evolutionary conservation, physico-chemical properties and surface geometry. We carefully assessed its performance on several hundreds of protein structures. We achieve a higher sensitivity compared to state-of-the-art methods, and similar precision. Importantly, we show that our method is able to unravel ’hidden’ binding sites by applying it to unbound protein structures and to proteins binding to DNA via multiple sites and in different conformations. It is implemented as a fully automated tool, JET 2 DNA , freely accessible at: http://www.lcqb.upmc.fr/JET2DNA. We also provide a new reference dataset of 187 protein- DNA complex structures, representative of all types of protein-DNA interactions, along with a subset of associated unbound structures: http://www.lcqb.upmc.fr/PDNAbenchmarks. INTRODUCTION Interactions between proteins and DNA play a fundamental role in essential biological processes (1) and their impairement is associated with human diseases (2, 3). Thus, they represent increasingly important therapeutic targets. The experimental determination of protein-DNA complexes is a costly and time consuming process (<5000 protein-DNA complex structures available in the Protein Data Bank (4), release 2019). This has motivated the development of a large number of computational * To whom correspondence should be addressed. Email: [email protected] and [email protected] methods, most of them based on machine learning, to predict DNA-binding residues (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24). Both sequence- based properties (conservation, amino acid type, predicted secondary structure, solvent accessibility and disorder) and structural properties (electrostatic potential, dipole moment, surface shape and curvature, structural motifs, secondary structure, amino acid microenvironment, solvent accessibility and hydrogen-bonding potential) have been considered. Among those, amino acid composition is one of the most powerful feature. Indeed, positively charged and polar amino acids are largely over-represented in DNA-binding sites, in order to counterbalance the excess of negative charge coming from the DNA phosphate groups (25). Methods using a large body of features can achieve very high accuracy on known binding sites but generally lack interpretability. Our understanding of protein-DNA interactions is hampered by the fact that protein surfaces are complex dynamical objects which may interact with several DNA molecules at the same time or at different moments but via the same region (26, 27, 28), accommodate indifferently DNA and RNA (29) and undergo substantial conformational changes upon binding. A protein may harbour multiple DNA-binding sites on its surface, each of which may have a different role with a different level of importance in the accomplishment of the protein’s function (28, 30) and thus display particular properties. The relatively small amount of available structural data gives us only a glimpse of the many ways nucleic acid molecules use protein surfaces. This calls for the development of tools able to comprehensively identify and characterize protein-DNA interfaces and help decipher protein surface complexity. Here, we present JET 2 DNA , a new method for predicting DNA-binding sites at protein surfaces based on a few biologically and physically meaningful parameters. Specifically, we use evolutionary conservation, physico- chemical properties and local/global geometry. These descriptors recently proved useful to identify and characterize c 0000 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. . CC-BY-NC-ND 4.0 International license available under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which was this version posted August 22, 2019. ; https://doi.org/10.1101/743617 doi: bioRxiv preprint

Transcript of Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid...

Page 1: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 1 — #1 ii

ii

ii

Published online 0000 bioRxiv, 0000, .... 00, .... 00 1–12doi:.../.../gkn000

Multiple protein-DNA interfaces unravelled byevolutionary information, physico-chemical andgeometrical properties.F. Corsi 1,2, R. Lavery 3, E. Laine 1 ∗, A. Carbone 1,4 ∗

1 Sorbonne Universite, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative (LCQB),75005 Paris, France.2 Sorbonne Universite, CNRS, Institut des Sciences, du Calcul et des Donnees (ISCD), 75005 Paris, France.3 Lyon University, CNRS, IBCP, UMR 5086, Molecular Microbiology and Structural Biochemistry, 69367 Lyon, France.4 Institut Universitaire de France, 75005 Paris, France.

ABSTRACT

The usage made of protein surfaces by nucleicacids still remains largely unknown, due tothe lack of available structural data and theinherent complexity associated to protein surfacedeformability and evolution. In this work, wepresent a method that contributes to decipher suchcomplexity by predicting protein-DNA interfacesand characterizing their properties. It relies onthree biologically and physically meaningfuldescriptors, namely evolutionary conservation,physico-chemical properties and surface geometry.We carefully assessed its performance on severalhundreds of protein structures. We achieve a highersensitivity compared to state-of-the-art methods,and similar precision. Importantly, we show thatour method is able to unravel ’hidden’ binding sitesby applying it to unbound protein structures andto proteins binding to DNA via multiple sites andin different conformations. It is implemented asa fully automated tool, JET2

DNA, freely accessibleat: http://www.lcqb.upmc.fr/JET2DNA. We alsoprovide a new reference dataset of 187 protein-DNA complex structures, representative of alltypes of protein-DNA interactions, along witha subset of associated unbound structures:http://www.lcqb.upmc.fr/PDNAbenchmarks.

INTRODUCTION

Interactions between proteins and DNA play a fundamentalrole in essential biological processes (1) and their impairementis associated with human diseases (2, 3). Thus, they representincreasingly important therapeutic targets. The experimentaldetermination of protein-DNA complexes is a costly and timeconsuming process (<5000 protein-DNA complex structuresavailable in the Protein Data Bank (4), release 2019). This hasmotivated the development of a large number of computational

∗To whom correspondence should be addressed. Email: [email protected] and [email protected]

methods, most of them based on machine learning, to predictDNA-binding residues (5, 6, 7, 8, 9, 10, 11, 12, 13, 14,15, 16, 17, 18, 19, 20, 21, 22, 23, 24). Both sequence-based properties (conservation, amino acid type, predictedsecondary structure, solvent accessibility and disorder) andstructural properties (electrostatic potential, dipole moment,surface shape and curvature, structural motifs, secondarystructure, amino acid microenvironment, solvent accessibilityand hydrogen-bonding potential) have been considered.Among those, amino acid composition is one of the mostpowerful feature. Indeed, positively charged and polar aminoacids are largely over-represented in DNA-binding sites, inorder to counterbalance the excess of negative charge comingfrom the DNA phosphate groups (25). Methods using a largebody of features can achieve very high accuracy on knownbinding sites but generally lack interpretability.

Our understanding of protein-DNA interactions ishampered by the fact that protein surfaces are complexdynamical objects which may interact with several DNAmolecules at the same time or at different moments but viathe same region (26, 27, 28), accommodate indifferentlyDNA and RNA (29) and undergo substantial conformationalchanges upon binding. A protein may harbour multipleDNA-binding sites on its surface, each of which may havea different role with a different level of importance in theaccomplishment of the protein’s function (28, 30) and thusdisplay particular properties. The relatively small amount ofavailable structural data gives us only a glimpse of the manyways nucleic acid molecules use protein surfaces. This callsfor the development of tools able to comprehensively identifyand characterize protein-DNA interfaces and help decipherprotein surface complexity.

Here, we present JET2DNA, a new method for predicting

DNA-binding sites at protein surfaces based on a fewbiologically and physically meaningful parameters.Specifically, we use evolutionary conservation, physico-chemical properties and local/global geometry. Thesedescriptors recently proved useful to identify and characterize

c© 0000 The Author(s)This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 2: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 2 — #2 ii

ii

ii

2 bioRxiv, 0000, Vol. 00, No. 00

protein-protein interfaces (31, 32). JET2DNA predicts

surface patches following the support-core-rim modelfor experimental protein-protein interfaces, where interactingresidues are classified based on their structural role in theinteraction (33). To the best of our knowledge, this is thefirst time that this model is used for studying protein-DNAinterfaces. JET2

DNA implements three different scoringstrategies, aimed at detecting different types of protein-DNAinterfaces and different subregions of a given interface.

We assessed JET2DNA performance on two new

representative benchmarks of high-resolution structures,comprised of 187 protein-DNA complexes (HR-PDNA187)and 82 associated bound-unbound pairs (HOLO-APO82),and on an independent test set of 82 nucleic acid bindingproteins (TEST-NABP82), taken from (18). We showthat JET2

DNA predictions are very accurate and robust toprotein conformational and stoichiometry changes associatedto DNA binding. In addition, rigorous comparison withthree established machine learning based methods, namelyDISPLAR (7), multiVORFFIP (15) and DRNApred (18)demonstrates the better predictive power of JET2

DNA on alldatasets. Specifically, JET2

DNA is able to significantly detectmore interacting residues while retaining similar precisionand to unravel ’hidden’ alternative binding sites. We showhow one can learn about the origins and specificities ofdifferent types of protein-DNA interfaces through directinterpretation of JET2

DNA predictions. With this work, wepave the way to the discovery of yet unknown binding sites,opening up new perspectives for drug design.

We provide the structures and the experimentallyknown DNA-binding sites for our two newly createddatasets, HR-PDNA187 and HOLO-APO82 at:http://www.lcqb.upmc.fr/PDNAbenchmarks/. They arerepresentative of all known protein-DNA interactions and canbe used as benchmark sets for evaluating DNA-binding sitepredictors and DNA-protein docking methods. The code ofJET2

DNA is available at: http://www.lcqb.upmc.fr/JET2DNA.

MATERIALS AND METHODS

DatasetsHR-PDNA187. The complete list of 1257 protein-doublestrand DNA complexes determined by X-ray crystallographywith a resolution better than 2.5A was downloaded from theNucleic Acid Database (34) (February 2016 release). This listwas filtered using PISCES sequence culling server (35) todefine a set of 222 protein-DNA complexes non-redundant at25% sequence identity, with an R-factor lower than 0.3 andwith at least one protein chain longer than 40 amino acids. Thecomplexes’ 3D structures were downloaded from the PDB (4).

Subsequently, the dataset was manually curated to ensure itsgood quality. We removed entries where: (1) the asymmetricunit did not contain at least one biological unit or (2) theDNA molecule was single-stranded or contained less than5 base pairs or (3) only Cα atoms were present. Moreover,only chains with more than one contact with the DNA wereretained. When the biological unit contained multiple copiesof the protein-DNA complex, only one copy was kept. Thestructure 4hc9 was excluded because it displays differentDNA-binding sites in the asymmetric and biological units.

The complex 4aik was substituted with the 100% homolog4aij, where the DNA-binding site is twice bigger. Finally,we removed the only antibody present in the dataset (3vw3),since this class of proteins have very peculiar characteristicsand should be treated separately. In total, we retained 187complexes (Supp. Table S1).

HR-PDNA187 covers all major groups of DNA-proteininteractions according to Luscombe et al. classification (1):helix-turn-helix (HTH), zinc-coordinating, zipper type, otherα-helical, β-sheet, β-hairpin/ribbon, other. It spans a widerange of functional classes: it comprises 100 enzymes, 78regulatory proteins, 7 structural proteins, 1 protein with otherfunction and 1 unclassified protein (Supp. Table S1).

HOLO-APO82. We collected all available X-ray structuresof the APO forms of the proteins from HR-PDNA187. Weused the blastp program from the BLAST+ package (36)from NCBI with a threshold of 95% for seq. id., 10−3 for theE-value, a percentage of coverage ≥ 70% and a percentage ofgaps ≤ 10% with respect to the query protein chain. Amongthe structures matching these criteria, only the ones having thesame UniProt code (37) or belonging to the same organism asthe query sequence were retained. If several structures passedall filters, only the closest one to the query or the one withthe highest resolution was chosen. We found unbound formsfor 81 complexes. For the complex 2isz, the unbound proteinwas found as a monomer and also as a dimer in two differentPDB structures. Both structures were retained as they wereboth reported as present in equilibrium (38). The resultinglist comprises a total of 82 HOLO(bound)-APO(unbound)pairs. Within each pair, the APO form may be in the sameoligomeric state as the HOLO form or may have fewer chains(Supp. Table S1).

TEST-NABP82. We used an additional recent dataset (18)of 82 proteins bound to nucleic acids, with a resolution betterthan 2.5 A. Among these, 49 proteins were solved boundto DNA (TEST-DBP49) and 33 to RNA (TEST-RBP33).25 DNA-binding proteins share more than 30% sequenceidentity with proteins from HR-PDNA187. To fairly assessJET2

DNA performance, they were removed from the dataset(TEST-DBP24).

Definition of interface residuesFor each bound form from HR-PDNA187, we calculatedthe residues relative accessible surface areas in presence(rasaDNA) and absence (rasafree) of DNA, usingNACCESS 2.1.1 (39) with a probe size of 1.4A. Interfaceresidues were defined as those showing any change in theirrasa upon binding (∆rasa>0). They were classified inthree structural components (33): support residues are buriedin presence (rasaDNA<0.25) and in absence of DNA(rasafree<0.25); core residues are exposed in absence ofDNA (rasafree≥0.25) and become buried upon binding(rasaDNA<0.25); rim residues are exposed in presence(rasaDNA≥0.25) and in absence (rasafree≥0.25) of DNA(Fig. 1a).

The interfaces for TEST-NABP82 were directly taken from(18). They were defined from the PDB entries comprised inthe dataset and also PDB structures of identical or similar

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 3: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 3 — #3 ii

ii

ii

bioRxiv, 0000, Vol. 00, No. 00 3

complexes (seq. id. ≥80% and TM-score≥0.5) (40, 41).This allows to account for interface variability coming frommolecular flexibility. Nevertheless, we should stress thatthese interfaces were defined using a very stringent distancecriterion of 3.5 A, and hence they are substantially smallerthan those we defined from HR-PDNA187. We used themto fairly assess JET2

DNA performance on an independentdataset, not created by us, and to fairly compare JET2

DNA withDRNApred (18), which was evaluated on them.

Residue descriptorsEvolutionary conservation. Conservation levels are computedusing the Joint Evolutionary Trees (JET) method (42). Thismeasure is inspired but different from the evolutionary traceintroduced in (43, 44). Briefly, the algorithm performs a PSI-BLAST search (45) to retrieve a set of sequences homologousto the query. These sequences are then sampled by a Gibbs-like approach and from each sample a phylogenetic tree isconstructed (31, 42). From each tree, a tree trace is computedfor each position in the query sequence: it corresponds tothe tree level where the amino acid at this position appearedand remained conserved thereafter (31, 42). Tree traces areaveraged over all trees to get more statistically significantvalues, denoted as TJET, which vary in the interval [0,1].

Physico-chemical properties. Propensities specific to everyamino acid to be located at a protein-DNA interface (PCDNA)were taken from (46) (see Supp. Fig. S1). The original values,ranging from 0 to 2.534, were scaled between 0 and 1.

Circular variance. This geometrical measure gives thevectorial distribution of a set of neighboring points, within asphere of radius rc, around a fixed point in 3D space (47, 48).Given a residue, it reflects the protein density around it. A lowCV value (close to 0) indicates that a residue is located in aprotruding region of the protein surface, while a high value(close to 1) indicates it is buried within the protein. Varyingradius rc allows describing the local (CVlocal, rc=12A) andthe global (CVglobal, rc=100A) geometry of the surface.

JET2DNA workflow

The JET2DNA method requires as input a protein query

sequence for which three-dimensional (3D) structuralcoordinates are available in the PDB. It computes TJET,PCDNA, CVglobal and CVlocal values and it combines themto assign a score to each residue. Different combinations areimplemented in three different scoring schemes (D-SC inFig. 1b) designed to detect different types of protein-DNAinterfaces (see Results). The computed scores are used to rank,select and cluster protein residues. The clustering algorithm isadapted from the protein-protein interface prediction methodJET/JET2 (31, 42). First, highly scored residues proximalin 3D space are clustered together to form seeds. Then, theseeds are extended by progressively adding highly scoredneighboring residues. At this stage, if two predicted patchesare in contact, they will be merged. Finally, the prediction iscompleted with an outer layer of residues. The seed, extensionand outer layer (Fig. 1a, in red, orange and cyan) approximatethe support, core and rim (Fig. 1a, in yellow, brown and

green) detected in experimental interfaces. For a moredetailed description of the 3-steps clustering procedure seeSupp. Table S2. JET2

DNA implements an automated procedureto choose the most appropriate scoring scheme for the studiedsystem (Fig. 1c; see also Automated clustering procedure).Alternatively, the user can manually choose a scoring scheme.The user has the possibility to complement the predictionswith another round of the clustering procedure using acomplementary scoring scheme (Fig. 1c; see also Completeclustering procedure). To get more robust predictions, severaliterations of JET2

DNA can be run (iJET2DNA; see also Iterative

mode). A schematic representation of the JET2DNA pipeline

is reported in Supp. Fig. S2. In the following, we explain indetails the new criteria and new procedures implemented inJET2

DNA, compared to JET2 (31).

Modification of the expected size of the interface. Ateach step of the clustering procedure, JET2

DNA usestwo thresholds, namely score

layerres and score

layerclus , that

respectively determine which residues should be consideredas candidates for the predicted patches and when the processof growing each patch should be stopped. The set up of thesethreshold is based on the expected relative size of a protein-DNA interface, fDNA

intfrac(x). fDNAintfrac(x) was determined by

plotting the percentage of interface residues versus the totalnumber of surface residues for HR-PDNA187 (Supp. Fig. S3,circles). The function that best approximates our data isfDNAintfrac(x)=(2.66/

√x)+0.03 (Supp. Fig. S3, solid line),

where x is the number of protein surface residues. Comparedto protein-protein interfaces (31), DNA-binding sites cover alarger portion of the protein surface (Supp. Fig. S3, comparedotted and solid lines).

Dynamic set up of the residue and cluster thresholds.Contrary to JET2 algorithm, where scorelayerres and scorelayerclusthresholds were fixed at the beginning of the clusteringprocedure and could not be changed, in JET2

DNA we decided toinitially fix them more stringently and relax them in a secondstage, if the prediction is smaller than two thirds (<70%) ofthe expected size (Supp. Fig. S2, red). This dynamic set uplimits false positives in cases where surface regions outsidebut close to the interface display a detectable signal. It shouldbe noted that the thresholds in JET2

DNA, even relaxed, remainstricter than the ones in JET2. Details about threshold valuesused in JET2

DNA are given in Supp. Table S3.

Avoiding small-ligand binding pockets. Like protein-proteininteractions, protein-DNA interactions are often mediatedor regulated by small ligands. As a result, a significantnumber of protein-protein and protein-DNA interfaces areclose to or overlapping small ligand-binding pockets. Thesepockets are generally very conserved (e.g. active sites ofenzymes) (31) and may thus confound the prediction. InJET2, we resolved the issue by designing a specific scoringscheme exploiting the fact that small ligand-binding pocketsare more deeply buried than protein-protein interfaces (seeSC2 in Fig. 2 from (31)). The specific detection of protein-DNA interfaces is more difficult, as these interfaces aremore concave than protein-protein interfaces (compare 1-CV

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 4: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 4 — #4 ii

ii

ii

4 bioRxiv, 0000, Vol. 00, No. 00

rim

core

support

extension

outer layer

seed

(TJET + (1 - CVlocal)) / 2

(TJET + PCDNA) / 2

(TJET + CVglobal )/ 2

(PCDNA + (1 - CVlocal))/ 2

D-SC1

D-SC2

D-SC3

1st Round 2nd Round

Detect Seeds

else if ⟨T JET(Seeds)⟩ > 0.6 and ⟨PCDNA(Seeds )⟩ < 0.9

if ⟨T JET(Seeds) ⟩ < 0.3

else

D-SC1 +

D-SC2

D-SC3

+

+ D-SC1

D-SC3

D-SC3

(a)

(b) (c)

General case

Concave interfaces

Lowly conserved

rim

core

support

outer layer

seed

extension

Figure 1. Experimental interface definition, JET2DNA scoring schemes and complete automated clustering procedure. (a) Top, sections of two experimental

interfaces (on the left, PDB code: 1JE8; on the right, PDB code: 1D02). Bottom, the corresponding JET2DNA prediction using D-SC2. The experimental and

predicted interface residues are displayed in opaque surfaces: support, core and rim are in yellow, brown and green, respectively; cluster seed, extension andouter layer are in red, orange and cyan, respectively; (b) Schematic representation of the three scoring schemes provided in JET2

DNA. TJET: conservation level,PCDNA: protein-DNA interface propensities, CVlocal and CVglobal: local and global circular variance computed with a radius of 12 A and 100 A , respectively.Different colors correspond to different combinations of properties used to predict interface residues in the three steps of the clustering procedure. (c) Schematicrepresentation of the complete automated JET2

DNA clustering procedure.

boxplots in Fig. 2). To tackle the problem, we implementeda procedure in JET2

DNA that redefines seeds when they aretoo buried. Specifically, if a seed comprises a significantproportion (>20% for D-SC1, >30% for D-SC2) of highlyburied residues, then the clustering procedure restarts avoidingsuch residues (Supp. Fig. S2, blue). The procedure does notapply to D-SC3 as this scoring scheme specifically selectsprotruding/exposed residues. Residues were considered ashighly buried if their CVlocal was higher than CVlocal

high =0.9. They represent less than 3% of the protein surface(Supp. Fig. S8).

Filtering out putative false positive clusters. In JET2, smallpatches were filtered out based on the comparison betweentheir size and the size distribution of randomly generatedpatches (42) after the seed or extension steps, dependingon the scoring scheme used. In the case of protein-DNAinterfaces, we observed that this procedure removed too manytrue positives. Hence, we modified it in JET2

DNA. Namely,we first detect all possible seeds and extend them. Then, ifthe patches represents more than two thirds (>70%) of theexpected size of the interface, we iteratively filter them startingfrom the smallest one (Supp. Fig. S2, green). To eliminateaberrant predictions, clusters composed of 1 or 2 residues arestill systematically filtered out.

Automated clustering procedure.The implemented algorithm is described in Fig. 1 andSupp. Table S4. By default, JET2

DNA first detects seedsusing D-SC1 (TJET+ PCDNA). If these display a very lowconservation signal (average TJET<0.3) then the strategy isto exploit the other two descriptors and look for locallyprotruding residues that satisfy expected physico-chemicalproperties (D-SC3). Otherwise, the algorithm analyses thephysico-chemical and the geometrical properties of thedetected seeds. If the seeds are in a very concave region(CVglobal >0.6) and do not display highly favorable physico-chemical properties (PCDNA <0.9), then the algorithmswitches from D-SC1 to D-SC2, where CVglobal is employedto accurately detect an ”enveloping” (concave) interface.Otherwise, physico-chemical properties are considered thedriving force for an accurate prediction of the interface.

Complete clustering procedure.A complete procedure is available for both manual andautomated clustering procedure and is described in Fig. 1 andSupp. Table S4 (see also Supp. Fig. S2, yellow). It consists incomplementing the main clusters predicted in the first roundby secondary clusters detected by another scoring scheme. D-SC3 will be used in the second round if D-SC1 or D-SC2

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 5: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 5 — #5 ii

ii

ii

bioRxiv, 0000, Vol. 00, No. 00 5

were chosen as main scoring schemes. D-SC1 will be thecomplementary one when D-SC3 is used in the first round.

Iterative mode.Multiple JET2

DNA runs may lead to slightly differentpredictions, due to the Gibbs sampling of the sequences. Toget more robust predictions, JET2

DNA can be run in an iterativemode of the program, which we call iJET2

DNA. In this way, wecan compute the number of times a given residue is detectedin a cluster divided by the total number of runs. The resultwill be a number comprised between 0 and 1 and it reflects theprobability of the given residue to be at an interface.

Evaluation of performancesWe used six standard measures of performance:Sens= TP

TP+FN ; PPV = TPTP+FP ; F1= 2·Sens·PPV

Sens+PPV ;

Spe= TNTN+FP ; Acc= TP+TN

TP+FN+TN+FP ;

MCC= TP ·TN−FP ·FN√(TP+FP )(TP+FN)(TN+FP )(TN+FN)

where TP (true positives) are the number of residuescorrectly predicted as interacting, TN (true negatives) are thenumber of residues correctly predicted as non-interacting, FP(false positives) are the number of non-interacting residuesincorrectly predicted as interacting and FN (false negatives)are the number of interacting residues incorrectly predictedas non-interacting. To assess the statistical significance of thedifferences in performance between each pair of methods, werelied on the paired t-test when the distributions were normaland on the Wilcoxon signed-rank test (49) otherwise. To verifyif the data were normally distributed we used the Anderson-Darling test (50). The difference was considered significantwhen the p-value was lower than 0.05.

Choice of the other methodsTo compare JET2

DNA performance, we considered a large setof popular DNA-binding site predictors (5, 6, 7, 8, 9, 10, 11,12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 51, 52). Amongthose, we discarded BindN (8), BindN+ (9), BindN-RF (12),MetaDBsite (51), PreDs (21), DBindR (13), PreDNA (10),DBD-Hunter (23), RBscore (20, 52) and PDNAsite (16) astheir web servers are no longer available or do not work. Wefinally retained three methods with a relatively short runtime,namely DISPLAR (7), multiVORFFIP (15) and DRNApred(18). DISPLAR was reported to show better performancethan DP-Bind (11) and is more recent than DBS-Pred (6)and DBS-PSSM (5). multiVORFFIP was chosen becauseit showed good performance on protein-protein interfacesprediction (31). DRNApred is a very recent predictor thatuses only sequence information and which aims at specificallydetecting DNA-binding residues, discriminating them fromRNA-binding residues. We ran the three tools with the defaultparameters. Since multiVORFFIP does not provide binaryprediction results (binding versus non-binding residues),predicted residues were defined by those with a normalizedscore (or probability) > 0.5.

RESULTS

Support-Core-Rim vs Core-Support-Rim model forprotein-DNA interactionsTo identify characteristic features of protein-DNA interfaces,we analysed the non-redundant set of protein structuresbound to DNA available in the PDB. This dataset, HR-PDNA187, comprises 187 structures, covers all major typesof DNA-interactions and spans a wide range of proteinfunctions (see Materials and Methods). Interface residueswere classified based on the support-core-rim model (Fig. 1a,on top) proposed for protein-protein interfaces (33). Thesupport (in yellow) comprises residues buried in presenceand absence of DNA, the core (in brown) comprises residuesexposed in absence of DNA and becoming buried uponbinding, and the rim (in green) comprises residues exposed inpresence and absence of DNA (see Materials and Methods). Inprotein-protein interfaces, the three components are spatiallyorganized in concentric layers, with the support at the center,the core in an intermediate position and the rim on the externalborder (see Fig. 2 in (31)). In protein-DNA interfaces, inaddition to this spatial organization (Fig. 1a, top left), wealso observe an organization where the support and the coreswitch positions (Fig. 1a, top right). This variety reflects thedifferent ways a protein may bind to DNA. Nevertheless, inboth organizations, the core plays the same physical role bystacking into the DNA grooves, while support residues tend toaccomodate the DNA backbones.

Characteristics of protein-DNA interfacesTo characterize experimental protein-DNA interfaces, werelied on evolutionary conservation (TJET), DNA-bindingpropensity (PCDNA) and local and global burial degree(CVlocal and CVglobal, see Materials and Methods for precisedefinitions of the descriptors). Specifically, we computed thepercentage of residues located in the support, core and rim,displaying higher values than the median computed over thewhole protein surface (Fig. 2a-b). The resulting distributionswere compared with those obtained for 176 protein-proteincomplexes (53) (Fig. 2c).

DNA binding sites, especially the support and the core,are significantly more conserved than the rest of the protein(Fig. 2a, TJET). This conservation signal is stronger than inprotein-protein interfaces (compare Fig. 2a and c). Residueswith high DNA-binding propensities tend to be located inthe core and rim (Fig. 2a, PCDNA). By contrast, residuesdisplaying physico-chemical properties favourable to protein-protein binding are mainly found in the support, and toa lesser extent in the core, of protein-protein interfaces(Fig. 2c, PCprot). These different geometrical distributionsreflect differences between the two PC scales (Supp. Fig. S1).Protein-DNA interfaces are enriched in positively chargedand polar residues which tend to prefer regions exposedto the solvent and preferentially bind the negative DNAbackbone in the most external parts of the interface(46). By contrast, protein-protein interfaces are enriched inhydrophobic residues which prefer to be located towards theinterior of the interface. Regarding the surface geometry,the core and rim of protein-DNA interfaces are enriched inlocally protruding residues (Fig. 2a, 1-CVlocal), as observed

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 6: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 6 — #6 ii

ii

ii

6 bioRxiv, 0000, Vol. 00, No. 00

(a) (b) (c)

Figure 2. Signals detected in experimental interfaces. The boxplots represent the distributions of the proportions of interacting residues having values abovethe median value computed over the entire protein surface. TJET: conservation level, PCDNA: protein-DNA interface propensities, PCprot: protein-protein interfacepropensities, CVlocal and CVglobal: local and global circular variances computed with a radius of 12 A and 100 A, respectively. Distributions are computed fromHR-PDNA187 on: (a) all protein-DNA interfaces, (b) all polymerases interfaces, (c) all protein-protein interfaces (31). The support, core and rim are in yellow,brown and green, respectively.

for protein-protein interfaces (Fig. 2c). When looking at thestatus of the interface with respect to the global shape ofthe protein surface (1-CVglobal), we found that the enzymes,and in particular the polymerases and some nucleases, displaya very specific profile (compare Fig. 2a and 2b). All threeinterface components, especially the support and the rim, arelocated in concave protein regions (Fig. 2b), indicating thatthese proteins bind to DNA by ”enveloping” it.

Overall, this analysis showed that protein-DNA interfacesencode signals that can be described by a few residue-basedfeatures and that the way these signals are distributed can bedescribed using the support-core-rim model. It also confirmedthat protein-DNA interfaces are more conserved than protein-protein interfaces, as reported in (54, 55, 56). They displayspecific physico-chemical and geometrical characteristics thatvary depending on the type of protein-DNA interactionconsidered. Hence, the correct detection of these interfacesrequires the development of adapted scoring strategies.

Three strategies to detect protein-DNA interfacesFollowing the support-core-rim model (Fig. 1a, on top), weuse a predictive model comprising three components (Fig. 1a,bottom), namely the seed (in red), the extension (in orange)and the outer layer (in cyan). JET2

DNA implements differentalgorithms and scores to detect each one of these components(see Materials and Methods). To be able to predict a widerange of protein-DNA interfaces, we devised three scoringschemes (see D-SC1-3, Fig. 1b). Each D-SC combines TJET,PCDNA, CVlocal and CVglobal in a different way. As thesupport and the core may exchange their positions, the samecombination is used for seed detection and extension in all D-SC (Fig. 1b, same color for the two first layers). Specifically,

• D-SC1 first detects highly conserved residues withgood physico-chemical properties (TJET+ PCDNA)and completes the prediction with conserved locally

protruding residues (TJET+ (1 - CVlocal)). It aims atdetecting generic DNA-binding sites.

• D-SC2 clusters together conserved residues located inhighly concave regions (TJET+ CVglobal); then, theprediction is completed by adding an outer layer ofconserved residues displaying good physico-chemicalproperties (TJET+ PCDNA). It is designed to specificallydetect interfaces characteristic of the ”enveloping”binding mode displayed by polymerases.

• D-SC3 leaves out conservation, focusing on locallyprotruding residues displaying good physico-chemicalproperties (PCDNA + (1 - CVlocal)). It is intended todeal with cases where no evolutionary information isavailable or the whole protein displays a homogeneousconservation signal. Moreover, it is useful to completepredictions of the previous two scoring schemes, whensome residues are less or no conserved than the restof the binding site, sometimes because they are DNAsequence specific for a subfamily of the DNA-bindingprotein (see Fig. 3, bottom panel, and Discovery ofalternative DNA-binding sites).

Examples of predictions are shown on Fig. 3. D-SC1 isparticularly suited for double-headed interfaces stacking inthe DNA grooves and some single-headed interfaces. Thesebinding modes are often adopted by transcription factors,regulatory proteins and some glycosilases. The ”enveloping”mode captured by D-SC2 is found in many polymerases andnucleases. D-SC3 deals with cases where physico-chemicaland geometrical properties have a better discriminative powerthan conservation. By default, JET2

DNA will automaticallydetermine the scoring scheme the most suited to the proteinbeing analyzed (Fig. 1c). Alternatively, the user can applythe scoring scheme of his/her choice. Some DNA-bindingsites may be comprised of several regions exhibiting differentsignals. To correctly detect those, JET2

DNA implements a

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 7: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 7 — #7 ii

ii

ii

bioRxiv, 0000, Vol. 00, No. 00 7

Complex Exp. interface iJET2DNA

1J3E

:A2G

B7:

A,B

1TC

3:B

JET2 also allows the user to manually choose a particular scoring scheme. In the following wereport JET2 performance on a number of different interfaces.

Detecting lowly conserved interface residues based on the proteinsurface local geometryTo describe the local geometry of the protein surface, we use the measure of circular variance(CV) that evaluates the density of protein around an atom. This simple geometric descriptorcaptures the structural properties of interacting residues. To properly assess the predictive roleof CV, we compared JET2 and its iterative version iJET2 (see Materials and Methods) to JET/iJET, which use only sequence information. We applied both methods to two testing sets,namely PPDBv4 and the Huang dataset of 62 protein complexes [20] (S1 Table). One shouldnotice that PPDBv4 was also used for the analysis of the signals encoded in experimental inter-faces (see above). Although this analysis conceptually inspired the detection strategies imple-mented in JET2, the method was not trained on PPDBv4 and no JET2 parameter was set basedon PPDBv4 analysis. Hence, PPDBv4 could be used for assessing JET2 predictive power.

Lowly conserved interacting residues are found in the antigen-binding sites of the 25 anti-bodies from PPDBv4 (Fig 1C). JET2/iJET2 dramatically improves the detection of these inter-faces compared to JET/iJET (Table 1 and Fig 4A). The scoring scheme SC3 enables to increasesensitivity by almost 50 and precision by more than 30 for this group of proteins. The cases ofthe Fab fragment of an anti-VEGF antibody (1BJ1:R) and the shark new antigen receptorPBLA8 (2I25:R) illustrate the power of SC3 in predicting with high precision (PPV = 72% and60%) antigen-binding sites while iJET failed to correctly detect them (Fig 5).

Fig 3. Examples of interaction sites predicted by the three scoring schemes. The experimental complexes formed between the proteins of interest (lightgrey) and their partners (dark grey) are represented as cartoons. The experimental and predicted binding sites are displayed as opaque surfaces. Theexperimental interface residues are colored according to TJET values computed by iJET2. iJET2 predictions were obtained from a consensus of 2 runs out of10. They are colored according to the scoring scheme from which they were obtained: SC1 in orange, SC2 in purple and SC3 in cyan. The scoring schemesare indicated for each protein. They were automatically chosen by iJET2 clustering algorithm (first round). VORFFIP predictions are colored in dark green.

doi:10.1371/journal.pcbi.1004580.g003

Local Geometry and Conservation in Protein-Protein Interfaces

PLOS Computational Biology | DOI:10.1371/journal.pcbi.1004580 December 21, 2015 7 / 32

D-SC3

D-SC2

D-SC1

2DP6

:A

+Complete procedure

sens=87 PPV=83 F1=85

sens=67 PPV=95 F1=78

sens=64 PPV=88 F1=74

sens=86 PPV=41 F1=55

Figure 3. Examples of DNA-binding sites predicted by iJET2DNA. In the first column, the experimental complexes formed between the proteins of interest

(greyscaled colored chains) and the DNA (orange) are represented as cartoons. In second and third columns, the experimental and predicted DNA-binding sites aredisplayed as opaque surfaces, respectively. The experimental interface residues are colored according to conservation levels (TJETvalues) computed by iJET2

DNA.iJET2

DNA predictions, obtained from a consensus of 2 runs out of 10, are colored according to the scoring scheme: D-SC1 in orange, D-SC2 in dark green andD-SC3 in blue.

complete procedure where patches predicted in a first roundare complemented by patches predicted in a second roundand using a complementary scoring scheme (Fig. 1c). Anexample of prediction is given in Fig. 3 (bottom panel). Itmay also happen that the DNA-binding site is overlapping orin close proximity to a small ligand binding pocket. JET2

DNAimplements a procedure to specifically avoid these pockets(see Materials and methods). The impact of such procedurecan be appreciated on Supp. Fig. S16.

Overall assessment of JET2DNA performance

We report results obtained using the iterative mode of JET2DNA

(iJET2DNA, see Materials and Methods), with a consensus

of 2 and 8 iterations out of 10. For each protein, the bestpredicted patch or combination of patches among the threescoring schemes was retained for the evaluation. iJET2

DNAreaches an average F1-score of 61% on HR-PDNA187 andof 58-59% on a subset of 82 proteins (HOLO-APO82) forwhich unbound forms are available (Supp. Table S5 and Fig. 4,on top). The performance is similar on bound and unboundforms, indicating that JET2

DNA is robust to conformationalchanges associated with DNA-binding. For the vast majority

of proteins (82%), the sensitivity attained on the unbound formis at least as high as 90% of the sensitivity on the boundform, and in some cases it is even higher than 100%. Hence,JET2

DNA is able to detect interacting residues even whenthey are ’hidden’ by conformational changes. Moreover, heextent of the conformational change is not correlated with thedifference in performance (Supp. Fig. 17-19). The predictionsare also robust to stoichiometry changes (Supp. Table S6).Varying the consensus threshold enables shifting the balancebetween sensitivity (Sens) and precision (or predictive positivevalue, PPV), such that the lower threshold (2/10) yields moreextended predictions while the higher threshold (8/10) yieldsmore precise ones. The seeds on their own cover about a thirdof the experimental interfaces and contain very few positives(Supp. Table S7). On an independent test set comprising bothDNA- and RNA-binding sites (TEST-NABP82), iJET2

DNAachieves an average F1-score of 45 (Supp. Table S8 andFig. 4, at the bottom). The sensitivity is roughly the sameas for HR-PDNA187 but the precision (PPV) is lower. Thiscan be explained by the fact that the experimental referenceinterfaces are defined using a much more stringent distancecutoff (see Materials and Methods). iJET2

DNA detects equally

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 8: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 8 — #8 ii

ii

ii

8 bioRxiv, 0000, Vol. 00, No. 00

Figure 4. Comparison of iJET2DNA, DISPLAR, multiVORFFIP and DRNApred performance. For iJET2

DNA, consensus predictions were obtained from2 (in light green) and 8 (in dark green) runs out of 10. Top panel. HR-PDNA187 and APO82. To fairly assess DISPLAR, multiVORFFIP and DRNAPredperformances, the proteins used for training these methods were removed from HR-PDNA187. We used a sequence identity cutoff of 95% and ended up with106 (*), 87 (**) and 42 (***) proteins, respectively. Bottom panel. TEST-NABP82 and its subsets. TEST-DBP49 and TEST-DBP24 contain only DNA-bindingsites, while TEST-RBP33 contains only RNA-binding sites. TEST-DBP24 and TEST-RBP33 contain only proteins sharing less than 30% sequence identity withproteins from HR-PDNA187 (see Materials and Methods). DRNApred and multiVORFFIP were evaluated on their RNA-specific algorithms for the RNA-bindingproteins, and on their DNA-specific ones for the DNA-binding proteins. iJET2

DNA and DISPLAR, which do not have a specific algorithm to predict RNA-bindingsites, were evaluated on the same algorithm for both DNA- and RNA-binding proteins.

well DNA- and RNA-binding sites (Fig. 4, at the bottom,compare TEST-DBP49/24 and TEST-RBP33).

Discovery of alternative DNA-binding sitesThe complete and/or automated mode(s) of JET2

DNA producepatches that do not match the experimental interfaces presentin our datasets (Supp. Table S9). Due to the relatively smallnumber of protein-DNA complexes available in the PDB, itis difficult to systematically assess the pertinence of theseadditional predictions. Nevertheless, we could identify severalcases where the protein binds to DNA via two completelydistinct binding sites. We report four such cases in thefollowing (Fig. 5 and see Supp. Table S10 for F1 values).

1. RNA polymerase from bacteriophage T7. This proteinfirst binds the DNA promoter via its recognition site (26, 57)(site 1, Fig.5a, left), then undergoes a large conformationalchange and starts the transcription at its catalytic active site(27) (site 2, Fig.5b, left), where a short strand of RNA ispaired to the DNA strand. We considered the ensemble ofnucleic acid binding residues as experimental interface. Inboth structures, D-SC1 (and D-SC2 with comparable values)correctly detected the catalytic active site (Fig. 5a and b,right, orange) while D-SC3 predicted the recognition siteand some DNA-binding residues flanking the active site(Fig. 5a and b, right, blue). The fact that the two sitesare detected by two different D-SC indicates that they arecharacterized by different properties, which correlate withtheir respective functions. Indeed, the active site is genericwhile the recognition site is specific of this protein family(26). The very high conservation signal of the former masksthe weaker signal of the latter (Fig. 5, compare colors on a andb, left), which is still detectable based on the other properties.

2. N-terminal domain of the adeno-associated virusreplication protein. This protein contains three distinct DNA-binding sites (58): a stem loop sequence specific binding site(site 1, Fig. 5c, left), a tetranucleotide repeat recognition site(site 2, Fig. 5d, left), and the Tyr153 active site (no PDBstructure). D-SC1 and D-SC3 lead to accurate predictions ofsite 1 and site 2 (Fig. 5c and d), which both display rather lowconservation signal and are mostly protruding/exposed.

3. Modification-dependent restriction endonuclease. Inthe original structure from HR-PDNA187 (Fig. 5e, left), theDNA is bound only to the winged-helix domain bindingsite (59) (site 1), while the catalytic domain (site 2) ispartially disordered. In the alternative structure (Fig. 5f,left), both DNA-binding sites are occupied. Although therelative domain orientations differ drastically between the twostructures, JET2

DNA accurately identified the two binding sitesby combining D-SC1 with D-SC3 (Fig. 5e and f, right, inorange and in blue). The fact that a combination of D-SC isrequired reflects the heterogeneity of the conservation signalwithin each site (Fig. 5e and f, left, colored by conservationlevel). D-SC3 enables rescuing lowly conserved subregionsthat D-SC1 is not able to detect.

4. Cyclic GMP-AMP synthase. DNA binding to thisprotein is associated to a catalytic reaction producing acyclic dinucleotide from ATP and GTP. The protein wassolved bound to DNA both in monomeric (Fig. 5g, left(60)) and dimeric (Fig. 5h, left, (28)) forms. Since the smallligand-binding active site is much more conserved than theDNA-binding sites, this case is particularly difficult. In themonomeric form, only D-SC3 is able to identify the two DNA-binding sites (Fig. 5g, right, in blue). Upon dimerization,the interface geometry changes such that the two sites forma large concave region (Fig. 5h, left), detected by D-SC2.

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 9: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 9 — #9 ii

ii

ii

bioRxiv, 0000, Vol. 00, No. 00 9

Complementing the prediction with D-SC3 further improvesthe prediction (Fig. 5h, right, in dark green and in blue).

These examples illustrate the different usages of JET2DNA

multiple scoring schemes and how one can learn abouteach DNA-binding site properties from JET2

DNA predictions.Specifically, different D-SC may target different binding sites(recognition or catalytic), or different regions of the samebinding site (conserved core or protruding rim), or evenyield overlapping predictions. In all cases, JET2

DNA detectionproved to be robust to extensive conformational changes andstoichiometry changes.

Comparison with other toolsWe compared JET2

DNA with three other state-of-the-artmachine learning based predictors, namely DISPLAR (7),multiVORFFIP (15) and DRNApred (18) (see Materials andMethods). iJET2

DNA achieves ∼20% higher sensitivity andsimilar accuracy compared to DISPLAR and multiVORFFIPon all the datasets (Supp. Tables S5,8 and Fig. 4). Thepredictions are slightly less precise (by∼5-10%) than those ofmultiVORFFIP on HR-PDNA187** and APO82, but equallyor more precise than DISPLAR and multiVORFFIP on theother datasets (Supp. Tables S5,8 and Fig. 4). This resultsin ∼5-10% higher F1 values and most of the times ∼5-10%higher MCC values compared to both tools (Supp. Tables S5,8and Fig. 4). Compared to DRNApred, iJET2

DNA’s detectionis at least twice more sensitive and at least ∼10% moreprecise on all the datasets (Supp. Tables S5,8 and Fig. 4).We should stress that DRNApred was designed to specificallydiscriminate DNA- from RNA-binding residues, which is notthe purpose of JET2

DNA. In particular, JET2DNA performs much

better than the three other tools on the RNA-binding proteinsof TEST-RBP33 (Supp. Table S8 and Fig. 4). Moreover,when looking at the four above mentioned cases of multipleDNA-binding sites, JET2

DNA proved more powerful to detectalternative binding sites compared to the three other tools(Supp. Fig. 20-22 and Supp. Table S10).

Comparison with other benchmarksWe compared HR-PDNA187 and HOLO-APO82 with themost popular benchmarks, namely PDNA62 (6, 61), DBP374(13), MetaDBsite316 (51), Displar264 (7), PDDB1.2 (62),DBP206 (63), PDNA224 (10) and DNABINDPROT54 (24).Contrary to our datasets, most of them comprise only singlechains, even if the functional biological unit of the proteinin complex with DNA is annotated as a multimer. Moreover,when applying to them the PISCES criteria used to constructHR-PDNA187, the number of resulting complexes wassystematically smaller than 187 (Supp. Table S11). Finally,very few of them provide the APO forms of the proteins.

Robustness of the results to parameter changesJET2

DNA parameters were set empirically based on ourprevious experience with JET and JET2, our intuition andour analysis of the properties encoded in the experimentalinterfaces from HR-PDNA187. However, we should stress thatwe did not use machine learning to infer them. Hence, HR-PDNA187 is not a training set per se. Moreover, JET2

DNAparameters are few, physically explained and biologically

intuitive. To validate our choices, we conducted a thoroughanalysis of the impact of varying parameters on the predictions(Supp. Fig. S4-15). In total, we ran about 220 000 additionalJET2

DNA calculations to assess the influence of sevenparameters. These include the thresholds (% of residuesand CVlocal) used in the detection of small ligand pockets(Supp. Fig. S8-15), the residue and cluster thresholds used todetect and grow patches (Supp. Fig. S4-7) and the confidencethreshold used to filter out small patches (Supp. Fig. S5-7).This analysis showed that our choice of parameters is relevant,as the performances obtained with our default values are closeto the best one can expect within the JET2

DNA framework.Moreover, we found that our default values are consistentwith the intervals/regions they fall in and that there are noabrupt changes in performance within these regions. Hence,our predictions are stable to small parameter changes.

DISCUSSION

We have collected and carefully curated 187 high resolutionprotein-DNA complexes representative of all known typesof protein-DNA interactions. This new dataset, supplementedby the 82 available protein unbound conformations, couldserve as a reference benchmark for the community. Based onthe analysis of the evolutionary and structural properties ofthe DNA-binding sites from this dataset, we have proposedan original method to predict them. Importantly, we do notestimate interacting probabilities for individual residues, butwe predict ensembles of residues proximal in 3D space,in other words ”patches”. Predictions based on patchesare justified by the fact that residues being conserved ordisplaying specific physico-chemical properties at protein-DNA interfaces tend to cluster together (55, 64, 65).Moreover, we propose a ”discretized model” of interfacesdescribing the role of each predicted residue in the interface.This model is inspired by the support-core-rim model definedfor protein-protein interfaces. We have shown that it isalso useful for protein-DNA interfaces, where the supportand the core may switch their positions. We have definedthree archetypal protein-DNA interfaces, namely conservedgeneric, conserved enveloping and not conserved, and havedevised three scoring schemes to detect them.

We have thoroughly assessed our method’s performanceon bound and unbound protein forms and on an independentdataset. JET2

DNA compares favorably with machine learningbased state-of-the-art methods. Specifically, the predictionsare more sensitive and remarkably robust to extensiveconformational changes and to stochiometry changes.Moreover, we have highlighted several cases where the samecomplex was solved in different conditions or the sameprotein was able to bind DNA via different locations. JET2

DNAwas able to detect the ’alternative’ binding sites despitemajor conformational changes between the different structuresbound to DNA. A single crystallographic structure may revealonly one site or may even comprise a truncated or misplacedDNA, resulting in a ”partial” associated binding site. In thiscontext, JET2

DNA can be used as a mean to get a more realisticdescription of known binding sites, when those are onlypartially covered experimentally, and to discover yet unknownbinding sites.

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 10: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 10 — #10 ii

ii

ii

10 bioRxiv, 0000, Vol. 00, No. 00

HR-PDNA187 cplx(a) (b)

iJET2DNA iJET2DNAAlternative cplxsite 1 site 1

site 2 site 2

(c) (d)site 1 site 1

site 2site 2

site 1

site 2site 2

site 2Asite 2

PDB: 1CEZ PDB: 1MSW

PDB: 1UUT PDB: 1RZ9

(e)

(g)

PDB: 4ESJ PDB: 4KYW

(f)

(h)

PDB: 4K98 PDB: 4LEY

site 2B

D-SC1 D-SC3D-SC2Mapped alternative interface

JET2also

allowsthe

userto

manually

chooseaparticularscoring

scheme.In

thefollow

ingwe

reportJET2perform

anceon

anum

berofdifferentinterfaces.

Detecting

lowlyconserved

interfaceresidues

basedon

theprotein

surfacelocalgeom

etryTodescribe

thelocalgeom

etryofthe

proteinsurface,w

euse

themeasure

ofcircularvariance

(CV)thatevaluates

thedensity

ofproteinaround

anatom

.This

simple

geometric

descriptorcaptures

thestructuralproperties

ofinteractingresidues.T

oproperly

assessthe

predictiverole

ofCV,w

ecom

paredJET

2anditsiterative

versioniJET

2(seeMaterials

andMethods)to

JET/

iJET,w

hichuse

onlysequence

information.W

eapplied

bothmethods

totwotesting

sets,nam

elyPPD

Bv4and

theHuang

datasetof62protein

complexes

[20](S1Table).O

neshould

noticethatPPD

Bv4

was

alsoused

forthe

analysisofthe

signalsencoded

inexperim

entalinter-faces

(seeabove).A

lthoughthis

analysisconceptually

inspiredthe

detectionstrategies

imple-

mented

inJET

2,themethod

was

nottrainedon

PPDBv4

andno

JET2param

eterwas

setbasedon

PPDBv4

analysis.Hence,PPD

Bv4could

beused

forassessing

JET2predictive

power.

Lowlyconserved

interactingresiduesare

foundin

theantigen-binding

sitesofthe

25anti-

bodiesfrom

PPDBv4

(Fig1C

).JET2/iJET

2dramatically

improves

thedetection

oftheseinter-

facescom

paredto

JET/iJET

(Table

1and

Fig4A

).The

scoringschem

eSC

3enables

toincrease

sensitivityby

almost50

andprecision

bymore

than30

forthis

groupofproteins.T

hecases

ofthe

Fabfragm

entofananti-V

EGFantibody

(1BJ1:R)and

theshark

newantigen

receptorPBLA

8(2I25:R

)illustratethe

power

ofSC3in

predictingwith

highprecision

(PPV=72%

and60%

)antigen-bindingsites

while

iJETfailed

tocorrectly

detectthem(Fig

5).

Fig3.E

xamples

ofinteractionsites

predictedby

thethree

scoringschem

es.Theexperim

entalcomplexes

formed

between

theproteins

ofinterest(lightgrey)and

theirpartners(dark

grey)arerepresented

ascartoons.The

experimentaland

predictedbinding

sitesare

displayedas

opaquesurfaces.The

experimentalinterface

residuesare

coloredaccording

toTJE

Tvalues

computed

byiJE

T2.iJE

T2predictions

were

obtainedfrom

aconsensus

of2runs

outof10.They

arecolored

accordingto

thescoring

schemefrom

which

theywere

obtained:SC1inorange,S

C2inpurple

andSC3incyan.The

scoringschem

esare

indicatedforeach

protein.Theywere

automatically

chosenby

iJET2clustering

algorithm(firstround).V

ORFFIP

predictionsare

coloredindark

green.

doi:10.1371/journal.pcbi.1004580.g003

LocalGeom

etryand

Conservation

inProtein-P

roteinInterfaces

PLO

SCom

putationalBiology

|DOI:10.1371/journal.pcbi.1004580

Decem

ber21,20157/32

0 0.2 0.4 0.6 0.8 1

site 1

site 1 site 1A

site 1B

Figure 5. Prediction of multiple DNA-binding sites by iJET2DNA. For each protein (each row), the experimental complex present in HR-PDNA187 and another

experimental complex displaying a distinct DNA-binding site are shown on the first and third columns, respectively. For each structure, the ’main’ DNA-bindingsite is colored according to TJETvalues and the site coming from the other structure is mapped and displayed in transparent grey. The iJET2

DNA predictionscomputed on each experimental structure are displayed on the second and fourth columns, respectively, and colored according to the scoring scheme. (a-b) RNApolymerase from bacteriophage T7 (PDB: 1CEZ and 1MSW); (c-d) N-terminal domain of the adeno-associated virus replication protein (1UUT and 1RZ9); (e-f)R.DpnI modification-dependent restriction endonuclease (4ESJ and 4KYW); (g-h) Cyclic GMP-AMP synthase (cGAS) (4K98 and 4LEY). iJET2

DNA predictedpatches were obtained from a consensus of 2, for (a-b) and (e-f), or 5, for (c-d) and (g-h), runs out of 10. See Supp. Fig. 20-22 and Supp. Table S10 for comparisonwith other tools.

Beyond predicting DNA-binding sites, JET2DNA provides

a unique way to understand the origins and properties ofthese sites and interpret those in light of their functions.Transcription factors typically display single- or double-headed binding modes (66), with one or two highly conservedbinding sites, which are well detected by D-SC1. Enzymesusually have larger interfaces to accommodate an exposedrecognition site, detected by D-SC3, and a highly conserved

active site, detected by D-SC1, or highly segmented protein-DNA interfaces, where the protein interacts with the DNAthrough multidomain units in addition to their active site(25, 66, 67). Moreover, JET2

DNA is useful to unravel theheterogeneity of signals comprised within given bindingsites and to partition them in subregions displaying coherentproperties. The predictions can help designing or repurposingsmall molecules to target protein-DNA interfaces in an

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 11: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 11 — #11 ii

ii

ii

bioRxiv, 0000, Vol. 00, No. 00 11

intelligent way, e.g. specifically targeting the non-conservedsubregions to avoid side effects.

There is growing experimental evidence that proteinsurfaces are used in many different ways by many partners.For example, for protein-protein interactions, it is now clearthat several proteins can use the same region at the surface of apartner, possibly in different conformations, that binding sitescan be overlapping and that different binding site propertiesrelate to different interaction “types” or “functions” (32). Forprotein-DNA interactions, the available structural data is muchsmaller. Still, the examples we showed let us envision a muchlarger complexity in the usage of protein surfaces by DNAthan expected. Our work is inscribed in an effort to deciphersuch complexity.

Finally, we found that JET2DNA can be successfully applied

to the prediction of RNA-binding sites. This is in linewith the growing body of evidence showing that proteinsthat bind DNA are also likely to bind RNA. Indeed,although DNA-binding proteins used to be considered asfunctionally different from RNA-binding proteins and studiedindependently, this view has become outdated (29). Thereare many examples of proteins binding both nucleic acidseither via the same region at different times or simultaneouslyvia distinct regions. Some crystallographic structures ofcomplexes between proteins and hybrid D/RNA molecules arealso available in the PDB. Hence, discriminating DNA- fromRNA-binding residues is very challenging. The DRNApredmethod (18) represents a recent effort to address this issue. Wefound that it predicts very few residues, compared to JET2

DNA.A future development for JET2

DNA could be to include someRNA-specific features toward a better prediction of RNA-binding sites. It could be advantageous to include disorderinformation, an electrostatic potential in addition or replacingthe residue propensities and/or restricting the calculation ofthe evolutionary conservation to the subfamily associated tothe query protein. This could help in identifying family- oreven protein-specific interfaces.

ACKNOWLEDGEMENTS

We thank Chloe Dequeker for sharing some of her scripts, N.Fernandez-Fuentes for indicating good practice for using themultiVORFFIP method and DISPLAR’s authors for helpingus in running the tool.

FUNDING

LabEx CALSIMLAB (public grant ANR-11-LABX-0037-01 constituting a part of the ”Investissements d’Avenir”program - reference: ANR-11-IDEX-0004-02) (FC); theInstitut Universitaire de France (AC).

Conflict of interest statement. None declared.

REFERENCES

1. N. M. Luscombe, S. E. Austin, H. M. Berman, and J. M. Thornton, “Anoverview of the structures of protein-dna complexes,” Genome Biology,vol. 1, no. 1, pp. reviews001–1, 2000.

2. A. N. Bullock and A. R. Fersht, “Rescuing the function of mutant p53,”Nature Reviews Cancer, vol. 1, no. 1, p. 68, 2001.

3. A. S. Chen-Plotkin, V. M.-Y. Lee, and J. Q. Trojanowski, “Tar dna-bindingprotein 43 in neurodegenerative disease,” Nature Reviews Neurology,vol. 6, no. 4, p. 211, 2010.

4. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat,H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The protein data bank,”Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.

5. S. Ahmad and A. Sarai, “Pssm-based prediction of dna binding sites inproteins,” BMC Bioinformatics, vol. 6, no. 1, p. 33, 2005.

6. S. Ahmad, M. M. Gromiha, and A. Sarai, “Analysis and prediction ofdna-binding proteins and their binding residues based on composition,sequence and structural information,” Bioinformatics, vol. 20, no. 4,pp. 477–486, 2004.

7. H. Tjong and H.-X. Zhou, “Displar: an accurate method for predictingdna-binding sites on protein surfaces,” Nucleic Acids Research, vol. 35,no. 5, pp. 1465–1477, 2007.

8. L. Wang and S. J. Brown, “Bindn: a web-based tool for efficient predictionof dna and rna binding sites in amino acid sequences,” Nucleic AcidsResearch, vol. 34, no. suppl 2, pp. W243–W248, 2006.

9. L. Wang, C. Huang, M. Q. Yang, and J. Y. Yang, “Bindn+ for accurateprediction of dna and rna-binding residues from protein sequencefeatures,” BMC Systems Biology, vol. 4, no. 1, p. S3, 2010.

10. T. Li, Q.-Z. Li, S. Liu, G.-L. Fan, Y.-C. Zuo, and Y. Peng, “Predna:accurate prediction of dna-binding sites in proteins by integratingsequence and geometric structure information,” Bioinformatics, vol. 29,no. 6, pp. 678–685, 2013.

11. S. Hwang, Z. Gou, and I. B. Kuznetsov, “Dp-bind: a web serverfor sequence-based prediction of dna-binding residues in dna-bindingproteins,” Bioinformatics, vol. 23, no. 5, pp. 634–636, 2007.

12. L. Wang, M. Q. Yang, and J. Y. Yang, “Prediction of dna-bindingresidues from protein sequence information using random forests,” BMCGenomics, vol. 10, no. 1, p. S1, 2009.

13. J. Wu, H. Liu, X. Duan, Y. Ding, H. Wu, Y. Bai, and X. Sun, “Predictionof dna-binding residues in proteins from amino acid sequences using arandom forest model with a hybrid feature,” Bioinformatics, vol. 25, no. 1,pp. 30–35, 2008.

14. G. Nimrod, M. Schushan, A. Szilagyi, C. Leslie, and N. Ben-Tal, “idbps: aweb server for the identification of dna binding proteins,” Bioinformatics,vol. 26, no. 5, pp. 692–693, 2010.

15. J. Segura, P. F. Jones, and N. Fernandez-Fuentes, “A holistic in silicoapproach to predict functional sites in protein structures,” Bioinformatics,vol. 28, no. 14, pp. 1845–1850, 2012.

16. J. Zhou, R. Xu, Y. He, Q. Lu, H. Wang, and B. Kong, “Pdnasite:identification of dna-binding site from protein sequence by incorporatingspatial and sequence context,” Scientific reports, vol. 6, p. 27653, 2016.

17. X. Zhu, S. S. Ericksen, and J. C. Mitchell, “Dbsi: Dna-binding siteidentifier,” Nucleic acids research, vol. 41, no. 16, pp. e160–e160, 2013.

18. J. Yan and L. Kurgan, “Drnapred, fast sequence-based method thataccurately predicts and discriminates dna-and rna-binding residues,”Nucleic acids research, vol. 45, no. 10, pp. e84–e84, 2017.

19. Z. Miao and E. Westhof, “Prediction of nucleic acid binding probabilityin proteins: a neighboring residue network based score,” Nucleic acidsresearch, vol. 43, no. 11, pp. 5340–5351, 2015.

20. Z. Miao and E. Westhof, “Rbscore&nbench: a high-level web server fornucleic acid binding residues prediction with a large-scale benchmarkingdatabase,” Nucleic acids research, vol. 44, no. W1, pp. W562–W567,2016.

21. Y. Tsuchiya, K. Kinoshita, and H. Nakamura, “Preds: a serverfor predicting dsdna-binding site on protein molecular surfaces,”Bioinformatics, vol. 21, no. 8, pp. 1721–1723, 2004.

22. Y. C. Chen, J. D. Wright, and C. Lim, “Dr bind: a web server forpredicting dna-binding residues from the protein structure based onelectrostatics, evolution and geometry,” Nucleic Acids Research, vol. 40,no. W1, pp. W249–W256, 2012.

23. M. Gao and J. Skolnick, “Dbd-hunter: a knowledge-based method for theprediction of dna–protein interactions,” Nucleic Acids Research, vol. 36,no. 12, pp. 3978–3992, 2008.

24. P. Ozbek, S. Soner, B. Erman, and T. Haliloglu, “Dnabindprot:fluctuation-based predictor of dna-binding residues within a networkof interacting residues,” Nucleic Acids Research, vol. 38, no. suppl 2,pp. W417–W423, 2010.

25. S. J. Wodak and J. Janin, “Structural basis of macromolecularrecognition,” Advances in Protein Chemistry, vol. 61, pp. 9–73, 2002.

26. G. M. Cheetham, D. Jeruzalmi, and T. A. Steitz, “Structural basis forinitiation of transcription from an rna polymerase–promoter complex,”

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint

Page 12: Multiple protein-DNA interfaces unravelled by evolutionary ...Aug 22, 2019  · Nucleic Acid Database (34) (February 2016 release). This list was filtered using PISCES sequence culling

ii

“JET2DNA” — 2019/8/22 — 11:28 — page 12 — #12 ii

ii

ii

12 bioRxiv, 0000, Vol. 00, No. 00

Nature, vol. 399, no. 6731, p. 80, 1999.27. Y. W. Yin and T. A. Steitz, “Structural basis for the transition from

initiation to elongation transcription in t7 rna polymerase,” Science,vol. 298, no. 5597, pp. 1387–1395, 2002.

28. X. Li, C. Shu, G. Yi, C. T. Chaton, C. L. Shelton, J. Diao, X. Zuo, C. C.Kao, A. B. Herr, and P. Li, “Cyclic gmp-amp synthase is activated bydouble-stranded dna-induced oligomerization,” Immunity, vol. 39, no. 6,pp. 1019–1031, 2013.

29. W. H. Hudson and E. A. Ortlund, “The structure, function and evolutionof proteins that bind DNA and RNA,” Nat. Rev. Mol. Cell Biol., vol. 15,pp. 749–760, Nov 2014.

30. K. Mierzejewska, W. Siwek, H. Czapinska, M. Kaus-Drobek,M. Radlinska, K. Skowronek, J. M. Bujnicki, M. Dadlez, and M. Bochtler,“Structural basis of the methylation specificity of r. dpni,” Nucleic AcidsResearch, vol. 42, no. 13, pp. 8745–8754, 2014.

31. E. Laine and A. Carbone, “Local geometry and evolutionary conservationof protein surfaces reveal the multiple recognition patches in protein-protein interactions,” PLoS Computational Biology, vol. 11, no. 12,p. e1004580, 2015.

32. C. Dequeker, E. Laine, and A. Carbone, “Decrypting protein surfaces bycombining evolution, geometry, and molecular docking,” Proteins, Jun2019.

33. E. D. Levy, “A simple definition of structural regions in proteins andits use in analyzing interface evolution,” Journal of Molecular Biology,vol. 403, no. 4, pp. 660–670, 2010.

34. H. M. Berman, W. K. Olson, D. L. Beveridge, J. Westbrook, A. Gelbin,T. Demeny, S.-H. Hsieh, A. Srinivasan, and B. Schneider, “The nucleicacid database. a comprehensive relational database of three-dimensionalstructures of nucleic acids,” Biophysical Journal, vol. 63, no. 3, pp. 751–759, 1992.

35. G. Wang and R. L. Dunbrack Jr, “Pisces: a protein sequence cullingserver,” Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

36. C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos,K. Bealer, and T. L. Madden, “Blast+: architecture and applications,”BMC Bioinformatics, vol. 10, no. 1, p. 421, 2009.

37. U. Consortium, “Uniprot: the universal protein knowledgebase,” NucleicAcids Research, vol. 45, no. D1, pp. D158–D169, 2016.

38. C. J. Chou, G. Wisedchaisri, R. R. Monfeli, D. M. Oram, R. K. Holmes,W. G. Hol, and C. Beeson, “Functional studies of the mycobacteriumtuberculosis iron-dependent regulator,” Journal of Biological Chemistry,vol. 279, no. 51, pp. 53554–53561, 2004.

39. S. Hubbard and J. Thornton, “Naccess, 2.1. 1,” Dept of Biochemistry andMolecular Biology: University College London, 1993.

40. J. Yan, S. Friedrich, and L. Kurgan, “A comprehensive comparativereview of sequence-based predictors of dna-and rna-binding residues,”Briefings in bioinformatics, vol. 17, no. 1, pp. 88–105, 2015.

41. K. Chen, M. J. Mizianty, J. Gao, and L. Kurgan, “A critical comparativeassessment of predictions of protein-binding sites for biologically relevantorganic compounds,” Structure, vol. 19, no. 5, pp. 613–621, 2011.

42. S. Engelen, L. A. Trojan, S. Sacquin-Mora, R. Lavery, and A. Carbone,“Joint evolutionary trees: a large-scale method to predict proteininterfaces based on sequence sampling,” PLoS Computational Biology,vol. 5, no. 1, p. e1000267, 2009.

43. I. Mihalek, I. Res, and O. Lichtarge, “A family of evolution–entropyhybrid methods for ranking protein residues by importance,” Journal ofMolecular Biology, vol. 336, no. 5, pp. 1265–1282, 2004.

44. O. Lichtarge, H. R. Bourne, and F. E. Cohen, “An evolutionary tracemethod defines binding surfaces common to protein families,” Journalof Molecular Biology, vol. 257, no. 2, pp. 342–358, 1996.

45. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang,W. Miller, and D. J. Lipman, “Gapped blast and psi-blast: a newgeneration of protein database search programs,” Nucleic Acids Research,vol. 25, no. 17, pp. 3389–3402, 1997.

46. B. Park, H. Kim, and K. Han, “Dbbp: database of binding pairs in protein-nucleic acid interactions,” BMC Bioinformatics, vol. 15, no. 15, p. S5,2014.

47. N. Ceres, M. Pasi, and R. Lavery, “A protein solvation model based onresidue burial,” Journal of Chemical Theory and Computation, vol. 8,no. 6, pp. 2141–2144, 2012.

48. M. Mezei, “A new method for mapping macromolecular topography,”Journal of Molecular Graphics and Modelling, vol. 21, no. 5, pp. 463–472, 2003.

49. F. Wilcoxon, “Individual comparisons by ranking methods,” BiometricsBulletin, vol. 1, no. 6, pp. 80–83, 1945.

50. T. W. Anderson, D. A. Darling, et al., “Asymptotic theory of certain”goodness of fit” criteria based on stochastic processes,” The annals ofmathematical statistics, vol. 23, no. 2, pp. 193–212, 1952.

51. J. Si, Z. Zhang, B. Lin, M. Schroeder, and B. Huang, “Metadbsite: a metaapproach to improve protein dna-binding sites prediction,” BMC SystemsBiology, vol. 5, no. 1, p. S7, 2011.

52. Z. Miao and E. Westhof, “A large-scale assessment of nucleic acidsbinding site prediction programs,” PLoS computational biology, vol. 11,no. 12, p. e1004639, 2015.

53. H. Hwang, T. Vreven, J. Janin, and Z. Weng, “Protein–proteindocking benchmark version 4.0,” Proteins: Structure, Function, andBioinformatics, vol. 78, no. 15, pp. 3111–3114, 2010.

54. S. Biswas, M. Guharoy, and P. Chakrabarti, “Dissection, residueconservation, and structural classification of protein-dna interfaces,”Proteins: Structure, Function, and Bioinformatics, vol. 74, no. 3, pp. 643–654, 2009.

55. S. Ahmad, O. Keskin, A. Sarai, and R. Nussinov, “Protein–dnainteractions: structural, thermodynamic and clustering patterns ofconserved residues in dna-binding proteins,” Nucleic Acids Research,vol. 36, no. 18, pp. 5922–5932, 2008.

56. N. M. Luscombe and J. M. Thornton, “Protein–dna interactions: aminoacid conservation and the effects of mutations on binding specificity,”Journal of Molecular Biology, vol. 320, no. 5, pp. 991–1009, 2002.

57. K. J. Durniak, S. Bailey, and T. A. Steitz, “The structure of a transcribingt7 rna polymerase in transition from initiation to elongation,” Science,vol. 322, no. 5901, pp. 553–557, 2008.

58. A. B. Hickman, D. R. Ronning, Z. N. Perez, R. M. Kotin, andF. Dyda, “The nuclease domain of adeno-associated virus rep coordinatesreplication initiation using two distinct dna recognition interfaces,”Molecular Cell, vol. 13, no. 3, pp. 403–414, 2004.

59. W. Siwek, H. Czapinska, M. Bochtler, J. M. Bujnicki, and K. Skowronek,“Crystal structure and mechanism of action of the n6-methyladenine-dependent type iim restriction endonuclease r. dpni,” Nucleic AcidsResearch, vol. 40, no. 15, pp. 7563–7572, 2012.

60. P. Gao, M. Ascano, Y. Wu, W. Barchet, B. L. Gaffney, T. Zillinger, A. A.Serganov, Y. Liu, R. A. Jones, G. Hartmann, et al., “Cyclic g (2’, 5’) pa(3’, 5’) p is the metazoan second messenger produced by dna-activatedcyclic gmp-amp synthase,” Cell, vol. 153, no. 5, pp. 1094–1107, 2013.

61. S. Selvaraj, H. Kono, and A. Sarai, “Specificity of protein–dna recognitionrevealed by structure-based potentials: symmetric/asymmetric andcognate/non-cognate binding,” Journal of Molecular Biology, vol. 322,no. 5, pp. 907–915, 2002.

62. M. Van Dijk and A. M. Bonvin, “A protein–dna docking benchmark,”Nucleic Acids Research, vol. 36, no. 14, pp. e88–e88, 2008.

63. Y. Xiong, J. Liu, and D.-Q. Wei, “An accurate feature-based method foridentifying dna-binding residues on protein surfaces,” Proteins: Structure,Function, and Bioinformatics, vol. 79, no. 2, pp. 509–517, 2011.

64. S. Dey, A. Pal, M. Guharoy, S. Sonavane, and P. Chakrabarti,“Characterization and prediction of the binding site in dna-bindingproteins: improvement of accuracy by combining residue composition,evolutionary conservation and structural parameters,” Nucleic AcidsResearch, vol. 40, no. 15, pp. 7150–7161, 2012.

65. S. Jones, H. P. Shanahan, H. M. Berman, and J. M. Thornton, “Usingelectrostatic potentials to predict dna-binding sites on dna-bindingproteins,” Nucleic Acids Research, vol. 31, no. 24, pp. 7189–7198, 2003.

66. S. Jones, P. van Heyningen, H. M. Berman, and J. M. Thornton, “Protein-dna interactions: A structural analysis1,” Journal of Molecular Biology,vol. 287, no. 5, pp. 877–896, 1999.

67. K. Nadassy, S. J. Wodak, and J. Janin, “Structural features of protein-nucleic acid recognition sites,” Biochemistry, vol. 38, no. 7, pp. 1999–2017, 1999.

.CC-BY-NC-ND 4.0 International licenseavailable under anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which wasthis version posted August 22, 2019. ; https://doi.org/10.1101/743617doi: bioRxiv preprint