Post on 18-Dec-2015
5th Meeting on U.S. Government Chemical Databases and Open Chemistry, Frederick, Maryland, August 25-26, 2011
The PDBbind Database: A Comprehensive Collection of the Binding Data and Structures of the Complexes in the Protein Data Bank
Renxiao Wang
Outline
What is the PDBbind database, and why develop it?
How is the PDBbind database compiled?
What information is provided by the PDBbind database?
Possible applications of the PDBbind database
Protein Data Bank
Biomolecular complexes
Complexes with binding data
PDBbind web site
What is the PDBbind Database?
(1) Complexes formed between small-molecule ligands and biomacromolecules, and (2) those between biomacromolecules.
Structural information and binding data
http://www.pdbbind-cn.org/
Both structural and energetic information are indispensable for an in-depth understanding of the recognition between small molecules and biological macromolecules.
Why Create the PDBbind Database?
It is especially important for the development and calibration of computational methods for the estimation of protein-ligand binding affinity.
Why Create the PDBbind Database?
• Three-dimensional structures of biomolecular complexes are available from the Protein Data Bank :
• More than 74,000 structures have been deposited in PDB by Aug 1st, 2011. Nearly half of them are complexes of all types.
• However, the binding affinity data of these complexes, where available, used to be scattered across the literature and were thus difficult to access.
• Before PDBbind, no other database had attempted to collect such binding data in a systematic manner.
The PDBbind database aims at providing a comprehensive collection of the binding data for all types of biomolecular complexes in PDB.
Why Create the PDBbind Database?
The old approach: Assemble the data sets reported by other researchers.
For example, the X-Score scoring function was developed using a set of 230 protein-ligand complexes with known binding data. This data set, compiled by assembling several smaller data sets reported previously, was the largest collection of its type at that time.
Wang R. et al., J. Comput.-Aided Mol. Des. 2002, 16, 11-26.
Disadvantages of this approach
It is difficult to verify those binding data since original references are often not given: some data are IC50 values, some are not binding affinity data, and there are even typographical errors!
Regular updates are not possible.
History of the PDBbind Database
Apr, 2001: Preliminary trial & launch of the project (University of Michigan)
May, 2004: PDBbind v.2004 was publicly released at http://www.pdbbind.org/ (University of Michigan)
v.2005 and v.2006 were released.
Nov, 2007: The PDBbind-CN server was launched at http://www.pdbbind-cn.org/ (Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences)
v.2008, v.2009, and v.2010 were released.
Aug, 2011: The current version (v.2011), providing binding data for ~8,000 complexes in PDB
(1) Wang, R. et al. J. Med. Chem. 2004, 47, 2977-2980.
(2) Wang, R. et al. J. Med. Chem. 2005, 48, 4111-4119.
(3) Cheng, T. et al. J. Chem. Inf. Model. 2009, 49, 1079-1093.
How is the PDBbind Database Compiled?
The entire PDB
→ I. Classification of complexes → Biomolecular complexes
→ II. Collection of binding data from original references → Complexes with binding data
→ III. Data processing & web design → Integrate into the PDBbind web site
Step I. Classification of Biomolecular Complexes
The entire classification process is automated by a set of computer programs.
[Flowchart: a given PDB entry is tested with the yes/no questions "Contains a protein?", "Contains a nucleic acid?", "Contains a small molecule?", and "Contains two proteins?". The possible outcomes are: protein-protein complex, protein-nucleic acid complex, protein-ligand complex, nucleic acid-ligand complex, apo-protein, apo-nucleic acid, and misc. oligomer.]
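The flowchart can be read as a short decision procedure. The following is a minimal Python sketch, not the authors' program: the order of the questions and the wiring of the YES/NO branches are assumptions reconstructed from the questions and outcomes named on the slide, and the `Entry` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical summary of the molecules found in one PDB entry."""
    n_proteins: int = 0
    n_nucleic_acids: int = 0
    n_small_molecules: int = 0


def classify(entry: Entry) -> str:
    """Assign a single category. The branch order is an assumption."""
    if entry.n_proteins > 0:
        if entry.n_nucleic_acids > 0:
            return "protein-nucleic acid complex"
        if entry.n_small_molecules > 0:
            return "protein-ligand complex"
        if entry.n_proteins >= 2:
            return "protein-protein complex"
        return "apo-protein"
    if entry.n_nucleic_acids > 0:
        if entry.n_small_molecules > 0:
            return "nucleic acid-ligand complex"
        return "apo-nucleic acid"
    # Assumed fallback: neither a protein nor a nucleic acid was found.
    return "misc. oligomer"


print(classify(Entry(n_proteins=1, n_small_molecules=1)))
```

This sketch returns exactly one category per entry, which is a simplification; a real entry may reasonably belong to more than one category.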
Classification of the Entire Protein Data Bank
[Pie chart of PDB entries by category: apo-proteins, protein-ligand complexes, special protein-ligand complexes (cofactor-containing), protein-nucleic acid complexes, protein-protein complexes, nucleic acid-ligand complexes, and apo-nucleic acids. Slice values as extracted: 1570 (2.25%), 620 (0.89%), 4580 (6.58%), 2912 (4.18%), 7067 (10.15%), 22147 (31.8%), 30744 (44.15%).]
* Based on the PDB contents released by Jan 1st 2011, 70,224 entries in total
Binding affinity data of a given complex could be reported or cited in the “primary citation” of the PDB file (success rate 30%).
Step II. Collection of Binding Data from Literature
Accepted binding affinity data include dissociation constants (Kd), inhibition constants (Ki), and concentrations at 50% inhibition (IC50).
A computer program was developed to process the PDF files and filter out the papers containing no binding data.
Each remaining paper is then examined independently by two persons. Consensus must be reached before the binding data are recorded.
Collection of Binding Data from Literature
Collection of Binding Data from Literature
• Over 17,800 references have been processed so far.
• Each primary reference is saved as a PDF file, in which the binding data are clearly marked.
• Mistakes are still possible during manual data curation. Nevertheless, >98% of the binding data in PDBbind are correct.
The primary reference for PDB entry 1BXO
Outcomes of Binding Data Collection
PDBbind v.2011 includes binding data for 7,991 complexes in PDB.
[Diagram relating proteins, small molecules, and nucleic acids, labeled with the complex counts 6,070, 1,427, 428, and 66, which sum to 7,991.]
Updates of the PDBbind Database
Version | Entries in PDB | PDBbind: valid complexes | PDBbind: complexes with binding data | PDBbind: refined set | PDBbind: core set
2004 | 28,991 | 6,847 | 2,276 | 1,091 | 231
2005 | 34,338 | 9,775 | 2,756 | 1,296 | 288
2006 | 34,338 | 9,775 | 2,632 | 1,122 | 234
2007 | 40,876 | 11,822 | 3,124 | 1,300 | 195
2008 | 48,092 | 18,211 | 4,300 | 1,401 | 210
2009 | 55,069 | 23,284 | 5,678 | 1,741 | 219
2010 | 62,387 | 26,434 | 6,772 | 2,061 | 231
2011 | 70,224 | 30,259 | 7,991 | 2,476 | 243
It is critical to update PDBbind regularly to keep up with the constant growth of PDB. PDBbind is now updated annually, and it grows by 20-30% each year.
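The year-over-year change can be checked directly against the "Complexes with binding data" column of the table. A small Python sketch (the variable and function names are illustrative, not part of PDBbind):

```python
# Complexes with binding data per PDBbind version, copied from the table.
with_binding_data = {
    2004: 2276, 2005: 2756, 2006: 2632, 2007: 3124,
    2008: 4300, 2009: 5678, 2010: 6772, 2011: 7991,
}


def yearly_growth(counts):
    """Return {version: percent change relative to the previous version}."""
    years = sorted(counts)
    return {
        year: 100.0 * (counts[year] - counts[prev]) / counts[prev]
        for prev, year in zip(years, years[1:])
    }


for year, pct in yearly_growth(with_binding_data).items():
    print(f"v.{year}: {pct:+.1f}%")
```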
Step III. Build the PDBbind-CN Web Site
PDBbind-CN: http://www.pdbbind-cn.org/
[Diagram: PDBbind-CN holds the structures and binding data of the biomolecular complexes in PDB. Functions: browse information, search information, download information, deposit binding data. Other resources shown: RCSB PDB, PDBsum, PubMed, PubChem.]
The basic information of each complex is summarized on a single page.
On-line Information @ PDBbind-CN
Multiple display modes are provided by ChemAxon and JMol Java applets on the web interface of PDBbind-CN.
Binding data can be searched with various types of queries. Results are presented in well-organized form and can be exported in either PDF or Excel format.
On-line Search @ PDBbind-CN
On-line Search @ PDBbind-CN
Substructure/similarity search among the small-molecule ligands in all protein-ligand complexes in PDB (>12,000 entities), not limited to those with known binding data.
On-line Search @ PDBbind-CN
Similarity search among all protein and nucleic acid sequences in PDB, not limited to those with known binding data.
What can be downloaded from PDBbind-CN?
Tables of binding data for all categories of complexes.
“Clean” structural files of most of the protein-ligand complexes with known binding data (6,023 in v.2011), which can be readily utilized by most molecular modeling software.
• A complete “biological unit” of each complex is split into a protein molecule and a ligand molecule.
• The protein molecule is saved in the PDB format and the ligand molecule is saved in the SYBYL Mol2 format after necessary processing.
The “refined set” and the “core set” of selected protein-ligand complexes, providing a high-quality benchmark for docking/scoring studies.
Selection of the Refined Set
The refined set consists of protein-ligand complexes meeting higher standards:
• Quality of the structure: crystal structures with resolution < 2.5 Å and R-factor < 0.250; both the protein and the ligand structures must be complete.
• Quality of the binding data: binding data are given as Kd or Ki, with 2.0 < -log Kd < 12.0 (i.e., Kd between 10 mM and 1 pM); the binding data cannot be an estimated value; the protein and the ligand used in the binding assay must match exactly the ones observed in the crystal structure.
• Nature of the complex: binding must be non-covalent; the complex must be binary; ligand MW < 1000; the ligand must not contain B, Be, Si, or metal atoms.
In v.2011, a total of 2,476 protein-ligand complexes were selected into the refined set, accounting for 41% of all protein-ligand complexes with known binding data.
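The selection criteria above amount to a conjunction of simple tests. The following Python sketch shows one way to express them; it is not the PDBbind implementation, and the `ComplexRecord` type and all of its field names are hypothetical. It assumes that each property (completeness, covalent binding, metal content, and so on) has already been determined upstream.

```python
from dataclasses import dataclass, replace

EXCLUDED_ELEMENTS = frozenset({"B", "Be", "Si"})


@dataclass(frozen=True)
class ComplexRecord:
    """Hypothetical per-complex record; every field name is an assumption."""
    resolution: float              # crystal structure resolution, in angstroms
    r_factor: float
    structure_complete: bool       # protein and ligand both complete
    affinity_type: str             # "Kd", "Ki", or "IC50"
    p_affinity: float              # -log10 of the affinity in mol/L
    is_estimate: bool              # True for estimated (inexact) values
    assay_matches_structure: bool  # same protein and ligand as in the crystal
    is_covalent: bool
    is_binary: bool
    ligand_mw: float
    ligand_elements: frozenset     # element symbols present in the ligand
    ligand_has_metal: bool


def in_refined_set(c: ComplexRecord) -> bool:
    structure_ok = (c.resolution < 2.5 and c.r_factor < 0.250
                    and c.structure_complete)
    data_ok = (c.affinity_type in ("Kd", "Ki")
               and 2.0 < c.p_affinity < 12.0
               and not c.is_estimate
               and c.assay_matches_structure)
    nature_ok = (not c.is_covalent and c.is_binary
                 and c.ligand_mw < 1000
                 and not (c.ligand_elements & EXCLUDED_ELEMENTS)
                 and not c.ligand_has_metal)
    return structure_ok and data_ok and nature_ok


example = ComplexRecord(
    resolution=1.8, r_factor=0.19, structure_complete=True,
    affinity_type="Ki", p_affinity=7.5, is_estimate=False,
    assay_matches_structure=True, is_covalent=False, is_binary=True,
    ligand_mw=350.0, ligand_elements=frozenset({"C", "N", "O"}),
    ligand_has_metal=False,
)
print(in_refined_set(example))
print(in_refined_set(replace(example, affinity_type="IC50")))
```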
Selection of the Core Set
In v.2011, the core set consists of 243 protein-ligand complexes in 81 families. In the future, the core set will be kept under 300 complexes (100 families).
The refined set (2,476) → Clustering → Selection → The core set (243)
Selection of the Core Set
Methods
The core set is selected to provide a representative, non-redundant sampling of the refined set, so that it serves better as a benchmark for validating docking/scoring approaches.
1. Clustering: Group the protein-ligand complexes in the refined set into families by protein sequence similarity (cutoff = 90%).
2. Selection of clusters: Only consider the families that have at least 5 members. The highest binding affinity in each valid family must be at least 100-fold higher than the lowest binding affinity.
3. Selection of representatives: For each remaining family, select the complex with the highest binding affinity (the “topper”), the lowest binding affinity (the “lower”), and the one closest to the mean value (the “middler”) as the representatives of this family.
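Steps 2 and 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the clustering of step 1 is taken as already done, affinities are given on the -log scale (so a 100-fold difference is 2.0 log units), the mean is computed on that log scale, and the "middler" is chosen from the members other than the "topper" and "lower". The function name and input layout are hypothetical.

```python
from statistics import mean


def select_core_set(families, min_members=5, min_log_range=2.0):
    """Pick (topper, middler, lower) per family.

    families: {family_id: [(pdb_code, p_affinity), ...]}, already clustered
    by sequence similarity. p_affinity is -log10(Kd or Ki in mol/L).
    """
    core = {}
    for family_id, members in families.items():
        if len(members) < min_members:
            continue  # step 2: at least 5 members
        ranked = sorted(members, key=lambda m: m[1])
        lower, topper = ranked[0], ranked[-1]
        if topper[1] - lower[1] < min_log_range:
            continue  # step 2: at least a 100-fold affinity range
        average = mean(p for _, p in members)
        # Step 3: the member closest to the mean, excluding the two extremes
        # (the exclusion is an assumption).
        middler = min(ranked[1:-1], key=lambda m: abs(m[1] - average))
        core[family_id] = (topper[0], middler[0], lower[0])
    return core


toy_families = {
    "family_1": [("a", 4.9), ("b", 5.9), ("c", 6.7), ("d", 8.7), ("e", 7.2)],
    "too_small": [("x", 5.0), ("y", 9.0)],
}
print(select_core_set(toy_families))
```

Taking three representatives per family is consistent with the v.2011 figures of 81 families and 243 complexes.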
Possible Applications of the PDBbind Database
According to our literature survey, 30~40 applications of the PDBbind database are published each year.
Provide high-quality data sets for theoretical and computational studies on molecular recognition
– Binding data available for protein-ligand, protein-protein, and protein-nucleic acid complexes
– Specially compiled “refined set” and “core set”
Provide useful clues to medicinal chemists and other researchers for the discovery of bioactive small-molecule compounds or potential targets
A known target
• What ligands bind to it?
• What do high-affinity ligands look like? What do low-affinity ligands look like?
Critical chemical moieties (pharmacophore)
• Might these chemical moieties interact with other proteins (new targets or side effects)?
To Build an Integrated Platform for Data Mining
[Diagram. Data compilation: Protein Data Bank, 3D structures, binding affinity data, protein-ligand & nucleic acid-ligand complexes, complex families. Data mining tools: docking, binding site analysis, scoring. Other elements: pharmacophore models, chemical databases, useful hits, pharmaceutical implications.]
Answer to FAQ
Why does PDBbind not provide experimental details in addition to the binding data?
• Such information is not always given in the reference.
• Retrieving such information takes considerable extra effort, and it is difficult to format.
• The users can always check the original reference if they really need such information.
What is the difference between PDBbind and Binding MOAD?
Binding MOAD also collects the binding data of protein-ligand complexes and is likewise based on a systematic mining of the Protein Data Bank.
Thus, the contents of Binding MOAD overlap with part of PDBbind, and these two databases are similar in some technical aspects.
Binding MOAD (Mother Of All Databases) was independently developed by Prof. Heather Carlson’s group at the University of Michigan, and was released to the public in 2005.
Proteins: Structure, Function, and Bioinformatics, Volume 60, Issue 3, pages 333–340, 15 August 2005.
Summary: Significance of the PDBbind Database
– More binding data: The latest version provides binding data for ~8,000 complexes
• Systematic mining of the entire PDB
• Covering all major categories of biomolecular complexes, not only selected protein-ligand complexes
– Better in quality
• Reasonable classification of biomolecular complexes
• Binding affinity data carefully collected from original references
– Regularly updated since the first public release in 2004. Binding data increase by 20~30% each year.
– Widely popular: User-friendly web interface; over 1,600 registered users from some 40 countries across the world.
Acknowledgments
Thanks to the following contributors in my group: Liu, Z.; Li, J.; Li, Y.; Li, X.; Lin, F.
Special thanks to Prof. Shaomeng Wang and his group at the University of Michigan!
The National Natural Science Foundation of China (NSFC), the Ministry of Science and Technology of China (MOST), and the Chinese Academy of Sciences (CAS).
Why Create the PDBbind Database?
Protein-small molecule binding
Protein-nucleic acid binding
Protein-protein binding
Recognition and interactions between various types of molecules are essential at the molecular level for various biological processes.
As a matter of fact, most PDB entries contain multiple heterogen molecules in addition to the primary molecule (protein or nucleic acid).
Is this a meaningful protein-ligand complex?
Difficulty in Complex Classification
Difficulty in Complex Classification
What are classified as “valid” small-molecule ligand molecules:
• “Regular” organic molecules
• Oligo-peptides containing < 10 amino acid residues
• Oligo-nucleic acids containing < 4 nucleotides
What are classified as “special” ligand molecules:
• Cofactors/coenzymes: CoA, NAD, FAD, Heme & their derivatives
What are classified as “junk” molecules:
• Inorganic species
• Organic solvents and buffer components
• Saccharide molecules with high occurrences
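The three ligand classes can be illustrated with a small lookup-plus-threshold function. This is a hedged Python sketch, not the PDBbind classifier: the function name and arguments are hypothetical, the residue codes in the two sets are only illustrative examples of cofactors and of common solvent, buffer, ion, and saccharide components, and cofactor derivatives are not handled.

```python
# Illustrative code sets only; the actual lists used by PDBbind are not
# given in the slides.
SPECIAL_CODES = {"COA", "NAD", "FAD", "HEM"}   # cofactors/coenzymes
JUNK_CODES = {"SO4", "GOL", "NAG"}             # ion, solvent, saccharide


def classify_molecule(code, kind="organic", length=1):
    """Classify a non-polymer or short-polymer molecule.

    kind:   "organic", "peptide", or "nucleic"
    length: number of residues (peptide) or nucleotides (nucleic)
    """
    if code in SPECIAL_CODES:
        return "special"
    if code in JUNK_CODES:
        return "junk"
    if kind == "peptide":
        return "valid" if length < 10 else "macromolecule"
    if kind == "nucleic":
        return "valid" if length < 4 else "macromolecule"
    return "valid"


print(classify_molecule("NAD"))
print(classify_molecule("PEP", kind="peptide", length=9))
```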
Difficulty in Complex Classification
Is this a protein-protein complex or a protein-ligand complex?
A complex may be classified into more than one category.
Protein A Protein B
Small-molecule ligand
The PDBbind database now has >1,600 registered users all over the world.
Univ. Michigan: www.pdbbind.org
Shanghai Inst. Org. Chem.: www.pdbbind-cn.org
Process the Complex Structures
1. Split a complete “biological unit” of each complex into a protein molecule and a ligand molecule.
2. Save the protein molecule in the PDB format.
– Remove redundant structural units
– Add hydrogen atoms
– Keep the water and metal ions with the protein
3. Save the ligand molecule in the Mol2 format.
– Correct atom types and bond types
– Add hydrogen atoms and partial charges
– Handle tautomers correctly
These processed structural files can be readily utilized by most molecular modeling software.
My Scoring Function Tripod
The PDBbind-CN Database (a database of three-dimensional structures and binding affinities of protein-ligand complexes)
Scoring Function Development
Scoring Function Assessment
(1) J. Comput.-Aided Mol. Des. 2002, 16, 11-26.
(2) J. Med. Chem. 2003, 46, 2287-2303.
(3) J. Med. Chem. 2004, 47, 2977-2980.
(4) J. Chem. Inf. Comput. Sci. 2004, 44, 2114-2125.
(5) J. Med. Chem. 2005, 48, 4111-4119.
(6) Proteins, 2006, 64, 1058-1068.
(7) J. Chem. Theory Comput. 2008, 4, 1959–1973.
(8) J. Chem. Inf. Model. 2009, 49, 1079-1093.
(9) J. Chem. Inf. Model. 2009, 49, 1033-1048.
(10) Mol. Informatics, 2010, 29, 87-96.
(11) J. Comp. Chem. 2010, 31, 2109-2125.
(12) BMC Bioinformatics, 2010, 11, 193-208.
(13) J. Chem. Theory Comput. 2010, 6, 1852-1870.
(14) J. Chem. Inf. Model., 2010, 50, 682–1692.
Some CDK-2 inhibitors recorded in PDBbind
1E1V: Ki = 12000 nM
1E1X: Ki = 1300 nM
1PXP: Ki = 220 nM
1PXO: Ki = 2.0 nM
1PXN: Ki = 70 nM
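Binding constants such as these are usually compared on a negative logarithmic scale, the same scale used for the 2.0 to 12.0 window in the refined-set criteria. A short Python sketch of the conversion, applied to the five Ki values listed for CDK-2 (the helper name and the unit table are illustrative):

```python
import math

# Conversion factors from common concentration units to mol/L.
UNIT_TO_MOLAR = {"mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def p_affinity(value, unit="nM"):
    """Return -log10 of a Kd or Ki given in the stated unit."""
    return -math.log10(value * UNIT_TO_MOLAR[unit])


cdk2_ki_nm = {"1E1V": 12000, "1E1X": 1300, "1PXP": 220, "1PXO": 2.0, "1PXN": 70}
for code, ki in cdk2_ki_nm.items():
    print(f"{code}: Ki = {ki} nM, -log Ki = {p_affinity(ki):.2f}")
```

On this scale, 10 mM corresponds to 2.0 and 1 pM to 12.0.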