PPSC version 1.0 User Guidestructure.bmc.lu.se/PPSC/files/User_guide_PPSC.pdf3 3 Functions...
Transcript of PPSC version 1.0 User Guidestructure.bmc.lu.se/PPSC/files/User_guide_PPSC.pdf3 3 Functions...
PPSC version 1.0
User Guide
Index
1 Introduction.................................................................................................................... 1
2 Runtime environment ................................................................................................... 2
3 Functions......................................................................................................................... 3
3.1 Download PDB File ............................................................................................... 3
3.2 Calculate CE ........................................................................................................... 4
3.3 Train model ............................................................................................................. 5
3.4 Predict stability ...................................................................................................... 6
4 Input and output files ................................................................................................... 7
4.1 Formats of input files ............................................................................................ 7
4.2 Output files ............................................................................................................. 9
5 Example .......................................................................................................................... 9
6 Refernces....................................................................................................................... 12
1
1 Introduction
Many amino acid substitutions affect the stability and biological functions of proteins.
Understanding the effects of these alterations facilitates the elucidation of the
molecular basis of many diseases (Thusberg and Vihinen, 2009). Site-directed
mutagenesis has been utilized for decades (Kearns-Jonker et al., 2007, Rajendhran
and Gunasekaran, 2007), but the experimental trail and error-based design and
construction of mutations is time-consuming and expensive, while its success rate in
protein character design and alteration is low. Experimental methods are tedious and
often costly, so computational methods are often used to predict stability changes
caused by amino acid substitutions whether related to analysis of disease related
variations or to design substitutions for protein engineering.
Several methods have been developed to predict protein stability changes based on
information about the primary sequence or protein three-dimensional structure. There
are energy function-based methods and machine learning-based methods. The energy
functions used in these models include: (1) a physical energy function calculated
using ab initio quantum mechanics (QM); (2) an empirical energy function or force
field derived from experimental data (Capriotti, et al., 2004); or (3) a statistical energy
function obtained by the analysis of protein structures. Both ab initio QM and force
field calculations are time-consuming and sensitive to small displacements in protein
structures. Ab initio QM calculations are also impractical for large proteins. Statistical
approaches may provide a similar accuracy to ab initio QM calculation, but their
theoretical foundation is not clear (Lazaridis and Karplus, 2000). Machine
learning-based methods include, for example, those utilizing support vector machines
and neural-networks (Capriotti et al., 2005; Capriotti, Fariselli and Casadio, 2004;
Cheng et al., 2006; Guerois et al., 2002; Shen et al., 2008). Many methods predict just
the sign of ΔΔG. A positive or negative ΔΔG corresponds to an increase or decrease in
the protein stability, respectively. Sequence-based classifiers have been trained and
2
tested using different sequence window lengths (Capriotti, Fariselli, Calabrese and
Casadio, 2005; Cheng, Randall and Baldi, 2006), or in combination with structural
information (Capriotti, et al., 2004). However, the inputs are encoded using the
20-alphabet amino acid code, so the number of parameters rises rapidly to more than
one hundred as a result. Thus, these models may be over-fitted and with consequent
effects on performance when applied to new cases (Khan and Vihinen, 2010). A
database containing experimental ∆∆G values for variations, ProTherm, is available
for just over 2000 cases (Bava et al., 2004; Kumar et al., 2006). The performance of
most published stability predictors was recently tested systematically using known
examples and it was found to be suboptimal with all methods (Khan and Vihinen,
2010).
2 Runtime environment
PPSC is running on Window or Linux Operation System. It can be downloaded from
http://www.ibio-cn.org/softwares/PPSC/index.html (Server in China) or
http://structure.bmc.lu.se/ppsc/ (Mirror in Sweden). Before running the program,
follow the steps.
1. Install Java Runtime Environment (JRE) or Java Development Kit (JDK) version
6.0 later. JDK can be downloaded from
http://www.oracle.com/technetwork/java/javase/downloads/index.html.
2. For Windows: Download the executable file (DSSPCMBI.EXE) from
http://swift.cmbi.ru.nl/gv/dssp/ and copy it to the folder where the jar file is.
For Linux: Download the source code for DSSP from the website
(http://swift.cmbi.ru.nl/gv/dssp/), compile the source code and generate an
executable file (DSSPCMBI) and then copy it to the route where the jar file is;
3. Double click the “PPSC-v1.0.jar” file to run the program. Sometimes firewalls can
cause problems and may need to be adjusted.
3
3 Functions
Prediction of Protein Stability Changes (PPSC) is based on M8 and M47 models
developed to utilize 8 and 47 features, respectively, in a support vector machine
optimized with experimental data. PPSC predicts the effect of protein variation
stability. It has four main panels: download PDB file, calculate contact energy (CE),
train model, and predict stability change. As PPSC is trained with protein structures,
PDB file has to be downloaded. The function of “calculate CE” is to calculate CE
value, because CE value is one of the input attributes. The program can calculate
effects for one or many variants by uploading a batch file. The results of the
calculation are displayed as tables or image files. The function of “train model” can be
used to train a new SVM model with user defined dataset.
3.1 Download PDB File
All PPSC functions are based on PDB files of proteins, which are downloaded from
the Protein Data Bank (http://www.pdb.org/pdb/home/home.do) or from user defined
site. There are two input options: a single PDB id or TXT file containing a list of PDB
ids one at a line (Fig 3.1).
The TXT file format is:
PDB-id1
PDB-id2
….
For example:
1B55
1VQB
1A23
4
Fig 3.1 “download PDB File” panel
3.2 Calculate CE
PPSC contains two models to predict protein stability change upon substitution.
Model M8 requires 8 input attributes and M47 requires 47. The attribute dCE, which
was defined as the difference in total contact energy between the wild-type protein
and the variant protein, has to be calculated separately. A coarse-grained model (Shen
and Vihinen, 2003) is implemented in “Calculate CE” panel (Fig 2.2).
There are two input ways:
1. Input a single PDB id, variation position and new residue.
2. Input a txt file containing a list of the information. The file format is :
PDB-id1 Variation1 pH1 Temperature 1 ddG1
PDB-id2 Variation2 pH2 Temperature 2 ddG2
….
Usually the ddG values are omitted. However, if you want to train a novel predictor
with cases with known effect the values should be provided. The default values for pH
and temperature are 7 and 25 and should only be changed if training a new method.
5
For example:
1AAR K6E 5 25 0.53
1AAR K6Q 5 25 0.26
1AAR F45W 5 25 0.32
The format of the results is either a list file or a SVM file, or both. SVM file is used to
train new method and can be written only if ddG value is available.
Clicking “Results and running status box” shows the result in both a table and a
histogram view. They can be saved to Excel or txt files with mouse right-click.
Fig 3.2 “Calculate CE” panel
3.3 Train model
This option is used for training new SVM model. For that purpose a relatively large
set of experimentally studied cases with ddG values is needed.
As shown in Fig3.3, the input file containing training data should be in the following
format:
PDB-id1 Variation1 pH1 Temperature 1 ddG1
PDB-id2 Variation2 pH2 Temperature 2 ddG2
6
….
For example:
1AAR K6E 5 25 0.53
1AAR K6Q 5 25 0.26
Set a route for saving the result as a “*.model” file. The default directory is
“\PPSC-v1.0_jar\libscm-model\”. Then parameters for the SVM should be provided.
Consult LIBSVM (www.csie.ntu.edu.tw/~cjlin/libsvm) for possible options.
Fig 3.3 “Train model” panel
3.4 Predict stability
By clicking “Predict stability” the effect on stability and the ddG are calculated.
Select M8 or M47 model or both. Choose either "Predict stability" or "Predict ddG"
function.
Then, we choose either input single variation information or input a list of variations.
The single variation information contains PDB id, variation position, new residue, as
well as temperature and PH values.
The list file containing multiple variations is in the following format:
7
PDB-id1 Variation1 pH1 Temperature1
PDB-id2 Variation2 pH2 Temperature2
….
For example:
1AAR K6E 5 25
1AAR K6Q 5 25
1AAR F45W 5 25
Note that M47 model produces more reliable results than M8 model.
Fig 3.4 “Predict stability” panel
4 Input and output files
4.1 Formats of input files
1. The format of PDB id list file to input in “Download PDB file” panel is:
8
PDB-id
For example:
1B55
1VQB
1A23
2. The format of list file in “Calculate CE” panel is :
PDB-id1 Variation1 pH1 Temperature 1 ddG1
PDB-id2 Variation2 pH2 Temperature 2 ddG2
….
The ddG value should be omitted, unless you want to get the svm-file to obtain a new
predictor. The default pH and temperature values are 7 and 25, adjust only if
necessary.
For example:
1AAR K6E 5 25 0.53
1AAR K6Q 5 25 0.26
1AAR F45W 5 25 0.32
3. The format of input file in “Train model” panel is:
PDB-id1 Variation1 pH1 Temperature 1 ddG1
PDB-id2 Variation2 pH2 Temperature 2 ddG2
….
For example:
1AAR K6E 5 25 0.53
1AAR K6Q 5 25 0.26
4. The format of input file in “Predict stability” panel is:
PDB-id1 Variation1 pH1 Temperature1
PDB-id2 Variation 2 pH2 Temperature2
9
….
For example:
1AAR K6E 5 25
1AAR K6Q 5 25
1AAR F45W 5 25
4.2 Output files
The results shown in the four main panels are visual in table and histogram views. All
tables can be saved as either txt or Excel files, and all histograms can be saved as JPG
files by right clicking the mouse.
5 Example
This part shows how to use PPSC with a brief example. The aim is to calculate CE
and predict stability. The data are listed in a file named "example.txt" with the content
as follows:
1LZ1 I89A 2.7 64.9
1LZ1 V121A 2.7 64.9
1VQB C33I 7 25
1BNI A32Y 6.3 25
1CYO V61Y 7 67.2
The file meets all format requirements in 3 main panels but the “Train model”.
Double click the “PPSC-v1.0.jar” file to start the program:
Step 1, download PDB file in” Download PDB file” panel (Fig 5.1). We use
“example.txt” as input and you can find “OK” in status column when all PDB files are
fully downloaded.
10
Fig 5.1 Download PDB file
Step 2, calculate CE using “example.txt” and the CE result is shown in red in the table,
see Fig 5.2.
Fig 5.2 Calculate CE
Step3, predict stability. We used both “M8” and “M47” to predict stability. Input the
“example.txt” as a list of variants and get the result in Fig 5.3.
11
Fig 5.3 Predict stability of the example by M8 and M47
Or, when choosing to predict ddG and the result save as in Fig 5.4 and 5.5.
Fig 5.4 Right click to save the result of “Predict ddG” using M8 and M47 as Excel and txt file
12
Fig 5.5 Right click to save the result image
6 References
Alberty RA. 1969. Standard Gibbs free energy, enthalpy, and entropy changes as a
function of pH and pMg for several reactions involving adenosine phosphates.
J Biol Chem 244:3290-302.
Capriotti E, Fariselli P, Calabrese R, Casadio R. 2005. Predicting protein stability
changes from sequences using support vector machines. Bioinformatics 21
Suppl 2:ii54-8.
Capriotti E, Fariselli P, Casadio R. 2004. A neural-network-based method for
predicting protein stability changes upon single point mutations.
Bioinformatics 20 Suppl 1:i63-8.
Chih-Chung Chang, Lin C-J. 2001. LIBSVM: a library for support vector machines.
Collantes ER, Dunn WJ, 3rd. 1995. Amino acid side chain descriptors for quantitative
structure-activity relationship studies of peptide analogues. J Med Chem
38:2705-13.
Eisenberg D, Schwarz E, Komaromy M, Wall R. 1984. Analysis of membrane and
13
surface protein sequences with the hydrophobic moment plot. J Mol Biol
179:125-42.
Ferrer-Costa C, Orozco M, de la Cruz X. 2002. Characterization of disease-associated
single amino acid polymorphisms in terms of sequence and structure
properties. J Mol Biol 315:771-86.
Keerthi SS, Lin C-J. 2003. Asymptotic behaviors of support vector machines with
Gaussian kernel. Neural Computation 15:1667-89.
Khan S, Vihinen M. 2010. Performance of protein stability predictors. Hum Mutat
Khatun J, Khare SD, Dokholyan NV. 2004. Can contact potentials reliably predict
stability of proteins? J Mol Biol 336:1223-38.
Lazaridis T, Karplus M. 2000. Effective energy functions for protein structure
prediction. Curr Opin Struct Biol 10:139-45.
Lin H-T, Lin C-J. 2003. A study on sigmoid kernels for SVM and the training of
non-PSD kernels by SMO-type methods. Department of Computer Science,
National Taiwan University.
Matthews BW. 1975. Comparison of the predicted and observed secondary structure
of T4 phage lysozyme. Biochim Biophys Acta 405:442-51.
Shen B, Bai J, Vihinen M. 2008. Physicochemical feature-based classification of
amino acid mutations. Protein Eng Des Sel 21:37-44.
Shen B, Vihinen M. 2003. RankViaContact: ranking and visualization of amino acid
contacts. Bioinformatics 19:2161-2.
Thusberg J, Vihinen M. 2009. Pathogenic or not? And if so, then how? Studying the
effects of missense mutations using bioinformatics methods. Hum Mutat
30:703-14.
Zamyatnin AA. 1972. Protein volume in solution. Prog Biophys Mol Biol 24:107-23.