
Using Regression Analyses

For the Determination of Protein

Structure

From FTIR Spectra

A thesis submitted to

the University of Manchester for the degree of PhD

in the Faculty of Life Sciences

2014

Kieaibi Ellis Wilcox

CONTENTS


Table of Contents

Table of Contents ....................................................................................... 2

Table of Figures .......................................................................................... 6

List of Tables ............................................................................................ 14

List of Figures and Tables in Appendices .............................................. 16

Abbreviations ........................................................................................... 18

Abstract ..................................................................................................... 19

Declaration ................................................................................................ 20

Copyright Statement ................................................................................ 21

Acknowledgement .................................................................................... 22

1 Introduction ....................................................................................... 23

1.1 Protein Structure ........................................................................... 23

1.1.1 Primary structure .................................................................. 24

1.1.2 Secondary Structure .............................................................. 25

1.1.3 Tertiary structure .................................................................. 28

1.1.4 Quaternary Structure ............................................................ 29

1.1.5 Protein Fold ........................................................................... 30

1.2 Experimental Determination of Protein Structures ...................... 31

1.2.1 Vibrational Spectroscopy ...................................................... 32

1.2.2 Fourier Transformed Infrared (FTIR) ..................................... 32

1.3 Computational Analysis Theory .................................................... 38

1.3.1 Multivariate Analysis ............................................................. 39

1.3.2 Data Pre-processing .............................................................. 45

1.3.3 Validation of the model ........................................................ 52

1.3.4 Determination of the unknown ............................................ 53

1.4 Aim of this research ...................................................................... 53

1.4.1 Research Objectives .............................................................. 54


2 Experimental Method: ATR-FTIR spectroscopy .......................... 56

2.1 Introduction .................................................................................. 56

2.2 Sample preparation ...................................................................... 57

2.3 Spectral measurement and Processing ......................................... 57

2.3.1 Baseline Correction and Atmospheric Correction ................ 58

2.3.2 ATR correction ...................................................................... 58

2.4 Data Pre-processing for Multivariate Analysis .............................. 60

2.4.1 Outlier Detection .................................................................. 62

2.4.2 Centring and Scaling ............................................................. 64

2.4.3 Multiplicative Scattered Correction (MSC) ........................... 65

2.4.4 Standard Normal Variate (SNV) ............................................ 67

2.4.5 2nd Derivative ....................................................................... 68

2.5 Discussions .................................................................................... 70

3 PCA Protein Structure Classification ............................................. 73

3.1 Introduction .................................................................................. 73

3.2 Dataset used for this method ....................................................... 75

3.3 PCA Model ..................................................................................... 75

3.4 Model Validation ........................................................................... 77

3.5 Results and Discussion .................................................................. 78

3.6 Conclusion ..................................................................................... 94

4 PLS Regression for Protein Secondary Structure Determination ..... 97

4.1 Introduction .................................................................................. 97

4.2 Protein dataset used for this method ........................................... 99

4.2.1 Data Selection Scheme ........................................................ 100

4.3 PLS Modelling .............................................................................. 101

4.4 Results and Discussions .............................................................. 103

4.4.1 Model Validation ................................................................. 106

4.4.2 Cross-Validation .................................................................. 106


4.4.3 Independent Test Set .......................................................... 108

4.4.4 Coefficient of Determination (R²) ....................................... 109

4.4.5 PCR ...................................................................................... 110

4.4.6 PLS with 2nd Derivative ........................................................ 111

4.4.7 Zeta scores ζ ....................................................................... 115

4.4.8 PCR/PLS comparison ........................................................... 116

4.4.9 Results for Proteins in D2O .................................................. 119

4.5 Other PLS Applications ................................................................ 122

4.5.1 iPLS Method ........................................................................ 123

4.6 Results and Discussion ................................................................ 128

4.6.1 α-helix ................................................................................. 128

4.6.2 β-sheet ................................................................................ 130

4.7 Validation Set .............................................................................. 131

4.8 Conclusion ................................................................................... 135

5 Multivariate analysis of pH effect on α-lactalbumin ................... 137

5.1 Introduction ................................................................................ 137

5.2 Sample preparation .................................................................... 139

5.3 ATR-FTIR Spectroscopy ............................................................... 140

5.4 PLS and PCA Modelling Technique .............................................. 140

5.5 Results and Discussion ................................................................ 142

5.5.1 Inspection of the Amide I Spectral Plots ............................. 142

5.5.2 Amide I PLS Results ............................................................. 146

5.5.3 Inspection of the Amide II Spectral Plots ............................ 149

5.5.4 Amide II PLS Results ............................................................ 150

5.5.5 Inspection of the Amide III Spectral Plots ........................... 151

5.5.6 Amide III PLS Results ........................................................... 154

5.5.7 PCA of α-Lactalbumin Spectra ............................................ 155

5.5.8 Comparison with other methods ........................................ 158



5.6 Conclusion ................................................................................... 160

6 Limit of Quantitation of Protein Structure Using PLS ............... 163

6.1 Introduction ................................................................................ 163

6.2 Sample preparation .................................................................... 165

6.3 FTIR Spectroscopy ....................................................................... 165

6.4 PLS Model.................................................................................... 166

6.5 Results and discussion ................................................................ 167

6.6 Conclusion ................................................................................... 180

7 Conclusion ....................................................................................... 182

8 References ........................................................................................ 188

Appendix A ............................................................................................. 209

Appendix B ............................................................................................. 213

Appendix C ............................................................................................. 225

Appendix D ............................................................................................. 230

46,795 words


Table of Figures

Figure 1.1. Diagrammatic representation of the peptide bond between amino acid residues and the backbone torsion angles that are formed: phi (Φ) about the N-Cα bond, psi (Ψ) about the Cα-C bond, and omega (Ω) about the C-N bond. .......... 25

Figure 1.2. This figure shows α-helix, 3₁₀-helix, and π-helix formation. The dotted red line shows the bonding pattern that can occur between residue i and residues i+3, i+4, and i+5 .......... 26

Figure 1.3. Parallel and anti-parallel β-sheet formation ........................ 28

Figure 1.4. Electromagnetic spectrum. Frequency (ν) = speed / wavelength (λ). The mid-IR region of the spectrum is the wavelength region between 3×10⁻⁴ and 3×10⁻³ cm, or about 4000 – 400 cm⁻¹ in wavenumbers .......... 33

Figure 1.5. A general schematic of an FT-IR spectrometer. .................. 35

Figure 1.6. This diagram illustrates the evanescent wave reaching

into the sample at 45° angle. This angle is called the

angle of incidence. ............................................................... 37

Figure 1.7. Diagrammatic representation of the generalized multivariate equation represented in eq. 1.3. ‘Y’ is predicted on the basis of ‘X’ in such a way that the residual, ‘ε’, is minimized. .......... 41

Figure 1.8a-c. Figure 1.8a shows the original data points. Figure 1.8b shows the centred data, which are the individual points minus the mean of the data; this puts the new mean at 0, making some of the data negative. Figure 1.8c shows the scaled data: the data look shortened and broadened in an attempt to decrease group distance. This results in a zero mean and unit variance for each descriptor variable. .......... 47

Figure 1.9. Outlier plot showing a red dot as an outlier, which

clearly does not correlate with the rest of the data. ............. 50

Figure 2.1a. Blue spectrum of fibrinogen is ATR-corrected to attain

the correct Amide I:Amide II band ratio of its

transmission counterpart. ..................................................... 59

Figure 2.1b. Amide I – Amide II transmission band ratio is not totally

attained by the ATR corrected carbonic anhydrase

spectrum (in Blue). .............................................................. 59

Figure 2.2a. ATR-FTIR Spectra of the 90 proteins in H2O solution

(spectra of the Amide I and Amide II region from

different instruments) before scaling, removal of outliers

and second derivative. ......................................................... 61



Figure 2.2b. ATR-FTIR Spectra of the 90 proteins in H2O solution (spectra of the Amide I and Amide II region from different instruments) after normalization but before outlier detection. .......... 62

Figure 2.3a. Outlier detection using Mahalanobis distance. Plot shows D-glucosamine (sugar) and Spectrin as outliers. The mean of each spectrum and the standard deviation of each sample are plotted on the x-axis and y-axis, respectively. .......... 63

Figure 2.3b. Distribution of protein samples after outliers were taken

out. ....................................................................................... 64

Figure 2.4. 84 scaled protein spectra (Amide I region) measured in

H2O ...................................................................................... 65

Figure 2.5a. MSC pre-processing performed on the 84 raw spectra, showing a reduction in spectral intensities. Spectra are pulled towards the raw spectrum with the mean value. .......... 66

Figure 2.5b. MSC pre-processing performed on normalized spectra.

Spectra are pulled towards the normalized mean sample. ... 66

Figure 2.6a. SNV pre-processing performed on 84 protein spectra of

raw protein samples ............................................................. 67

Figure 2.6b. SNV pre-processing done on normalized 84 protein

spectra .................................................................................. 68

Figure 2.7. Second derivative on the normalized data of the ATR-

FTIR spectra of 84 proteins collected in H2O ..................... 69

Figure 3.1. PC plot of scaled data for the Amide I region of 84

spectra showing percentage variance of the first 4

principal components. The blue line in the plot indicates

the cumulative percentage variance for all 4 PCs. .............. 78

Figure 3.2. PCA scores plot for Amide I region (1600 cm⁻¹ – 1700 cm⁻¹) of ATR-FTIR spectra of 84 proteins in H2O (PC1xPC2). Plot shows the clustering of scores. Blue dots are primarily α proteins, red dots are primarily β proteins, green dots indicate proteins of other types .......... 80

Figure 3.3. Loading plot of the Amide I (1600 cm⁻¹ – 1700 cm⁻¹) region for spectra of 84 proteins in H2O. This plot shows the clustering of scores. Wavenumbers of spectral signals indicate the discriminating features associated with clustering of scores in Figure 3.2. .......... 81

Figure 3.4. PC1, PC2, and PC3 loadings plot of Amide I (1600 cm⁻¹ – 1700 cm⁻¹) region for spectra of 84 proteins in H2O (PC1xPC2). The wavenumbers listed indicate the discriminating features associated with clustering of scores in Figure 3.2. Both Figure 3.3 and Figure 3.4 were combined in the analysis of the scores plot (Figure 3.2). .......... 81


Figure 3.5. PCA scores of ATR-FTIR spectra of proteins in H2O. Plot highlights the clusters of mixed proteins; αβ proteins (multi-domain) as green dots, α+β proteins (mainly anti-parallel) as red stars, α/β proteins (mainly parallel) as black stars. The little unlabelled blue and red dots are α-helix and β-sheet proteins highlighted in Figure 3.2 .......... 84

Figure 3.6. Scaled ATR-FTIR spectra of the Amide I region for ribonuclease S in red (which contains mainly anti-parallel β-sheet), transketolase in blue (mainly parallel β-sheet), and lactic dehydrogenase in green (contains both parallel and anti-parallel β-sheet) .......... 85

Figure 3.7. PC2 loading spectra plot of the ATR-FTIR Amide I

region for 84 proteins in H2O. ............................................. 86

Figure 3.8. Biplot (PC scores and loadings) of Amide I region; blue

lines show the length of each loading and their

directions. Red dots are the protein sample scores. ............ 88

Figure 3.9. PC plot of Scaled Amide II region of 84 spectra showing

percentage variance of the first 4 principal components.

The blue curve indicates the cumulative variation of all 4

components .......................................................................... 90

Figure 3.10. Scores plot for Amide II region (1480 cm⁻¹ – 1600 cm⁻¹) of spectra of proteins in H2O (PC1xPC2). Plot shows the clustering of scores. Blue dots are primarily α-helix rich proteins, red dots are primarily β-sheet rich proteins, and green dots indicate proteins of other structural types .......... 92

Figure 3.11. Loading plot of the Amide II (1480 – 1600 cm⁻¹) region of spectra for 84 proteins in H2O (PC1xPC2). Plot shows the clustering of scores. Wavenumbers in the plot, especially the extreme right and left numbers, indicate the discriminating wavenumbers associated with clustering of scores in Figure 3.10. .......... 93

Figure 4.1. Schematic representation of samples used in the training

set and the test set after application of the Kennard Stone

selection algorithm. Matrix X represents the FTIR

spectra while vector ‘y’ represents the percentage

fraction of structure determined by DSSP analysis from

X-ray crystallographic data. This figure represents the

strategy for the 84 protein samples measured in H2O. ........ 101

Figure 4.2. Explained Y variance of α-helix for Amide I region;

explained variance is about 96% for 10 components. ......... 104

Figure 4.3. Plot shows Root Mean Square Error (RMSE) vs. number

of components for α-helix of the Amide I region. 10

components gave low root mean square error and was

chosen to fit the data. ........................................................... 104

Figure 4.4. Explained Y variance of β-sheet for the Amide I region

is about 95 % for 10 components. ....................................... 105



Figure 4.5. Root Mean Square Error for β-sheet analysis for the

Amide I region. .................................................................... 106

Figure 4.6. CV cross-validation: 4 PLS components only are

required to fit the α-helix model for proteins in H2O for

the Amide I region. .............................................................. 108

Figure 4.7. Residual plot of the 60 samples in the Amide I region.

Error levels between the fit and the individual data are

shown for each sample. Note: The residuals are not

sorted by protein but by the order in which they were

analysed. .............................................................................. 110

Figure 4.8. Performance of cross-validation (CV): 4 PLS components are optimal for PLS-CV while 5 components are optimal for PCR-CV. .......... 111

Figure 4.9. The plot of the calibration set for the analysis of α-helix content from the Amide I region: this plot shows PLS fitted vs response (X-ray crystallography values) using the FTIR spectra of 60 proteins in H2O for calibration. The red line of fit going through the origin emphasises the positive correlation in the data. .......... 113

Figure 4.10. Test set plot for analysis of α-helix from the Amide I data:

PLS fitted vs response on 44 proteins in H2O used for

testing the model. The model shows the relationship

between the crystal structure values and the calculated

values of α-helix from FTIR spectra for these proteins. ..... 113

Figure 4.11. Plot shows PLS prediction for β-sheet vs. X-ray crystallography values used in the training of the Amide I region. The training set had FTIR spectra of 60 proteins in H2O. The red line of fit going through the origin emphasises the positive correlation in the data. .......... 114

Figure 4.12. Test set plot of Amide I β-sheet: PLS fitted vs response

on 44 proteins in H2O used for testing the model. .............. 114

Figure 4.13. Plot of RMSECV vs. PLS component for the global

model. Component number 2 has the lowest RMSECV. ... 124

Figure 4.14a. Each bar in the plot is a spectral interval. The plot shows interval numbers 1-7; interval number 5 corresponds to frequency factors in the range 1641 cm⁻¹ – 1628 cm⁻¹ for α-helix. The dotted line represents the RMSECV for the global iPLS model. Numbers in italics in each interval bar signify the number of iPLS components used in fitting each local model. See Table 4.3 for intervals and wavenumber ranges. The curve in the background is a spectrum of the Amide I region. .......... 125

Figure 4.14b. Plot shows interval number 6 (6th bar) corresponding to frequency factors in the range 1614–1627 cm⁻¹ for β-sheet (see Table 4.3 for the actual beginning and ending wavenumbers of each interval). The dotted line represents the RMSECV for the global iPLS model. Numbers in italics in each interval bar signify the number of iPLS components used in fitting each local model. The curve in the background is a spectrum of the Amide I region. .......... 126

Figure 4.15a. Plot of the normalized Amide I spectral region showing the region (grey bar) used for iPLS analysis of α-helix structure. The grey bar contains the most variation for α-helix structure. This region (1641–1628 cm⁻¹) corresponds to interval number 5 from Figure 4.14a. .......... 127

Figure 4.15b. Plot of the normalized spectral region of Amide I that contains the most significant region (grey bar) for β-sheet structure. The grey bar contains the most variation for β-sheet structure. This region (1627 cm⁻¹ – 1614 cm⁻¹) was assigned to interval number 6 from Figure 4.14b. .......... 127

Figure 4.16a. Prediction line of fit for α-helix from the iPLS global model in the Amide I region (1700 – 1600 cm⁻¹). 3 iPLS components were used for building the model. The numbers in the plot represent the different proteins in the dataset which are listed in Table 4.4 .......... 129

Figure 4.16b. Prediction line of fit for α-helix from the iPLS local model in the Amide I region (1641 cm⁻¹ – 1628 cm⁻¹). 3 PLS components were used for building the model. The numbers in the plot are the different proteins in the dataset which are named in Table 4.4. .......... 129

Figure 4.17a. Prediction line of fit for β-sheet from the iPLS global model in the Amide I region (1700 cm⁻¹ – 1600 cm⁻¹). 3 iPLS components were used for building the model. The numbers in the plot are the different proteins in the dataset which are named in Table 4.4. .......... 130

Figure 4.17b. Prediction line of fit for β-sheet from the iPLS local model in the Amide I region (1627 cm⁻¹ – 1614 cm⁻¹). 3 iPLS components were used for building the model from the 6th spectral interval. The numbers in the plot are the different proteins in the dataset which are named in Table 4.4. .......... 131

Figure 4.18a. Plot of local model line of fit for the prediction of α-helix in Amide I (1641 cm⁻¹ – 1628 cm⁻¹). 3 PLS components for the 5th spectral interval were used for the prediction of proteins in the test set. The numbers in the plot are the different proteins in the dataset which are named in Table 4.4 .......... 132

Figure 4.18b. Plot of local model line of fit for the prediction of β-sheet in Amide I (1627 cm⁻¹ – 1614 cm⁻¹). 3 PLS components for the 6th spectral interval were used for the prediction of proteins in the test set. The numbers in the plot are the different proteins in the dataset which are named in Table 4.4. .......... 132

Figure 5.1. The diagram on the left of α-LA (PDB ID: 1HFZ) shows the native structure of α-LA, the intermediate unfolded state is shown in the middle and the unfolded state is represented on the right. The PDB reported structure of the bovine α-LA (1HFZ) is 43% helical structure, 9% β-sheet, 48% other. The Ca²⁺ ions are shown in black, the blue ribbon represents α-helix, the green ribbon represents 3₁₀-helix and the green arrows represent β-sheet .......... 139

Figure 5.2. Normalized FTIR spectra of α-LA from pH 7.0 to pH 2.0 covering spectral region 1900 - 1000 cm⁻¹. .......... 141

Figure 5.3. The normalized Amide I region of α-LA shows the changes and shifts in spectral peaks as pH decreases from 4.0 to 2.0 .......... 142

Figure 5.4a. Plot of the second derivative of the FTIR spectrum for α-LA in the native state at pH 7.0. A peak at 1657-1658 cm⁻¹ is clearly visible. The second derivatives for pH 6.5 and 6.0 (not shown) are very similar. .......... 143

Figure 5.4b. The plot of the normalized spectra of α-LA at pH 6-7 and the changes in the intensities and frequencies at about 1657 cm⁻¹. The spectra at pH 6.5 and pH 6.0 are similar to the spectrum at pH 7.0, which is in the native state. .......... 144

Figure 5.5a. This figure shows the well-resolved second derivative of α-LA at pH 3.0 and pH 3.5; the second derivative spectrum shows a peak at about 1657 cm⁻¹. The green spectrum shows changes in intensity. .......... 145

Figure 5.5b. This figure shows the well-resolved second derivative of α-LA at pH 2.0. The second derivative spectrum shows a peak at about 1720 cm⁻¹. .......... 145

Figure 5.5c. Plot of the normalized spectra of α-LA at pH 2.0-3.5 and

the shifts in the intensities and frequencies. ........................ 146

Figure 5.6. This plot highlights changes in the secondary structure of α-LA as a function of pH, based on PLS analysis. The horizontal axis represents the pH values; the vertical axis represents the secondary structure proportions. Below pH 3.0, a gain in β-sheet and a loss in α-helix are seen. The error bars were calculated from the PLS result residual for each spectrum (which is the observed X-ray value minus the calculated FTIR value). .......... 149

Figure 5.7. The plot of the normalized spectra of α-LA at pH 2.0 -

7.0. Plot shows peak shift and structural changes

especially for pH 2.5 and pH 2.0 ......................................... 150


Figure 5.8. This is the plot of the effect of pH 3.5 (intermediate) – pH 2.0 on the normalized Amide I spectral region of α-LA protein. Plot shows slight changes at these pH levels. The different spectral colours represent different pH levels. .......... 152

Figure 5.9. Plot of Amide III region depicts peak shifts for pH 5.5 - pH 4.0 at 1250 cm⁻¹ and pH 4.0 at 1279 cm⁻¹ respectively. .......... 153

Figure 5.10. Plot of Amide III region depicts peak for pH 7.0 (native protein) at 1256 cm⁻¹ .......... 154

Figure 5.11. Score plot of α-LA at different pH on the first and

second principal components. Principal Component

Score plot shows the classification of α-LA spectra based

on the pH levels. PC1 has more native state spectra on

left and acidic state spectra on the right. PC2 separates

the intermediates at the top .................................................. 156

Figure 5.12. Loading plot shows the most significant wavenumbers that contribute to the spectral discrimination seen in the score plot in Figure 5.11. .......... 157

Figure 5.13. Loading spectra of PC1 and PC2 shows the most

significant wavenumbers that contribute to the spectral

discrimination seen in the score plot in Figure 5.11.

Both plots of Figure 5.12 and 5.13 were combined for

the analysis, but can be used separately. ............................. 157

Figure 6.1. The raw spectra of six proteins (β-Lactoglobulin, BSA,

fibrinogen, IgG, lysozyme, and ubiquitin) acquired at

200, 100, 50, 25, 12.5, 6.25, 3.13, 1.56, 0.78, 0.39, 0.19

mg/mL. ................................................................................ 166

Figure 6.2. Spectra of 66 proteins in the Amide I region in

concentrations between 200 mg/mL and 0.19 mg/mL and

scaled for PLS analysis. ....................................................... 168

Figure 6.3. Comparison of normalized spectra of Lysozyme at concentration of 0.19 mg/mL, 50 mg/mL and 200 mg/mL in the Amide I region with major peak at 1654 cm⁻¹ .......... 169

Figure 6.4a. The second derivative spectra of the Amide I region of

IgG in aqueous solution at 200 mg/mL, 50 mg/mL and

0.19 mg/mL collected with FTIR spectroscopy .................. 170

Figure 6.4b. The normalized spectra of the Amide I region of IgG in

aqueous solution at 200 mg/mL, 50 mg/mL and 0.19

mg/mL collected with FTIR spectroscopy. ......................... 171

Figure 6.5. Plot of predicted α-helix content from FTIR spectra of concentrations between 0.19 mg/mL and 200 mg/mL in comparison to the content measured from X-ray crystallography. The Amide I region (1600 – 1700 cm⁻¹) was used for the PLS analysis. .......... 179



Figure 6.6. Plot of β-sheet structure predicted from FTIR spectra of concentrations between 0.19 mg/mL and 200 mg/mL vs. X-ray crystallography. The Amide I region (1600 – 1700 cm⁻¹) was used for the PLS analysis. .......... 180


List of Tables

Table 2.1. A table of pre-processing methods and their minimum and maximum values of spectral intensities. .......... 69

Table 3.1. PCA analysis for the Amide I region listing the

eigenvalues Sum of Squares, their percentage variance and

the cumulative variance. ..........................................................79

Table 3.2. PCA Table shows the eigenvalues Sum of Squares, their

percentage variance and the cumulative variance for the

Amide II region. ......................................................................91

Table 3.3. FTIR marker bands of protein classes in the Amide I

region. ......................................................................................96

Table 4.1a. Prediction results of PLS models of calibration and validation sets for amide I, amide II, and amide I & II, using Normalized + 2nd derivative pre-processed data are shown as grey highlights (rows 3, 4 & 5). Results for PCR and PLS methods on ‘Normalized + SNV’ are also shown in the first and second row. For all methods, 60 protein samples were used for calibration and for internal cross-validation, 44 samples were used for independent test set. .......... 118

Table 4.1b. Prediction results of PLS models of calibration and validation sets for amide I, amide II, and amide I & II, using Normalized + 2nd derivative pre-processed data are shown as grey highlights (rows 3, 4 & 5). Results for PCR and PLS methods on ‘Normalized + SNV’ are also shown in the first and second row. For all methods, 60 protein samples were used for calibration and for internal cross-validation, 44 samples were used for independent test set. (Continued from Table 4.1a.) .......... 119

Table 4.2. Comparison of PLS results of the secondary structure of

proteins in H2O and D2O. The percentage secondary

structure for each protein derived from X-ray

crystallography served as a reference. ...................................121

Table 4.3. Table of intervals and their corresponding frequencies;

interval number 8 which covers the Amide I region was

used for the iPLS global model. ............................................126

Table 4.4. The list of proteins used for both the calibration and test

set in the iPLS. These proteins are listed to match the

numbers in the line of fit plot for both calibration and

validation for iPLS models. ...................................................133

Table 4.5. Table shows results for iPLS models for both global and

local regression analyses. Results of the traditional PLS

analysis of the Amide I region (1600-1700 cm-1) are listed

for comparison. ......................................................................134



Table 5.1. PLS prediction results for the secondary structure content

of Amide I region of α-lactalbumin at different pH levels

using normalized and 2nd derivative pre-processed data.

Structural values are presented as proportions calculated

from 1HFZ (pH 8.0) file using DSSP script. .........................148

Table 5.2. PLS Prediction results for the secondary structure of

Amide II region of α-LA at different pH levels using

normalized and 2nd derivative pre-processed data.

Structural values are presented in proportions. .....................151

Table 5.3. PLS Prediction results for the secondary structure content

of Amide III region of α-lactalbumin at different pH

levels using normalized and 2nd derivative pre-processed

data. Structural values are presented in proportions. ............155

Table 6.1. PLS results of α-helix content of 6 proteins. FTIR spectra

were used for this analysis. Each protein spectra was

collected at concentrations ranging from 0.19 mg/mL to

200 mg/mL. ...........................................................................174

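Table 6.2. PLS results of β-sheet content of 6 proteins. FTIR spectra

were used for this analysis. Each protein spectra was

collected at concentrations ranging from 0.19 mg/mL to

200 mg/mL. ...........................................................................175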

Table 6.3. PLS results of 310-helix content of 6 proteins. FTIR

spectra were used for this analysis. Each protein spectra

was collected at concentrations ranging from 0.19 mg/mL

to 200 mg/mL. .......................................................................176

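Table 6.4. PLS results of Turns content of 6 proteins. FTIR spectra

were used for this analysis. Each protein spectra was

collected at concentrations ranging from 0.19 mg/mL to

200 mg/mL. ...........................................................................177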

Table 6.5. PLS results of Disordered content of 6 proteins. FTIR

spectra were used for this analysis. Each protein spectra

was collected at concentrations ranging from 0.19 mg/mL

to 200 mg/mL. .......................................................................178

Table 6.6. R2 and RMSE (δ) of PLS analysis of secondary structure

contents determined from FTIR spectra in the Amide I

regions. ..................................................................................179

List of Figures and Tables in Appendices

Figure C1. Principal Component plot of Amide I region of 16 proteins

in D2O showing percentage variance of the first 4 PCs.

Blue line indicates that the accumulative percentage

variance is about 98%. ......................................................... 226

Figure C2. PCA Score plot of the Amide I region for 16 proteins in

D2O. The plot shows lysozyme associated with α-helix on

the left. It was wrongly marked as ‘other’ before the

analysis. β-sheet proteins are clustered above right while

‘other’ appears below right. ................................................. 226

Figure C3. PCA loading spectra plot of the Amide I region for 16

proteins in D2O buffer ......................................................... 227

Figure C4. This loading plot indicates the spectral regions responsible

for the discriminations of the proteins. This plot does

correlate well with Figure C3 above; it highlights the

intensity wavenumbers associated with the clustering of

the proteins. 1653 cm-1 is the intensity peak for α-helix

and 1630 cm-1 is associated with β-sheet proteins on

the right of Figure C1 ........................................................... 227

Figure C5. This plot of the Amide I region of FTIR protein spectra

was produced from Discriminant Function Analysis. The

first DF discriminates between alpha-helix on the right and

beta-sheet proteins on the left; the second DF separates the

‘other’ proteins, mainly above zero but clustered in the

middle, from the rest of the data. ......................................... 228

Figure C6. This plot of the Amide II region of FTIR protein spectra

was produced from Discriminant Function Analysis. The

first DF discriminates between α-helix on the left and β-

sheet proteins on the right; the second DF separates the

‘other’ proteins, mainly above zero but clustered in the

middle, from the rest of the data. ......................................... 228

Figure D1. This top level flow chart represents the operations carried

out in this thesis. Level 3.0 and 4.0 can be done in the

same step as part of the multivariate regression analysis.

Each step of the chart is expanded on different flow

diagrams. See Figure D2 and D3 of this thesis. ................. 231

Figure D2. Step 1.0 of the flow chart highlights the steps taken for the

ATR-FTIR spectra measurements using the Bruker Tensor

spectrometer with Opus 6.2 software. The process in red

is an external entity to the spectra measurement; the

spectra can be saved and plotted in any chosen plotting

software. .............................................................................. 232

Figure D3. Step 2.0 captures the pre-processing steps that were done

on the ATR-FTIR measured spectra used for this research.

Each step of this flow chart was done in Matlab R2010

and Matlab R2013a. The process in red is not necessarily

part of the pre-processing steps and is expanded in Figure

D4. ....................................................................................... 233

Figure D4. This is the generalized step for the multivariate analysis

done in this research. The PCA and PLS steps are not

broken down in this flow chart; detailed methods can be found

in Chapter 3 (PCA) and Chapter 4 (PLS). This includes

the processes in navy blue colour that pertain to the

internal workings of most multivariate analysis. ................. 234

Table A1. FTIR Protein Band Assignments ......................................... 210

Table A2. Secondary structure fractions of 90 proteins obtained from

DSSP.(266) .......................................................................... 211

Table B1. Secondary Structure FTIR/PLS predicted result of each

individual sample in calibration set of 60 proteins in H2O

solution. ....................................................................................214

Table B2. Continuation of FTIR/PLS calibration results for parallel,

anti-parallel and mixed β-sheet ..............................................215

Table B3. Continuation of FTIR/PLS calibration results for 310-helix,

turns and other .........................................................................216

Table B4. FTIR/PLS Cross-Validation results for α-helix and β-sheet ...217

Table B5. FTIR/PLS Cross-Validation results for parallel β-sheet,

antiparallel β-sheet and mixed β-sheet ....................................217

Table B6. FTIR/PLS Cross-Validation results for 310-helix, turns and

other .........................................................................................219

Table B7. Result of 44 samples in the test set used for Partial Least

Squares (PLS) analysis. About 20 spectra from calibration

set were included in the test set for validation purpose (α-

helix and β-sheet only). ..........................................................220



set were included in the test set for validation purpose

(parallel β-sheet, parallel β-sheet, and mixed β-sheet). ......221



set were included in the test set for validation purpose (310-

helix, β-turns and other). .......................................................222

Table B10. Results for FTIR/PLS comparison of proteins in H2O and

D2O. .........................................................................................223

Table B11. Continuation of Table B7 for comparison of proteins in H2O

and D2O results. .......................................................................223

Table B12. Continuation of Table B8 for comparison of proteins in H2O

and D2O results. .......................................................................224

Table C1. Accuracy of Discriminant Function Analysis .......................229

Abbreviations

2D Two Dimensional

310-helix Three-ten Helix

3D Three Dimensional

ATR Attenuated Total Reflection

C Coils

CD Circular Dichroism

CPL Circularly Polarized Light

CV Cross-Validation

DFA Discriminant Function Analysis

DNA Deoxyribonucleic Acid

EM Electro-Magnetic

FTIR Fourier Transform Infrared

IR Infrared

MDA Multivariate Data Analysis

MSE Mean Square Error

MVR Multivariate Regression

NMR Nuclear Magnetic Resonance

O disordered

PC Principal Component

PCA Principal Component Analysis

PCR Principal Component Regression

PDB Protein Data Bank

PLS Partial Least Squares

R2 Coefficient of Determination

RMSCV Root Mean Square Error for Cross-Validation

RMSE Root Mean Square Error

RMSEC Root Mean Square Error of Calibration

RMSEP Root Mean Square Error of Prediction

RNA Ribonucleic Acid

SSerr Residual Sum of Squares

SStot Total Sum of Squares

STD Standard deviation

T Turns

X Matrix of independent x-variables

Y Matrix of dependent y-variables

ZnSe Zinc Selenide crystal

α-helix Alpha Helix

β-Sheet Beta Sheet

ε Residual Matrix Eigenvalue



Abstract

One of the challenges in the structural biology community is processing the wealth of

protein data being produced today; therefore, computational tools have been

incorporated to speed up and help understand the structures of proteins, and hence their

functions.

In this thesis, protein structure investigations were made through the use of Multivariate

Analysis (MVA) and Fourier Transform Infrared (FTIR) spectroscopy, a form of vibrational

spectroscopy. FTIR has been shown to identify the chemical bonds in a protein in

solution and it is rapid and easy to use; the spectra produced from FTIR are then

analysed qualitatively and quantitatively by using MVA methods, and this produces

non-redundant but important information from the FTIR spectra.

High resolution techniques such as X-ray crystallography and NMR are not always

applicable and Fourier Transform Infrared (FTIR) spectroscopy, a widely applicable

analytical technique, has great potential to assist structure analysis for a wide range of

proteins. FTIR spectral shape and band positions in the Amide I (which contains the

most intense absorption region), Amide II, and Amide III regions, can be analysed

computationally, using multivariate regression, to extract structural information. In this

thesis Partial least squares (PLS), a form of MVA, was used to correlate a matrix of

FTIR spectra and their known secondary structure motifs, in order to determine their

structures (in terms of "helix", "sheet", “310-helix”, “turns” and "other" contents) for a

selection of 84 non-redundant proteins. Analysis of the spectral wavenumber range

between 1480 and 1900 cm-1 (Amide I and Amide II regions) results in high accuracies

of prediction, as high as R2 = 0.96 for α-helix, 0.95 for β-sheet, 0.92 for 310-helix, 0.94

for turns and 0.90 for other; their Root Mean Square Error for Calibration (RMSEC)

values are between 0.01 and 0.05, and their Root Mean Square Error for Prediction

(RMSEP) values are between 0.02 and 0.12. The Amide II region also gave results

comparable to that of Amide I, especially for predictions of helix content. We also used

Principal Component Analysis (PCA) to classify FTIR protein spectra into their natural

groupings as proteins of mainly α-helical structure, or proteins of mainly β-sheet

structure or proteins of some mixed variations of α-helix and β-sheet. We have also

been able to differentiate between parallel and anti-parallel β-sheet. The developed

methods were applied to characterize the secondary structure conformational changes of

an unfolding protein as a function of pH and also to determine the limit of Quantitation

(LoQ).

Our structural analyses compare highly favourably to those in the literature using

machine learning techniques. Our work shows that FTIR spectra, in combination with

multivariate analysis methods such as PCA and PLS, can accurately identify and quantify

protein secondary structure. The developed models in this research are especially

important in the pharmaceutical industry where the therapeutic effect of drugs strongly

depends on the stability of the physical or chemical structure of their protein targets;

therefore, understanding the structure of proteins is very important in the

biopharmaceutical world for drug production and formulation. There is a new class of

drugs that are themselves proteins, used to treat infectious and autoimmune diseases.

The use of spectroscopy and multivariate regression analysis in the medical industry to

identify biomarkers in diseases has also brought new challenges to the bioinformatics

field. These methods may be applicable in food science and academia in general, for the

investigation and elucidation of protein structure.
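
The R2 and RMSE figures quoted in this abstract follow standard definitions. As a minimal sketch, assuming R2 = 1 - SSerr/SStot and an RMSE taken as the square root of the mean squared residual (any degrees-of-freedom correction used in the thesis is not reproduced here), the two metrics could be computed as below. The function names and the sample helix fractions are hypothetical and are given only for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, R2 = 1 - SSerr / SStot."""
    mean = sum(y_true) / len(y_true)
    ss_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - ss_err / ss_tot


# Hypothetical helix fractions (not data from this thesis):
# reference values, e.g. from crystal structures, against model predictions.
reference = [0.10, 0.30, 0.50, 0.70]
predicted = [0.12, 0.28, 0.53, 0.69]

print(round(rmse(reference, predicted), 4))
print(round(r_squared(reference, predicted), 4))
```

Applied to a calibration set the first function gives an RMSEC-style value, and applied to an independent test set an RMSEP-style value.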

Declaration

No portion of the work referred to in the thesis has been submitted in support of an

application for another degree or qualification of this or any other university or other

institute of learning.



Copyright Statement

i. The author of this thesis (including any appendices and/or schedules to this

thesis) owns certain copyright or related rights in it (the “Copyright”) and

s/he has given The University of Manchester certain rights to use such

Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or

electronic copy, may be made only in accordance with the Copyright,

Designs and Patents Act 1988 (as amended) and regulations issued under it

or, where appropriate, in accordance with licensing agreements which the

University has from time to time. This page must form part of any such

copies made.

iii. The ownership of certain Copyright, patents, designs, trademarks and other

intellectual property (the “Intellectual Property”) and any reproductions of

copyright works in the thesis, for example graphs and tables

(“Reproductions”), which may be described in this thesis, may not be owned

by the author and may be owned by third parties. Such Intellectual Property

and Reproductions cannot and must not be made available for use without

the prior written permission of the owner(s) of the relevant Intellectual

Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication

and commercialisation of this thesis, the Copyright and any Intellectual

Property and/or Reproductions described in it may take place is available in

the University IP Policy (see

http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in any

relevant Thesis restriction declarations deposited in the University Library,

The University Library’s regulations (see

http://www.manchester.ac.uk/library/aboutus/regulations) and in The

University’s policy on Presentation of Theses.

Acknowledgement

I would like to thank my supervisors, Dr. Ewan Blanch and Prof. Andrew J. Doig, for

their generosity and for helping me transition into a new field. Thanks to Dr. Saeideh

Ostovar-Pour, my former colleague and friend, for her support, especially after surgery.

This research is dedicated to my mother, Dorah T. Wilcox, for all her teachings, and to

my late father, Ellis A. L. Wilcox who, as grisly as it may sound, is still my audience. I

heartfully thank my friends and family for being there for me, and to my unforgiving

health, a toast of Dom Pérignon for the epic adventure.



1 Introduction

The use of computational methods and biological tools for characterizing protein

structures is of great interest to the scientific community, and especially to structural

biologists. One of the challenges in structural biology is processing the wealth of data

being produced today on proteins; therefore, the use of computer aided tools has been

incorporated to speed up and help understand the structures of proteins, and hence their

functions. Bioinformatics is a field that utilizes computer science, mathematics and

statistics to build an information extraction tool for biological measurements.

Determining the structures of proteins is of fundamental importance for understanding

the functions and behaviour of proteins at the molecular level. Each protein has a native

structure that is specified by its amino acid composition and its secondary structure so

that it attains unique characteristics and can correctly carry out a biological function.

Defects to a protein’s native structure can lead to a number of diseases such as

Alzheimer’s and sickle cell anaemia, rendering the protein incapable of its normal

function.

1.1 Protein Structure

Proteins are polymers of amino acids linked by peptide bonds, also known as amide

bonds; for a polymer to be considered a protein, it must be able to fold into a

well-defined three-dimensional structure.(1) There are twenty different amino acids

distinguished by their side chains known as “R” groups. Proteins have four structural

levels: primary, secondary, tertiary and quaternary.(2) To determine protein structure

from biological samples and for a thorough analysis to be performed on protein data, a

detailed understanding of basic building blocks (amino acids) and their properties is

vital. A brief explanation of these structural levels is given below.

1.1.1 Primary Structure

The primary structure is the amino acid sequence. Amino acids have a central Cα

(alpha-carbon) which has four different groups attached to it: a hydrogen atom (H), a

carboxyl group (-COOH), an amino group (-NH2) and a side chain (known as the R

group), except for proline, whose side chain is also bonded to the backbone nitrogen, and glycine,

which has two hydrogen atoms on the α-carbon.(3) There are twenty common R groups found in amino

acids of proteins. Each side chain gives its amino acid group unique properties; these

amino acids can be either polar (hydrophilic) or non-polar (hydrophobic), they can also

be neutral, positively, or negatively charged. Each amino acid carboxyl group (C

terminal) is covalently bonded to the amino group (N terminal) of another amino acid to

create a peptide bond.(4) The backbone of protein primary structure is made up of a

repeating sequence of these peptide bonds, resulting in long chains called polypeptides

typically made up of 50 to 2000 amino acid residues.(5, 6)

The angles (torsion angles) between any two bonds of the atoms on the backbone are

equal and the distances between the atoms on the backbone are 1.47 Å for N-Cα, 1.53 Å

for Cα-C and 1.32 Å for C-N.(7) The protein backbone angles around

these bonds can be described as phi (Φ) for N-Cα, psi (Ψ) for Cα-C bond, and omega

(Ω) for C-N bond. The peptide bond has partial double bond character and this means

that the carboxyl oxygen, the carboxyl carbon and the Amide nitrogen are coplanar and

cannot be rotated freely.(8, 9) The two bonds around the Cα (N-Cα and Cα-C) can

rotate freely provided there is no steric hindrance (see Figure 1.1).
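
To make the backbone angles concrete, the sketch below computes a torsion (dihedral) angle from the Cartesian coordinates of four consecutive atoms. It is an illustration rather than a method used in this thesis: the function name `dihedral` and the toy coordinates are hypothetical, and the atom quartets named in the comments (C-N-Cα-C for Φ, N-Cα-C-N for Ψ, Cα-C-N-Cα for Ω) are the usual convention, stated here as an assumption.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond, in the range -180 to 180.

    Assumed atom quartets for a residue i:
      phi   : C(i-1), N(i),  CA(i),  C(i)
      psi   : N(i),   CA(i), C(i),   N(i+1)
      omega : CA(i),  C(i),  N(i+1), CA(i+1)
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / length for c in b1)  # unit vector along the central bond
    # Project the two outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(c * _dot(b0, b1) for c in b1))
    w = _sub(b2, tuple(c * _dot(b2, b1) for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (not from a real structure): the outer bonds are perpendicular.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

A planar trans peptide bond corresponds to Ω close to 180° in this convention, while Φ and Ψ take the values allowed by steric hindrance.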



Figure 1.1. This is a diagrammatic representation of the peptide bond of amino acid

residues and the backbone angles that are formed; phi (Φ) for the angle between N-Cα, psi

(Ψ) for the angle between Cα-C bond, and omega (Ω) for the angle between C-N bond.

1.1.2 Secondary Structure

The secondary structure consists of the folding of the primary sequence into local

conformations of the backbone chain. In proteins, the secondary structure is identified

by the pattern of regular hydrogen bonds between the amide and carbonyl groups of a

polypeptide.(2) These bonds form structures such as α-helices, β-pleated sheets, β-turns,

310-helices, and unfolded structures, of which α-helices and β-sheets are the two

most commonly occurring motifs.(1)

1.1.2.1 α-Helices

In the α-helix, the polypeptide chain is coiled into a screw-like shape, leaving the side

chains extended outwardly. The α-helix is stabilized by the hydrogen bond between the C=O of

the ith residue and the N-H of the (i + 4)th residue, giving the helix 3.6 amino acid residues per

turn; α-helices have a mean length of 10 residues.(10) A hydrogen bond to the

(i + 3)th residue gives the 310-helix, which is a tighter coil (see Figure 1.2) and spans

about three to four residues. Although this helix type is rarely found as a long helix, the

end turn of an α-helix often adopts the conformation of a 310-helix.(11) The name

310 arises because of the 3 amino acid residues per turn and the ten atoms enclosed in the ring

formed by each hydrogen bond. Barlow and Thornton(12) in their analysis of 57 protein

structures showed that 32% of the total number of residues were involved in the

formation of α-helices, while 3.4% of the total number of residues were in 310-helices.

They also stated that 310-helices have an average of 3.2 residues per turn instead of three

residues. An extremely rare hydrogen bond to the (i + 5)th residue gives the π-helix,

which is a looser coil(2) (see Figure 1.2).

Figure 1.2. α-Helix, 310-helix and π-helix formation. The dotted red lines

show the bonding pattern that can occur between residue i and residues i+3, i+4 and i+5.
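
The three helix types differ only in which residue pairs are hydrogen bonded and in how many residues make up one turn. The following sketch tabulates the values given in the text and derives the rotation per residue from them. The names `HELIX_TYPES`, `hbond_pairs` and `rotation_per_residue` are hypothetical; no residues-per-turn value is given in the text for the π-helix, so it is left undefined rather than filled in.

```python
# Hydrogen-bond offsets and residues per turn as quoted in the text.
HELIX_TYPES = {
    "alpha": {"hbond_offset": 4, "residues_per_turn": 3.6},
    "3_10": {"hbond_offset": 3, "residues_per_turn": 3.0},
    "pi": {"hbond_offset": 5, "residues_per_turn": None},  # not stated in the text
}


def hbond_pairs(n_residues, helix):
    """List the (i, i + n) residue pairs that can be hydrogen bonded in a helix
    of n_residues residues, numbering residues from 1."""
    offset = HELIX_TYPES[helix]["hbond_offset"]
    return [(i, i + offset) for i in range(1, n_residues - offset + 1)]


def rotation_per_residue(helix):
    """Rotation about the helix axis per residue, in degrees (360 / residues per turn)."""
    per_turn = HELIX_TYPES[helix]["residues_per_turn"]
    return None if per_turn is None else 360.0 / per_turn


print(hbond_pairs(10, "alpha"))
print(rotation_per_residue("3_10"))
```

With 3.6 residues per turn the α-helix advances by about 100° per residue, whereas the tighter 310-helix advances by 120°.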

1.1.2.2 β-Sheets

In a β-pleated sheet, the chains lie alongside each other in either a parallel or anti-

parallel (opposite) direction (see Figure 1.3). The Amide group from one beta-strand

(also β-strand) hydrogen bonds with the carbonyl group in another strand, forming a β-sheet

structure. β-strands are typically 3 to 10 amino acids long and their backbones are

almost fully extended.(7) This bonding can be found between separate polypeptide chains

or between regions of the same polypeptide chain folded back on itself.(1) Proteins

may have both α-helices and β-sheets in the same polypeptide; there are also other

structures, such as β-turns, which connect other secondary structures, and random coils,

which comprise all structures that do not fall into the other secondary structure

categories.(13)

Parallel β-Sheets

In parallel β-sheet, two peptide strands running in the same direction are held together

by hydrogen bonding between the strands (see Figure 1.3). One residue forms hydrogen

bonds to two residues on the other strand that are separated by one residue in the sequence. That is,

residue i in one strand bonds to residues [j,j+2] in another strand and residue j+2 bonds

to residues [i, i+2].(7) This arrangement may be less stable than anti-parallel because it

produces non-planarity in the inter-strand hydrogen bonding pattern.(8) Each hydrogen-bonded

ring in a parallel β-sheet has 12 atoms in it. The peptide backbone dihedral angles (φ,

ψ) are about (–120°, 115°).(10)

Antiparallel β-Sheets

In an antiparallel β-sheet arrangement, the successive β-strands alternate directions so

that the N-terminus of one strand is adjacent to the C-terminus of the next. One residue

forms two hydrogen bonds to a single residue on the other strand; therefore residue i on

one strand bonds to residue j in another strand and residue i+2 bonds to residue j-2.(7)

Figure 1.3 illustrates these arrangements. This is the arrangement that produces

the strongest inter-strand stability because it allows the inter-strand hydrogen bonds

between carbonyls and amines to be linear, which is their preferred orientation.(8) The

number of atoms in the antiparallel hydrogen bond ring is between 10 and 14. The

peptide backbone dihedral angles (φ, ψ) are about (–140°, 135°) in antiparallel

sheets.(10) A strand bonded with a parallel strand on one side and an antiparallel strand

on the other is said to be a mixed strand. Such arrangements are less common and as

such could be less stable than the anti-parallel arrangement.(2)
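
The dihedral angles quoted above can be used to label a backbone conformation by its nearest reference point. The sketch below is only an illustration of that idea, not the classification used in this thesis (which relies on DSSP assignments). The parallel and antiparallel values are those given in the text; the α-helix value of (–57°, –47°) is a commonly cited ideal value and is an assumption added here, as are all function and variable names.

```python
import math

# Reference (phi, psi) pairs in degrees. The two beta-sheet entries come from
# the text; the alpha-helix entry is an assumed textbook value.
REFERENCE_ANGLES = {
    "parallel beta-sheet": (-120.0, 115.0),
    "antiparallel beta-sheet": (-140.0, 135.0),
    "alpha-helix (assumed)": (-57.0, -47.0),
}


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_conformation(phi, psi):
    """Return the reference conformation closest to (phi, psi)."""
    def distance(name):
        ref_phi, ref_psi = REFERENCE_ANGLES[name]
        return math.hypot(angular_difference(phi, ref_phi),
                          angular_difference(psi, ref_psi))
    return min(REFERENCE_ANGLES, key=distance)


print(nearest_conformation(-138.0, 133.0))
```

Because the parallel and antiparallel reference points lie only about 28° apart, real residues often fall between them, so a nearest-point rule of this kind is a rough guide at best.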

Figure 1.3. Parallel and anti-parallel β-sheet formation.

1.1.3 Tertiary Structure

Tertiary structure represents the folding of a polypeptide chain into a three dimensional

form that contains its functional regions called domains(14) and these regions are

unique to a particular protein.(4) The tertiary structure of a protein is stabilized by the

different interactions between the side chains, the "R" groups. In addition to the



hydrophobic interaction among nonpolar side chains, major stabilizing forces include

ionic interactions, hydrogen bond interactions, van der Waals interactions and disulfide

bridges; of these, the hydrophobic interactions among non-polar side chains contribute

most to stabilizing the tertiary structure.(15) In protein denaturation, the protein

loses its shape; as the pH or temperature of a protein is increased,

the protein loses its tertiary structure and therefore its function. Protein folding is

discussed in Section 1.1.5.

The overall shape of a protein can be described by the organization of its secondary

structure, therefore the comparison of the structure of a protein is done on the secondary

structure level and this aids in the understanding of structure and function relationship.

Creating a reliable tool that can identify and quantify the secondary structures of protein

has been a concern in the biotechnology industry and that is the object of this research.

The three online databases popular for structure classifications are FSSP-DaliDD(16)

for fold classification based on structure-structure comparison of proteins; CATH(17)

for class, architecture, topology and homologous superfamily; and SCOP(18) for structural

classification of proteins.

1.1.4 Quaternary Structure

The quaternary level of protein structure occurs due to the association of subunits of

different tertiary structures into a final specific shape which provides the basis for their

functions and activities.(19) Not all proteins form quaternary structures, but some do:

collagen, a fibrous protein which is found in hair, cartilage and skin, has a

quaternary structure,(20) as does haemoglobin, a globular protein.

Haemoglobin is made up of four polypeptide chains with two α-chains each consisting

of 141 amino acids, and two β-chains each consisting of 146 amino acids (α2β2).(21)

The 3D shape of haemoglobin is necessary for its oxygen carrying function because

each polypeptide has a haem group with an Fe2+ ion that can bind O2, which means

that each haemoglobin molecule can carry four oxygen molecules in the blood.

As mentioned in the tertiary structure section, the structure classification databases SCOP

and CATH are widely used for the analysis and understanding of protein structure.

However, they are based on the tertiary structure level or domain level; valuable

information such as the number and type of subunits in a protein quaternary structure

can help to classify proteins with similar domain patterns and functions and this can be

done by using machine learning methods.

1.1.5 Protein Folding

The three dimensional fold of a protein is determined by the amino acid sequence

encoded by the gene during protein synthesis which occurs in the protein building site

of a cell called the ribosome.(22) The genetic information contained in DNA is

transferred to a messenger RNA (mRNA) for translation. After the mRNA is translated into

a chain of amino acids, the linear polypeptide usually folds spontaneously, although this is

sometimes helped by other proteins known as chaperones which guard against

misfolding.(8) Proteins fold to achieve their lowest possible energy that gives them

their stable structure known as the native state. A common folding intermediate state

which has its secondary structure intact and a loose tertiary structure is known as the

molten globule state.(23) In protein denaturation, the protein loses its shape as part of

the process; as the pH changes or the temperature is increased, the protein loses

its structure and therefore its function.(24)



Most proteins undergo some form of modification following translation. Post-

translational modifications can result in the attachment of carbohydrate to a protein

which is believed to help prevent proteins from sticking together or to the cell walls;

this addition of carbohydrate to a protein is known as glycosylation.(25) In many cases,

the removal of the carbohydrate leads to protein unfolding or aggregation.

Phosphorylation is another form of post-translational modification; it

controls the behaviour of a protein, for instance by acting as a toggle for

enzyme activity (activation/deactivation of the enzyme).(26, 27) In this way it can

introduce a conformational change in the structure of a protein by interacting with its

hydrophobic and hydrophilic residues;(28) this process can sometimes be reversible,

although abnormal phosphorylation can be the cause of many chronic inflammatory

diseases.(29)

The study of protein folding and post-translational modifications is particularly

important for the study of diseases caused by misfolded proteins such as cancer and

diabetes.(30, 31)

1.2 Experimental Determination of Protein Structures

The 3D structure of a water-soluble protein can be determined by high resolution techniques such as

X-ray crystallography and NMR, but these are not always applicable. Not all proteins,

especially membrane proteins, can be crystallized, and NMR is limited in the size of protein it can study and

cannot provide real-time data; in addition, both of these methods can be time consuming.(32,

33) Therefore, there is a need for a method that can probe the functional characteristics and

structural changes of proteins and can observe proteins in real time. Fourier Transform Infrared (FTIR)

spectroscopy is a widely applicable analytical technique and has great potential to assist structure

analysis for a wide range of proteins. FTIR spectroscopy can provide fingerprints of biological

samples in a rapid and non-invasive way and can also provide information about the

conformation of proteins.(13) A background of vibrational spectroscopy will be given first

before explaining FTIR.

1.2.1 Vibrational Spectroscopy

Vibrational spectroscopic techniques measure vibrational energy levels which are

associated with the chemical bonds in a molecule. The spectrum of a sample is unique and

so can serve as a fingerprint; vibrational spectroscopy can therefore be used for

identifying, characterizing, and elucidating the structures of proteins.(34) The term vibrational

spectroscopy describes techniques that can be used to obtain vibrational

information from solid, liquid or gas state samples. Two analytical techniques fall into

this category: infrared and Raman spectroscopy. These two techniques are non-invasive

and, usually, non-destructive to the sample and can be used to probe proteins

for their molecular structure.(34) This section will focus on the uses of IR spectroscopy

as it applies to protein structure.

1.2.2 Fourier Transform Infrared (FTIR)

Infrared (IR) spectroscopy is a technique that has gained importance in use for the

experimental determination of secondary structure. There is a growing demand for

FTIR spectroscopy, especially for the analysis of proteins, because of its technical

advantages: sample preparation is simple, analysis is fast,

the technique is sensitive, and samples can be

measured in aqueous solution.(35) With the advent of modified Fourier Transform Infrared

spectroscopy and computational analysis in the late 1980s and 1990s, FTIR

spectroscopy has been successfully applied for the detection, discrimination,

identification, and classification of biological samples belonging to different



species.(13) There is a need for the understanding and characterization of protein

structure in many different industries such as pharmaceuticals and biomaterials, and

FTIR has emerged as one of the few techniques that can be applied for the structural

characterization of proteins in many different environments.(36)

1.2.2.1 The Infrared Regions

The infrared region of the electromagnetic spectrum is divided into three regions: the

near-, mid-, and far-IR (see Figure 1.4). The mid-IR (4000–400 cm⁻¹) is the most

commonly used region, by chemists and spectroscopists, for analysis of organic

compounds as all molecules possess characteristic absorbance frequencies and

molecular vibrations in this range. Amide bonds absorb electromagnetic energy within

several regions of the mid-IR spectrum giving strong bands in some of these

regions.(37)
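The relationship between wavenumber, wavelength and frequency used to define the mid-IR region can be checked numerically. The following is a minimal sketch; the function names are illustrative and are not part of the thesis. It relies only on the facts that wavelength in cm is the reciprocal of wavenumber in cm⁻¹ and that frequency equals the speed of light divided by wavelength.

```python
def wavenumber_to_wavelength_cm(wavenumber_per_cm: float) -> float:
    """Wavelength in cm for a wavenumber in cm^-1 (the two are reciprocals)."""
    return 1.0 / wavenumber_per_cm


def wavenumber_to_frequency_hz(wavenumber_per_cm: float) -> float:
    """Frequency in Hz from nu = c / lambda, with c expressed in cm/s."""
    speed_of_light_cm_per_s = 2.99792458e10
    return speed_of_light_cm_per_s * wavenumber_per_cm


# Limits of the mid-IR region quoted in the text, in cm^-1
mid_ir_limits = [4000.0, 400.0]
wavelengths_cm = [wavenumber_to_wavelength_cm(w) for w in mid_ir_limits]
```

With these definitions, 4000 cm⁻¹ corresponds to a wavelength of 2.5 × 10⁻⁴ cm and 400 cm⁻¹ to 2.5 × 10⁻³ cm.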

Figure 1.4. Electromagnetic spectrum. Frequency (ν) = speed of light (c) / wavelength (λ). The mid-IR

region of the spectrum is the wavelength region between 2.5 × 10⁻⁴ and 2.5 × 10⁻³ cm, or

4000–400 cm⁻¹ in wavenumbers.

Protein IR absorption in the mid-IR region gives rise to nine characteristic bands,

Amide A, Amide B, and Amide I – Amide VII, which are used for protein secondary

structure identification. Of these, the Amide I region, between 1700 and 1600 cm⁻¹, is the

most sensitive to protein secondary structure and corresponds to the C=O stretching

vibration of the peptide bond. The Amide II band, which covers the spectral region

between 1550 and 1540 cm⁻¹, corresponds to in-plane N–H bending (60%) and

C–N stretching (40%). The Amide III spectral region (1350–1200 cm⁻¹) is due to a

combination of C–C and C–N stretching and C–H bending vibrations and is called the

fingerprint region because a series of absorptions produces a distinct and complicated

pattern of peaks in this spectral region. It gives relatively weak signals, but does not

have the water vibrational band interference which affects the Amide I region.(13, 38,

39) The Amide I and II regions are sensitive to the secondary structure of protein

because the C=O and the N-H bonds are involved in the hydrogen bonding of the

secondary structure of proteins. FTIR spectra from the Amide I, Amide II, and Amide

III regions, can be analysed computationally, using multivariate regression, to extract

structural information.(36)

1.2.2.2 FTIR and Protein Sample Measurement

Vibrations of the peptide bond in the mid-IR region have been explained in section

1.2.2.1. These bonds may absorb at more than one IR frequency; N–H bending

absorption signals are usually weaker than C=O stretching absorption signals.(40) For

infrared radiation to be absorbed by a molecule, the vibrational frequency of the

molecule has to match the frequency of the IR photon.(36) The infrared

modes of water may absorb strongly in the Amide I region and overlap with the sample

absorption. This problem may be overcome by use of deuterium oxide (D2O) to

investigate proteins; however, due to the advances made with FTIR spectrometers,



the water spectrum can be subtracted, allowing protein samples to be measured in H2O

solvent.(41, 42)

A general schematic of an FT-IR spectrometer is presented in Figure 1.5. The IR source

emits radiation that is passed through an interferometer, usually a Michelson

interferometer with a beam splitter (a semi-reflecting film usually made of KBr), a fixed

mirror, and a moving mirror.(43) The interferometer uses interference patterns to make

accurate measurements of the wavelength of light. When IR radiation is passed through

a sample, some radiation is absorbed and the rest is transmitted to the detector. The

detector measures the total interferogram from all the different IR wavelengths.(43, 44)

A Fourier transform function then converts the interferogram (an intensity versus time

spectrum) to an IR spectrum (an intensity versus frequency spectrum).(44, 45) The

spectrum is traditionally plotted with Y-axis units as absorbance or transmittance and

the X-axis as wavenumber units. For more information on vibrational bands in the

Amide I, II, and III regions of the IR spectra, see Table A1 in Appendix A.

Figure 1.5. A general schematic of an FT-IR spectrometer.
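The conversion of an interferogram into a spectrum by a Fourier transform can be illustrated with a synthetic signal. The sketch below is not the processing performed by any particular spectrometer; it assumes an idealized interferogram sampled at equally spaced points and built from two cosine components whose positions (5 and 12, in index units) and amplitudes are hypothetical. A plain discrete Fourier transform is used instead of a fast Fourier transform to keep the example self-contained.

```python
import cmath
import math


def dft_magnitude(signal):
    """Magnitude of the discrete Fourier transform for the first half of the bins."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


# Hypothetical interferogram: a strong component at index 5 and a weaker one at 12
n_points = 64
interferogram = [
    1.0 * math.cos(2 * math.pi * 5 * t / n_points)
    + 0.5 * math.cos(2 * math.pi * 12 * t / n_points)
    for t in range(n_points)
]

spectrum = dft_magnitude(interferogram)

# Indices of the two largest peaks in the transformed signal
peaks = sorted(range(len(spectrum)), key=lambda k: spectrum[k], reverse=True)[:2]
```

The two peaks recovered in `spectrum` sit at the positions of the two cosine components, with heights proportional to their amplitudes, which is the sense in which the transform turns the detector signal into an intensity versus frequency spectrum.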

1.2.2.3 ATR Absorbance

Attenuated total reflectance (ATR) spectroscopy is based on the concept of internal

reflection: the infrared beam enters an ATR cell (a crystal of high refractive index)

at an angle of 45° (the angle of incidence) and is reflected

inside the crystal at the surface in contact with the sample, which has a lower

refractive index; if the angle of incidence within the crystal is greater than the

critical angle, the beam undergoes total internal reflection.(35, 42) The material of the

crystal is chosen to allow internal reflection; it is the absorption of the evanescent

wave that penetrates into the sample that is measured. The ATR crystal can be

diamond, zinc selenide (ZnSe), or germanium. The penetration depth (Dp) of the

evanescent wave depends on the crystal refractive index (n2), the angle of incidence (θ)

of the penetrating infrared beam to the sample surface, and the wavelength of light

(λ).(44, 46) At the depth of penetration, the evanescent wave intensity decreases to

37% of its initial value.(35) The equation for an ATR penetration depth is given by

equation (1.2).

$$D_p = \frac{\lambda / n_2}{2\pi \left[ \sin^2\theta - (n_1/n_2)^2 \right]^{1/2}}$$

Eq. (1.2)

where θ is the angle of incidence, n1 is the sample refractive index, n2 is the crystal

refractive index and λ is the wavelength. A smaller angle of incidence or crystal refractive

index, or a longer wavelength, gives a higher Dp

(deeper penetration).(35, 47) For the ATR experiment used in this research,

n1 is 1.4, n2 (ZnSe) is 2.4 and θ is 45°. Figure 1.6 shows a schematic of the

ATR technique.



Figure 1.6. This diagram illustrates the evanescent wave reaching into the sample at a 45°

angle. This angle is called the angle of incidence.
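The penetration depth can be evaluated for the experimental parameters quoted above (sample index 1.4, ZnSe crystal index 2.4, angle of incidence 45°). The sketch below assumes the commonly used form of the expression, in which the wavelength is divided by the crystal refractive index and the sine term is squared; the function name and the choice of 1650 cm⁻¹ (a position inside the Amide I region) are illustrative assumptions.

```python
import math


def penetration_depth(wavelength, n_sample, n_crystal, angle_deg):
    """Evanescent-wave penetration depth, in the same units as the wavelength."""
    ratio = n_sample / n_crystal
    sin_theta = math.sin(math.radians(angle_deg))
    radicand = sin_theta ** 2 - ratio ** 2
    if radicand <= 0:
        # Below the critical angle there is no total internal reflection
        raise ValueError("angle of incidence must exceed the critical angle")
    return (wavelength / n_crystal) / (2 * math.pi * math.sqrt(radicand))


# 1650 cm^-1 converted to a wavelength in micrometres (hypothetical band position)
wavelength_um = 1e4 / 1650.0

depth_um = penetration_depth(wavelength_um, n_sample=1.4, n_crystal=2.4, angle_deg=45.0)

# Critical angle for the same pair of refractive indices, in degrees
critical_angle_deg = math.degrees(math.asin(1.4 / 2.4))
```

Under these assumptions the depth is close to one micrometre, and the 45° angle of incidence lies above the critical angle of roughly 36°, so the condition for total internal reflection is met.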

The path length of a transmission experiment is defined by the thickness of the sample

and is therefore the same across the spectrum; whereas, in the ATR experiment, the

depth to which the sample is penetrated by the infrared beam is a function of the

wavelength;(48) therefore, the relative intensity of bands in an ATR spectrum increases

with wavelength (that is, with decreasing wavenumber), and anomalous dispersion can affect the peak shape.

As a result, the infrared spectrum of a sample obtained by ATR exhibits some

significant differences when compared to its transmission measurement. An ATR correction is

advised where quantitative analysis is necessary.(46) ATR spectroscopy

requires little or no sample preparation; it can handle liquid, solid, film, or

powder samples and is reproducible. It is this ease of use and surface depth

measurement that has prompted its popularity for protein/peptide structural studies.(49)

1.2.2.4 FTIR and Other Methods

With the help of X-ray crystallography, molecular biologists have been able to study the

variability in protein structures, and their importance and functions. However, not all

proteins can form the well-ordered crystals required for diffraction of X-rays to high

resolution and the process of crystallizing protein molecules can be difficult and time

consuming.(50) Other methods like Nuclear Magnetic Resonance (NMR) are used to

study proteins in solution but NMR spectroscopy is restricted to proteins of fewer

than 200 residues.(34, 51) Fourier Transformed Infrared (FTIR) spectroscopy gives high quality

spectra without the fluorescence background problems found with Raman

spectroscopy or the size limitation found with NMR.(52) FTIR requires little sample

preparation and is fast and simple to use for spectral acquisition. It is non-destructive to

the protein sample during measurement; qualitative as well as quantitative analysis of

protein content can be acquired. FTIR has become a powerful tool for examining the

conformations of proteins in H2O as well as in deuterated forms (D2O).(13, 53-55)

1.3 Computational Analysis Theory

As of September 02, 2014, there were more than 103,015 entries in the Protein Data Bank

(PDB), which is a resource that is routinely used by most biomolecular scientists and

engineers conducting studies on proteins. PDB protein entries have been determined

experimentally by X-ray crystallography and some by Nuclear Magnetic Resonance

(NMR)(56) and are used as the de-facto gold standard for most regression models used

for protein analysis. Other experimental methods, like IR, Raman and CD

spectroscopies, have emerged to ease or enhance the probing of protein structure in

different environments; however, experimental work produces more data than can readily be

analysed. Computational methods have emerged for automating and improving

protein analysis by separating information from the noise in protein spectra.

Multivariate protein data produced by modern instruments can now be analysed,

revalidated and revised.(57, 58)



The principal difficulty in analysing the Amide I band profile for secondary structure

analysis is that the shifts between the component bands are smaller than the intrinsic band width;

as a result, this region generally appears as a broad, lumpy peak instead of a series of

discrete resolved peaks.(40, 59) Due to this complexity the separation of the component

bands becomes difficult. Numeric or statistical methods have been applied to resolve

the component features of the Amide I band in order to analyse protein structure. In

addition, spectral data are multicollinear in nature because the data consist of

continuous signals;(60) it therefore becomes necessary to apply multivariate analysis for

their calibration and to extract useful information from the data.(61, 62)

1.3.1 Multivariate Analysis

Many scientific investigations, including clinical and pharmaceutical tests, essentially

aim to determine the composition of a mixture of chemicals or the structure of

biological samples and to analyse these samples in large quantity and rapidly.

Multivariate data analysis is a statistical technique that has made a significant impact in

the area of computational biology. The main goal of using multivariate analysis for

complex, particularly biological, data is to analyse large datasets by the use of mathematical

models, to classify the data, to derive the relationships between samples through

modelling and to view results graphically.(63) One such computational method that has

been applied for the elucidation of protein secondary structure is Multivariate

Regression (MVR), and it has proven useful for studying variability in spectral data.(64)

Multivariate analysis methods such as Principal Component Analysis (PCA) and Partial

Least Squares (PLS) are very useful for identifying patterns in complex data, for obtaining

an overall view of the entire data and for assessing the significance of, and differences between, data

groups by using all relevant variables within the data.(65) The use of PCA in the

biotechnology world has helped in the visualizing and better understanding of the

underlying structure of data or samples in an unsupervised way.(66) PCA has been

used in the diagnosis of many diseases, including the identification and

characterization of biomarkers in cancer cell lines.(67) In food science, PCA has been

used to identify many products including unknown fruits, wines and vegetable oils.(68)

In FTIR spectra of proteins the absorbance of the sample as a function of wavenumber

is affected by chemical and physical properties of the sample as well as being

complicated by overlapping absorption of structural bands.(69) Multivariate analysis

removes the near multicollinearity found in spectral measurements; collinearity

meaning that the different regions of the spectra may convey the same, and therefore

redundant, information. Both Principal Component Analysis (PCA) and Partial Least

Squares (PLS) were developed to solve the near multicollinearity often found in spectral

measurements and the overfitting that arises when the number of predictor (X) variables is larger

than the number of samples;(70) such a model is said to lack degrees of

freedom and thus over-fits.(71) PCA (which is explained in the next section) is

considered to be an unsupervised learning method because it does not need a reference

data set for training the model, whereas PLS requires a reference set which spells out

some prior knowledge of the data groupings. The general equation of MVR is:

$$Y = X\beta + \varepsilon$$

Eq. (1.3)

where Y is the matrix of predicted variables, X is the matrix of predictor variables (the observations),

β is the matrix of coefficients determined during calibration, and ε is the residual error.

Figure 1.7 shows a schematic illustration of the concepts presented in

equation 1.3.



Figure 1.7. Diagrammatic representation of the generalized multivariate equation

represented in eq. 1.3. ‘Y’ is predicted on the basis of ‘X’ in such a way that the

residual, ‘ε’ is minimized.

1.3.1.1 Unsupervised learning

There are several unsupervised methods used in the bioinformatics and chemometrics

fields today including PCA, Factor Analysis, and Singular Value Decomposition

(SVD). PCA, which was used in this research, is similar to the mathematical SVD

method except that PCA reduces high dimensional data to a lower dimension in order to

reveal any hidden information in the data. The simplicity and usefulness of PCA has

made it one of the most widely used methods for spectroscopic data analysis; (72-74)

PCA also finds the relationship between observations by finding similar factors (their

correlations) in the observations.(75, 76) The variables of ‘X’ (which are the vectors of

intensities at each wavenumber) are mean centred as explained in section 1.3.2.1; then

the correlations among variables (factors) are computed in the form of a correlation or

covariance matrix. The covariance of two variables is their tendency to vary

together.(77) For a sample dataset, this can be written as:

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

Eq. (1.4)

where x̄ is the mean of X, ȳ is the mean of Y, and N is the number of entries (observations) in

the dataset. The eigenvalues and eigenvectors of

the covariance matrix are then calculated, as these give useful information about the data.

The eigenvectors (the loadings) are combinations of the variables (intensities at each wavenumber) that form the axes of a

new multidimensional space, and the eigenvalue associated with each eigenvector is the variance

of the data along that axis. The projections of each sample onto these axes are the scores. The

eigenvector with the highest eigenvalue is the most dominant

principal component (PC) of the dataset, and the PCs are arranged in order of

importance; that is, PC1 is the component with the largest variance, PC2 is the

component with the second largest variance, and so on. The components

that together describe close to 100% of the variance in the data are extracted based on their significance,

leaving the residual as noise; in this way, PCA is used for dimension reduction.(75)

Each principal component is characterized by the scores ‘T’ and the

loadings ‘P’ (the eigenvectors)(78) so that the product T·Pᵀ represents the original

data. This notation is given by:

$$X = T P^{T} + E$$

Eq. (1.5)

where X is the matrix of spectra of observations, T is the matrix of score values, P is the matrix of loading

values, and E contains the residuals, also known as noise. In summary, scores are

linear combinations of the original variables with weights defined by the loadings;

that is, the scores are the product of the scaled matrix X and P (the eigenvectors). PCA scores can be plotted to show

the similarities among samples while the loading plot shows similarities in variables

which are the measured intensities at each wavenumber. The class membership of these

objects or samples may not necessarily be known in advance as this may be inferred

during PCA data exploration.(75) Outliers in the PCA do not belong to a known class;

therefore, PCA can detect them during exploration.
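The steps of PCA described in this section (mean centring, covariance matrix, eigenvector extraction and projection onto the loadings) can be sketched as follows. This is an illustration only and not the implementation used in this research: the data are hypothetical, only the first principal component is extracted, and power iteration is used in place of a full eigendecomposition so that the example needs no external library. The covariance is divided by N, following the covariance equation of this section.

```python
import math


def mean_centre(rows):
    """Subtract each column mean so that every variable fluctuates around zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance_matrix(centred):
    """Covariance between every pair of columns, dividing by N."""
    n = len(centred)
    p = len(centred[0])
    return [
        [sum(row[a] * row[b] for row in centred) / n for b in range(p)]
        for a in range(p)
    ]


def leading_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    p = len(matrix)
    vec = [1.0 / math.sqrt(p)] * p  # arbitrary start; must not be orthogonal to PC1
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    mv = [sum(matrix[a][b] * vec[b] for b in range(p)) for a in range(p)]
    eigenvalue = sum(v * m for v, m in zip(vec, mv))
    return eigenvalue, vec


# Hypothetical toy data: 4 "spectra" (rows) measured at 3 "wavenumbers" (columns)
spectra = [
    [1.0, 2.0, 3.0],
    [2.0, 4.1, 5.9],
    [3.0, 6.0, 9.2],
    [4.0, 7.9, 12.0],
]

centred = mean_centre(spectra)
cov = covariance_matrix(centred)
eigenvalue, loading = leading_eigenpair(cov)

# Scores: projection of each centred spectrum onto the loading vector
scores = [sum(v * p for v, p in zip(row, loading)) for row in centred]

# Fraction of the total variance captured by the first principal component
total_variance = sum(cov[a][a] for a in range(len(cov)))
explained_fraction = eigenvalue / total_variance
```

In this toy example almost all of the variance lies along a single direction, so one component is sufficient; for real spectra further components would be extracted in the same way after removing the contribution of the first.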

1.3.1.2 Supervised Learning

A supervised learning method requires prior knowledge of the data grouping via a

reference data set. Supervised learning methods include Discriminant Factor Analysis

(DFA) and Multivariate Regression Analysis (MVR).(64) DFA is a supervised

clustering method which determines to which group an individual data point belongs. MVR is

used to model the linear relationship between independent variables (the experimentally

observed FTIR measurements) and dependent variables (the structural fractions). Input parameters

for most supervised methods are the independent variables (protein spectra) which are

labelled x1...xn, and the dependent variable which is the vector of the fractions of each

secondary structure motif for the proteins.(79) A latent variable, the discriminant

factor, is created from the spectral wavenumbers as specified in equation 1.6:

$$l_1 = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

Eq. (1.6)

where ‘a’ is a constant, ‘b’ are the discriminant coefficients, and ‘x’ are the discriminating

variables. Among the different multivariate calibration methods, Principal Components

Regression (PCR), which is based on PCA, and PLS have received the most

attention for the quantitative analysis of spectral data because it is easier to interpret

their models and they work well with small sample sizes.(70)

PLS (Partial Least Squares)

PLS is a multivariate regression method based on the generalized equation 1.3, where β

is the matrix of calculated regression coefficients and ε is the matrix of experimental noise.(70,

80) In PLS, the value of β is obtained by:

$$\beta = (X^{T} X)^{-1} X^{T} y$$

Eq. (1.7)

Here, ‘X’ represents the independent measured FTIR spectra and ‘y’ represents the

known secondary structure compositions of the investigated proteins as determined by

X-ray crystallography; ‘X’ is also called a ‘training’ or ‘calibration’ set. PLS regression

is used to predict or determine a new value of ‘y’ from ‘X’ (the FTIR spectral

matrix).(81) PLS modelling captures the variation (or variance) in the spectral matrix in

a low dimensionality model. The total variance of the model is made up of the

explained sum of squares (SSE) of the variation of the data, and the unexplained

variance, the residual sum of squares (SSR).(75) Therefore, the total sum of squares of

the data set (SST) can be written as:

$$SST = SSE + SSR$$

Eq. (1.8)

The explained variance (also called the goodness of fit) of the data in a model is

calculated as:(75)

$$\text{Explained variance} = \frac{SST - SSR}{SST} \times 100$$

PCR (Principal Component Regression)

The PCR method is based on the basic concept of PCA. PCR performs data

decomposition into the loading and score variables used in building a model. For PCR,

the estimated scores matrix consists of the most dominating principal components of X

and can then be used as input for a regression method. The equation then takes the form

given in Eq. (1.9):

$$Y = XTk + \varepsilon$$

where Y is the matrix of predicted response variables, X is the matrix of independent variables, T

is the matrix of PCA scores and ε is the residual error.(70) As in PCA, these

components are linear combinations of X determined by their ability to account for the

variability in X. The first principal component is computed as the linear combination of

the original X-variables with the highest possible variance; however, unlike PLS, PCR

uses only the X variables for the analysis without employing the ‘y’ variable (the

reference set).

Comparing the two regression methods, PLS regression is known to give good

prediction results with fewer components than PCR because PLS looks for components

that explain the correlation between X and y; the response variable ‘y’ is employed in

the regression. However, the two methods often give comparable prediction results.(61,

70)
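The coefficient estimate of eq. 1.7 and the explained variance can be illustrated on a very small calibration set. The sketch below is ordinary least squares only: it does not perform the latent-variable extraction of PLS or the principal component step of PCR, and the data, function names and use of Gaussian elimination are assumptions made for illustration. Following the notation of this section, SSR is the residual (unexplained) sum of squares.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_coefficients(x_rows, y):
    """beta = (X^T X)^-1 X^T y, obtained by solving the normal equations."""
    p = len(x_rows[0])
    xtx = [[sum(r[a] * r[b] for r in x_rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(x_rows, y)) for a in range(p)]
    return solve_linear_system(xtx, xty)


def explained_variance_percent(y, y_pred):
    """((SST - SSR) / SST) * 100, with SSR the residual sum of squares."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    ssr = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    return (sst - ssr) / sst * 100.0


# Hypothetical calibration data: a column of ones (intercept) and one predictor
x_rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.1, 4.9, 7.0]

beta = fit_coefficients(x_rows, y)
y_pred = [sum(b * v for b, v in zip(beta, row)) for row in x_rows]
explained = explained_variance_percent(y, y_pred)
```

With spectral data the normal equations cannot be solved directly in this way, because the number of wavenumbers exceeds the number of samples and the variables are collinear; this is the reason PCR and PLS first compress X into a small number of components.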

1.3.2 Data Pre-processing

Before performing multivariate analyses, data pre-treatment is generally required in

order to reduce the effect of noise, improve the predictive ability of the model and

simplify the model by making the data more normally distributed. Data transformation

is a form of pre-treatment by means of a mathematical formula to change the

distribution of the data, examples of this being logarithmic and exponential

transformation.(61) If the average of each variable is calculated and then subtracted

from the variables, the data are said to be mean-centred. Unit variance scaling means

that the standard deviation (s) for each variable is calculated and by multiplying each

value of that variable by 1/s, all variables are then given an equal weight.(82) This is

also sometimes called variance scaling. These forms of pre-treatments will be

described below.

1.3.2.1 Mean Centring

The reason for centring is to break up the high correlations between each of the main

effect variables and the interaction term; this helps computational accuracy and numeric

stability. (83) Centring converts all the observation points to fluctuations around zero

and not around the mean of the original data. This adjusts for the differences in the

offset between the high and the low numbers.(83) Each variable, x_ij, is centred by

subtracting the column mean, x̄_j, of the data (eq. 1.10a and eq. 1.10b):

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$

Eq. (1.10a)

$$f(x) = x_{ij} - \bar{x}_j$$

Eq. (1.10b)

where i is the row index, j is the column index, and x̄_j is the column mean.

1.3.2.2 Scaling

In spectroscopic analyses, differences in the scaling of spectra may arise from the use

of different instruments, scattering effects, path length or other effects.(84, 85) A

scaling approach is useful when the impact of the low wavenumbers needs to be

considered; it is a way to relatively reduce the influence of larger peaks and increase the



impact of the smaller spectral features. Small features may be deemed less important in

a multivariate analysis as they do not contribute much to the variance of the data.

Caution should be taken as the inflation of small values, which includes any inflation of

measurement error or instrument misalignment may create an increased danger of

altering the biological meaning of the results.(86, 87) The effectiveness of scaling is

based on data analysis and cannot be predicted in advance.(83) Scaling methods like

normalization, standardization, and multiplicative scatter correction, which are used for

multivariate regression analysis, will be described in this section. Figure 1.8 a-c

illustrates the principles of centred and scaled data.

Figure 1.8a-c. Figure 1.8a shows the original data points. Figure 1.8b shows the centred data,

which are the individual points minus the mean of the data; this puts the new mean at 0, making

some of the data negative. Figure 1.8c shows the scaled data: the data look shortened and broadened in an

attempt to decrease group distance. This results in a zero mean and unit variance for any

descriptor variable.

1.3.2.3 Standardization

Auto-scaling, also known as standardization, is scaling in which the column mean is

subtracted from each variable and the result is divided by the column standard deviation; this method

gives each variable a zero mean and a unit variance, as in the standard normal

(Gaussian) distribution.(83) Standardization is performed through:

$$S_j = \sqrt{\frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n - 1}}$$

Eq. (1.11a)

$$f(x) = \frac{x_{ij} - \bar{x}_j}{S_j}$$

Eq. (1.11b)

where x_ij is the variable in the jth column, x̄_j is the mean of the column, and S_j is the

standard deviation of the column.

1.3.2.4 Normalization

Normalization or range scaling is useful in the identification of biological samples and

for relating the variables in a spectrum to one another, as this aids the quantification of the

sample. Range scaling is done by subtracting the minimum value of the spectrum

from each value and dividing by the range (the maximum

value minus the minimum value), in other words, the length of the data. In this row

operation all sample data are made compatible with each other and the spectrum data are

scaled to values between 0 and 1.(88) Range scaling is described in eq. 1.12. Data will

be scaled to a smaller interval if an outlier is present; therefore, outliers should be

removed.

$$f(x) = \frac{x_{ij} - \min(x_i)}{\max(x_i) - \min(x_i)}, \qquad 0 \le f(x) \le 1$$

Eq. (1.12)

where min(x_i) is the minimum value for variable x_i and max(x_i) is the maximum value for

variable x_i.
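Mean centring, standardization and range scaling can each be written in a few lines. The sketch below is a minimal illustration on hypothetical data, with spectra as rows and wavenumbers as columns; the function names are assumptions. Centring and standardization act on columns, with the standard deviation computed using n − 1, whereas range scaling is applied here to a single spectrum.

```python
import math


def column_means(rows):
    """Mean of each column (one mean per wavenumber)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def column_stds(rows):
    """Sample standard deviation of each column, dividing by n - 1."""
    n = len(rows)
    means = column_means(rows)
    return [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(zip(*rows), means)
    ]


def mean_centre(rows):
    """Subtract the column mean from every value in that column."""
    means = column_means(rows)
    return [[v - m for v, m in zip(row, means)] for row in rows]


def autoscale(rows):
    """Mean-centre each column and divide by its standard deviation."""
    means = column_means(rows)
    stds = column_stds(rows)
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def range_scale(values):
    """Scale one spectrum to the interval 0..1 using its minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical data: 3 spectra (rows) at 2 wavenumbers (columns)
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

centred = mean_centre(data)
scaled = autoscale(data)
normalized = range_scale([2.0, 4.0, 6.0])
```

After autoscaling, both columns take the same values even though their original magnitudes differ by a factor of ten, which illustrates how all variables are given an equal weight.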



1.3.2.5 Multiplicative Scatter Correction

Multiplicative Scatter Correction (MSC) is another popular method used for the

correction of baselines and for scaling.(89) It is used

in spectroscopy to correct signals for light scattering, noise or

changes in path length, where the scatter of each sample is estimated relative to that of an

ideal sample.(89) Spectra in a dataset are corrected so that all samples appear to have

the same scatter level. The correction is done by regressing the measured sample

spectrum against a reference spectrum and then the slope of the fit is used in correcting

the measured protein spectrum.(90) In MSC, as described in eq. 1.13a-b, a reference

spectrum is the mean spectrum of the dataset (r) and the spectrum to be corrected is

defined as (xi). An offset correction like mean-centring is best performed on vector ‘r’

and ‘x’ before the regression.(91-93) For each sample a least squares regression is done

to get an estimate of an unknown factor ‘b’ as shown in eq. 1.13b; this is then used in

the correction of each ith spectrum in the dataset (eq. 1.13a),

where x = MSC corrected spectrum, Xi = Spectrum in the dataset, b = estimated

unknown factor from eq. 1.13b, and r = the mean spectrum of the dataset.
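A minimal sketch of the multiplicative scatter correction described in this section is given below. It assumes that the reference is the mean spectrum of the dataset, that both the reference and the spectrum are mean-centred before the regression, and that the corrected spectrum is the centred spectrum divided by the estimated factor b. The data and function names are hypothetical.

```python
def mean_spectrum(spectra):
    """Reference spectrum r: the mean of all spectra at each wavenumber."""
    n = len(spectra)
    return [sum(col) / n for col in zip(*spectra)]


def centre(values):
    """Offset correction: subtract the mean of the vector."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def msc_correct(spectrum, reference):
    """Return the corrected spectrum and the estimated factor b."""
    r = centre(reference)
    x = centre(spectrum)
    # Least-squares slope of x regressed on r: b = (r^T r)^-1 r^T x
    b = sum(ri * xi for ri, xi in zip(r, x)) / sum(ri * ri for ri in r)
    return [xi / b for xi in x], b


# Hypothetical dataset: the second spectrum is the first one scaled and offset
spectra = [
    [1.0, 2.0, 3.0, 4.0],
    [2.5, 4.5, 6.5, 8.5],
]

reference = mean_spectrum(spectra)
corrected = [msc_correct(s, reference)[0] for s in spectra]
factors = [msc_correct(s, reference)[1] for s in spectra]
```

Because the two hypothetical spectra differ only by a multiplicative factor and an offset, they coincide after correction, which is the intended effect of bringing all samples to the same scatter level.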

$$\mathrm{msc}(x) = \frac{x_i}{\hat{b}}$$

Eq. (1.13a)

$$b = (r^{T} r)^{-1} r^{T} x_i$$

Eq. (1.13b)

1.3.2.6 Outliers

Observations can show unexpected variability; such observations are called outliers. Outliers

are extremely low or extremely high values in the data set(94) as depicted in Figure 1.9;
these outliers are different from the rest of the data for several reasons: instrumental

errors, sample errors (which could be due to incorrect sample labelling or simply a

sample from another data population), or even noise.(92, 93) These factors can have

considerable influence on the calibration. This discrepancy can be eliminated by

removing any outliers after thorough investigation.

[Plot: intensity against wavenumber, with one point marked as an outlier]

Figure 1.9. Outlier plot showing a red dot as an outlier, which clearly does not correlate

with the rest of the data.

1.3.2.7 Detection of Outliers

Outliers can be detected by looking at a PCA score plot or by looking for a sample with a

high residual, which may mean that the measurement of that sample is erroneous

or less well described by the PCA model. Eq. 1.14a and 1.14b show a simple statistical

rule for detecting an outlier. Firstly, all data points for a given data set are ordered.

The median of the first half is designated Q1 and the median of the second half is

designated Q3. Subtracting Q1 from Q3 gives the interquartile range (IQR). Any

data point (xi) more than 1.5 × IQR above Q3 is an outlier; likewise, any xi more than

1.5 × IQR below Q1 is an outlier.(95)

$$x_i < Q_1 - 1.5\,(IQR)$$

Eq. (1.14a)

$$x_i > Q_3 + 1.5\,(IQR)$$

Eq. (1.14b)

where xi is the data point, Q1 and Q3 are the first and third quartiles, and IQR is the interquartile range. In

multivariate analysis, outliers are detected using equation 1.15, which represents the

Mahalanobis distance, the square of the distance from the point x to the mean μ

in units of the standard deviation,(96)

where ‘x’ is the data point, ‘μ’ is the mean of the data and ‘δ’ is the covariance matrix of the data, the multivariate counterpart of the standard deviation.
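The interquartile rule for a single variable can be sketched as follows. The example assumes the usual fences, Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, and takes Q1 and Q3 as the medians of the lower and upper halves of the ordered data; the intensity values and the function name are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]   # excludes the middle value when n is odd
    upper_half = ordered[-half:]
    q1 = statistics.median(lower_half)
    q3 = statistics.median(upper_half)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical intensities at one wavenumber across eight samples
intensities = [0.10, 0.12, 0.11, 0.13, 0.12, 0.14, 0.40, 0.11]

outliers = iqr_outliers(intensities)
```

Only the value 0.40 falls outside the fences in this example. As noted in this chapter, a flagged point should be investigated before it is removed from the calibration set.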

$$(x - \mu)^{T}\,\delta^{-1}\,(x - \mu)$$

Eq. (1.15)

1.3.2.8 Derivatives

The first and second derivative transformations reduce replicate variability, correct

baseline shift, resolve overlapping peaks, and amplify spectral variations. Broad

overlapping bands of raw spectra become better resolved in a first derivative because it

removes the additive (constant) baseline, but a baseline with a constant slope remains as an

offset in the derivative spectrum;(71) therefore, it is not suitable for obtaining quantitative assignments.

If the first derivative is the slope at each point of the raw spectra, the second derivative

is the slope of the first derivative.(89) The second derivative removes the linear

baseline and thus the ambiguity and complication of assigning the positions of the

features is removed.(97) For this reason, there is an increase in the number of

discriminative features associated with protein spectra and this improves the clarity of
the spectra. Second derivative spectra are commonly used for protein classification;

however, data smoothing should be done first because first and second derivatives

magnify the noise in the spectroscopic data.(69) The formulas for the first and second

derivatives are given in equations 1.16a and 1.16b respectively,

where y_w is the measured spectrum at wavenumber w.
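The effect of the derivative transformations on a baseline can be shown with simple finite differences. The sketch below assumes the difference forms y_w − y_(w−1) for the first derivative and y_(w−1) − 2y_w + y_(w+1) for the second, applied to equally spaced points without smoothing; the data are hypothetical.

```python
def first_derivative(y):
    """First difference: y[w] - y[w - 1] for each point after the first."""
    return [y[w] - y[w - 1] for w in range(1, len(y))]


def second_derivative(y):
    """Second difference: y[w - 1] - 2*y[w] + y[w + 1] for interior points."""
    return [y[w - 1] - 2 * y[w] + y[w + 1] for w in range(1, len(y) - 1)]


# Hypothetical sloping baseline with a constant offset of 5 and a slope of 2
baseline = [2 * w + 5 for w in range(6)]

# Hypothetical curved signal sitting on the same baseline
signal = [w * w + 2 * w + 5 for w in range(6)]

baseline_d1 = first_derivative(baseline)
baseline_d2 = second_derivative(baseline)
signal_d2 = second_derivative(signal)
```

The first difference of the baseline is a constant equal to its slope, so the offset has been removed but the slope remains; the second difference of the baseline is zero everywhere, while the curvature of the signal is retained.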

$$f'(x) = y_w - y_{w-1}$$

Eq. (1.16a)

$$f''(x) = y_{w-1} - 2y_w + y_{w+1}$$

Eq. (1.16b)

1.3.3 Validation of the model

Models always have to be validated in order to verify and ensure that they really hold

true in the experimental domain investigated. The validation is needed in order to be

able to decide whether or not the conclusions drawn from it are reliable,(98) i.e. to make

sure that the results can be extrapolated to new data. To estimate the prediction ability

of a model, cross-validation of the data is done by keeping a sample or a set of the data

out of the regression analysis and using the model from the analysis to predict the left

out set; this is done for each set of the data, which is left out of the regression only once.

The prediction errors are averaged to give the Root Mean Square Error for Cross-Validation

(RMSECV);(99-102) more information on cross-validation is given

in the Multivariate Calibration Chapter 5. Cross-validation is done to eliminate the

tendency to overfit with too many PLS or PCR components. The number of components

that gives the lowest RMSECV is desired and is used to fit the data. It is also good to

know how well a model predicts the data in an independent test set; for this the Root

Mean Squared Error for Prediction (RMSEP) is calculated. To calculate the RMSEP the

individual errors are squared, added together, divided by the number of errors, and then


square rooted. This gives a single number which summarizes the overall error.(103) The

formula for generalized Root Mean Square Error is given in eq. 1.17.
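The cross-validation procedure and the root mean square error can be sketched together. The example below is illustrative only: it uses leave-one-out cross-validation with a single-predictor straight-line model in place of PLS or PCR, and it pools the left-out predictions into one RMSE value, which is one common way of summarizing the errors. The data and function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Square the errors, average them, and take the square root."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def leave_one_out_rmsecv(x, y):
    """Leave each sample out once, predict it, and pool the errors."""
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        predictions.append(slope * x[i] + intercept)
    return rmse(y, predictions)


# Hypothetical calibration data with a small amount of noise
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_noisy = [3.1, 4.9, 7.2, 8.8, 11.1]
y_exact = [3.0, 5.0, 7.0, 9.0, 11.0]

rmsecv_noisy = leave_one_out_rmsecv(x, y_noisy)
rmsecv_exact = leave_one_out_rmsecv(x, y_exact)
```

For a PLS or PCR model the same loop would be repeated for each candidate number of components, and the number giving the lowest RMSECV would be retained.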

Determination of the unknown 1.3.4

Selection criteria are necessary for the determination of unknown samples as the

training set should have sufficient representation of the group of samples. A large

database is necessary to represent all future samples that may be analysed by the

model.(104)

1.4 Aim of this research

The goal of my research is to develop computational tools that can give better

accuracies for the quantitative analyses and classifications of IR protein spectra to

determine their secondary structures and tertiary folds.

The computational tools developed in this research can contribute to the assessment and

the interpretations of the complexity of protein structures, both in native and denatured

states. The protein structure investigative tools were made by combining machine
learning and multivariate analysis (MVA) methods with Fourier Transform Infrared
(FTIR) spectroscopy; these tools are applicable in practice.

(Eq. 1.17)  $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$


1.4.1 Research Objectives

Although X-ray crystallography is highly accurate in determining protein structure, it
is time consuming and not applicable to all proteins; NMR is limited by protein size and
its data are not easy to analyse. Complementary methods are therefore required for
proteins whose structure is not easily determined using these techniques. New methods
may also be useful for other purposes, such as to rapidly identify the critical regions
of a spectrum for any given secondary structure motif and to investigate important
structural transitions and processes such as protein folding or aggregation. A technique
well suited to these types of studies is FTIR spectroscopy, the experimental method
chosen for the research presented in this thesis.

Infrared spectroscopy can characterise protein secondary structure conformation;
therefore, it was used in this research to determine the secondary structure motifs of
the 84 proteins in H2O solution chosen for this research. FTIR protein measurements done
in-house extended the dataset beyond the 28 protein spectra initially provided. The
secondary structures of the reference proteins were obtained from their X-ray
crystallographic structures using the Define Secondary Structure of Proteins (DSSP)
script.

To develop a good regression model, the performance of different regression techniques
(PCR, PLS and iPLS) on the 84 protein samples was evaluated; the model with the best
predictive performance, judged by the lowest Root Mean Square Error (RMSE) and the
highest Coefficient of Determination (R²), was chosen. To enhance the predictive ability
of each developed model, different data pre-treatment methods, i.e. centring,
normalization, Standard Normal Variate (SNV), 2nd derivative and the Kennard-Stone data
partitioning algorithm, were tested on the ATR-FTIR protein spectra dataset. They were
found to greatly affect the outcome of the data analysis. These methods, and the results
obtained with very good accuracies, are discussed in more detail in the individual
chapters of this thesis.

One main objective of this study is to apply a predictive model to estimate the
secondary structure content of α-lactalbumin at each pH. The spectral features of the
denatured protein were evaluated using Principal Component Analysis (PCA), as this
analytical method can be used for data reduction, to explore correlations in the data
and to understand the biological differences between protein spectra. The most important
spectral bands were identified using the PCA loadings plot; the identified spectral bands
were linked to the discriminating features of each spectrum. Previous spectroscopic
studies have not made an accurate estimation of the secondary structure content of
denatured α-lactalbumin at each pH level, as has been done in this study.

Secondly, to evaluate the predictive quality of the model, the limit of quantification for

6 proteins (BSA, Ubiquitin, β-lactoglobulin, lysozyme, fibrinogen and IgG) was studied

by analysing the structural content of each protein at the lowest possible concentration.

This study showcases the strength of using FTIR spectra and multivariate analysis for
the detection of low protein concentrations in solution. The results are comparable to,
and in some cases better than, those of the Circular Dichroism (CD) method, which is
known to work well with low protein concentrations.

EXPERIMENTAL METHOD

56

2 Experimental Method: ATR-FTIR spectroscopy

2.1 Introduction

Functional groups present in a molecule tend to absorb IR radiation in the same

wavenumber range regardless of other structures in the molecule, and spectral peaks are

derived from the absorption of bond vibrational energy changes in the IR region.(43)

Thus, there is a correlation between IR band frequencies and chemical structures in the

molecule. In addition to providing qualitative information about functional groups, IR

spectra can provide quantitative information, such as the percentage of the secondary

structure of proteins. For this research, the ATR-FTIR sampling technique was used for

the spectral measurements. The most common sampling techniques used for protein

characterization are transmittance, diffuse reflectance, and attenuated total reflectance

(ATR).(39)

For more explanation on ATR-FTIR techniques, see Chapter 1 – Spectral Analysis.

Analysis of the structural content in the infrared bands, particularly from the Amide I
(1700-1600 cm⁻¹) and Amide II (1600-1500 cm⁻¹) regions, will be the main focus as these
have been the most commonly used in the literature for protein studies. Before any type
of analysis was done to elucidate the structure of proteins, data pre-processing was
carried out to eliminate noise, outliers and missing spectral values, especially in the
Amide III region.(105) This section focuses on the collection and preparation of samples
for mathematical evaluations.



2.2 Sample preparation

Ninety proteins were measured in H2O; 28 of these were supplied by Dr. Parvez Haris of
De Montfort University (highlighted in light grey in Table A2, Appendix A), and the rest
were collected here at the University of Manchester. The proteins listed in Table A2 in
Appendix A were bought from Sigma-Aldrich and were used without further purification.
Protein samples were prepared at a concentration of 50 mg/mL in distilled H2O.
Concentrations of 25 mg/mL or higher provide good precision and reproducible,
quantifiable results;(106) however, quantification at lower protein concentrations, down
to 0.2 mg/mL, is discussed in Chapter 5. The pH of each sample was recorded and all
ATR-FTIR spectra were measured shortly after sample preparation.

2.3 Spectral measurement and Processing

The ATR accessory, made of a zinc selenide (ZnSe) crystal, was placed into the sample
compartment of a Bruker-Tensor FTIR spectrometer fitted with an MCT detector for the
spectral measurements. 30 µL of each protein solution was placed on the ATR cell. A
background spectrum of the H2O used in the protein preparation was measured for each
protein and subtracted from the signal of that protein sample. IR measurements were made
at 4 cm⁻¹ spectral resolution, with 32 scans per sample, over the range 4000-950 cm⁻¹.
At resolutions coarser than 4 cm⁻¹, small spectral peaks tend to disappear, which makes
the spectra less suitable for qualitative examination of spectral components. At higher
resolution, the signal-to-noise ratio may be low; higher resolution (e.g. 1 cm⁻¹) is
typically used for gas-phase spectra.(35) Acquisition time was less than two minutes for
each background and sample measurement.


2.3.1 Baseline Correction and Atmospheric Correction

A measured spectrum should have a flat baseline at zero intensity. Baseline offsets are
normally removed by subtracting the minimum absorbance value from the spectrum. This
correction was performed for all spectra in the dataset to remove background absorbance
and light scattering effects. Water subtraction was done automatically by a customized
routine in the OPUS 6.2 software for every spectrum and no smoothing function was
applied.
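The offset removal described above amounts to one subtraction per spectrum. A minimal Python sketch is given below; the function name `remove_baseline_offset` is hypothetical and the spectrum is assumed to be a plain list of absorbance values. The OPUS routine itself is not reproduced here.

```python
def remove_baseline_offset(spectrum):
    """Shift a spectrum so that its minimum absorbance sits at zero."""
    offset = min(spectrum)
    return [value - offset for value in spectrum]
```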

2.3.2 ATR correction

The principles of attenuated total reflection (ATR) have been explained in Chapter 1.
ATR is now a commonly used sampling technique in FTIR spectroscopy, as it requires
minimal sample preparation. However, one of the challenges of ATR is that its infrared
spectra are not identical to those measured in transmission. ATR spectroscopy introduces
some distortion in relative band intensities and absolute shifts to lower frequency.(107)
The result is higher absorbance around the Amide II region than the Amide I region of a
spectrum, and this difference between ATR and transmission spectra may lead to incorrect
interpretations. These abnormalities were corrected on all in-house collected spectra
using the ‘Advanced ATR Correction’ feature of Bruker’s OPUS software. The distortion
can be seen for the fibrinogen spectrum shown as the blue trace in Figure 2.1a. The
Advanced ATR Correction corrected differences in peak positions and relative intensity
ratios of bands in the Amide I and Amide II regions, as seen in the resulting green
trace. The ATR-corrected spectrum should improve the reliability of spectral searches for
fibrinogen in spectral libraries. However, for most spectra in this dataset the ratio of
Amide I to Amide II peaks normally seen in IR transmission spectra was not obtained.
This is evident in the plot for carbonic anhydrase in Figure 2.1b, where the
ATR-corrected spectrum has a ratio of 0.9 to 1 for Amide I to Amide II peak intensity
while the transmission spectrum of the same protein has a ratio of 0.6 to 1. The
observation is that ATR correction can be less satisfactory for samples with strong
absorption peaks.(108)

Figure 2.1a. Blue spectrum of fibrinogen is ATR-corrected to attain the correct Amide
I:Amide II band ratio of its transmission counterpart.
[Plot: intensity (0 to 1) against wavenumber (1350 to 1850 cm⁻¹); traces: FTIR/ATR and FTIR/ATR Corrected.]

Figure 2.1b. The Amide I to Amide II transmission band ratio is not fully attained by the
ATR-corrected carbonic anhydrase spectrum (in blue).
[Plot: intensity (0 to 1) against wavenumber (1350 to 1850 cm⁻¹); traces: ATR-corrected Spectrum and Transmission Spectrum.]


2.4 Data Pre-processing for Multivariate Analysis

The proteins’ secondary structural motif fractions were obtained using the Define
Secondary Structure of Proteins (DSSP)(109) script. The Protein Data Bank (PDB) does not
give the percentage of secondary structure assignments beyond α-helix, β-sheet and
other; however, the PDB file for each protein contains the structural information which
can be fed into the DSSP script. The DSSP code assigns eight secondary structure types,
of which six are commonly used in protein structure prediction, namely α-helix, β-sheet,
π-helix, 3₁₀-helix, turns and coils (other), denoted by (H), (E), (I), (G), (T) and
(“ ”), respectively.(110) A Perl script was developed in-house to extract these
structural compositions from PDB files using the DSSP script, and all spectral
pre-processing steps were performed using Matlab versions R2010 and R2013. Of the six
secondary structure assignments, π-helix was not considered for this project because it
is a rare protein secondary structure element; however, the β-sheet structure was further
split into parallel and antiparallel β-sheets. See Appendix A, Table A2 for the 90
proteins and their secondary structure fractions as derived from the DSSP script. Data
pre-processing was done on all collected protein spectra to identify and remove outliers,
to check for inconsistencies and missing values, and to normalize the intensities of the
data. Figure 2.2 shows the raw spectra of the 90 protein samples.
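To make the fraction calculation concrete, the sketch below counts DSSP one-letter codes in a per-residue assignment string and converts the counts to fractions. It is a Python illustration under stated assumptions, not the in-house Perl script: the function name `secondary_structure_fractions` and the input format (one character per residue) are hypothetical, every code not named in the table (blank, π-helix and the remaining DSSP states) is pooled into "other", and the split of β-sheet into parallel and antiparallel is not attempted.

```python
from collections import Counter

# DSSP one-letter codes for the classes named in the text.
CLASS_NAMES = {
    "H": "alpha-helix",
    "E": "beta-sheet",
    "G": "3-10 helix",
    "T": "turn",
}


def secondary_structure_fractions(dssp_codes):
    """Return the fraction of residues in each class for a DSSP string."""
    if not dssp_codes:
        raise ValueError("empty DSSP assignment string")
    # Any code not listed above is pooled into 'other' (an assumption).
    counts = Counter(CLASS_NAMES.get(code, "other") for code in dssp_codes)
    total = len(dssp_codes)
    classes = list(CLASS_NAMES.values()) + ["other"]
    return {name: counts[name] / total for name in classes}
```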



Figure 2.2a. ATR-FTIR spectra of the 90 proteins in H2O solution (spectra of the Amide I
and Amide II regions from different instruments) before scaling, removal of outliers and
second derivative.
[Plot: absorbance (-0.2 to 0.6) against wavenumber (1400 to 1700 cm⁻¹).]

Figure 2.2b. ATR-FTIR spectra of the 90 proteins in H2O solution (spectra of the Amide I
and Amide II regions from different instruments) after normalization but before outlier
detection.
[Plot: normalized intensity (0 to 1) against wavenumber (1400 to 1700 cm⁻¹).]


2.4.1 Outlier Detection

For proteins in H2O, the Amide III region often had missing data; therefore only the
Amide I and II regions of the spectra were used. The spectra with missing values were
the 28 protein spectra not collected in-house (mentioned in Section 2.2). Samples with
the Amide III region were used for some experiments that required the study of fewer
proteins, e.g. the study of protein denaturation in Chapter 5. Spectra that were too
noisy to resolve were also removed from the dataset. The removal of unusable spectra
reduced the database size from 90 to 84. The six removed spectra are: amyloglucanase,
galactose oxidase, D-glucosamine, isocitric dehydrogenase, spectrin, and theamalysin.

Statistical methods for multivariate outlier detection flag observations far from the
centre of the data distribution as outliers.(93) The Mahalanobis distance was used to
identify outliers; it is described in Section 1.3.2.4 (Detection of Outliers) of
Chapter 1, but essentially involves measuring the squared distance of each spectrum from
the mean of the spectra. Any point whose distance lies outside the calculated
radius,(93, 94) as given by equation 1.4, is an outlier. The steps are listed below:

1. Spectral matrix: an n × d data set, S, where n is the number of samples in the dataset.
2. Calculate the mean (μ) and variance-covariance (Σ) of the matrix.
3. Let M be the n × 1 vector of squared Mahalanobis distances to μ, with elements
   $M_i = (x_i - \mu)^{\prime}\,\Sigma^{-1}\,(x_i - \mu)$.
4. Find the points ‘p’ whose distance $\sqrt{M_i}$ is greater than the cut-off
   $\sqrt{\chi^2_{d}(0.975)}$, where $\chi^2_{d}(0.975)$ is the 0.975 quantile (inverse
   cumulative distribution) of the χ² (chi-square) distribution with ‘d’ degrees of freedom.
5. Return ‘p’ (the outliers, whose values are greater than the cut-off).
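The steps above can be sketched in Python as follows. To stay self-contained the sketch makes two assumptions that are not taken from the original analysis: each spectrum is summarised by two features (for example its mean and standard deviation, the quantities plotted in Figure 2.3a), so d = 2, and for d = 2 the χ² quantile has the closed form -2 ln(1 - p). The function name `mahalanobis_outliers` is hypothetical.

```python
import math


def mahalanobis_outliers(points, p=0.975):
    """Return indices of 2-D points beyond the chi-square cut-off."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample variance-covariance matrix (n - 1 in the denominator).
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Chi-square quantile for 2 degrees of freedom (closed form).
    cutoff = -2.0 * math.log(1.0 - p)
    flagged = []
    for i, (x, y) in enumerate(points):
        dx, dy = x - mean_x, y - mean_y
        # (x - mu)' Sigma^-1 (x - mu), using the explicit 2x2 inverse.
        m_squared = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if m_squared > cutoff:
            flagged.append(i)
    return flagged
```

Comparing the squared distance with the χ² quantile is equivalent to comparing the distance with its square root. For full spectra (d in the hundreds) a numerical library would be needed for the matrix inverse and the χ² quantile.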



The plot of the mean and standard deviation of each sample in Figure 2.3a shows that
protein number 88 (D-glucosamine) and protein number 22 (spectrin) are above the
significance level. D-glucosamine differs from the rest of the samples because it is a
sugar and not a protein, and it belonged to a different dataset. The spectrin sample
could have been affected during measurement; if left in the dataset it shows as an
outlier and affects the results of the spectral analysis, so it was taken out of the 29
samples sent by Dr. Parvez Haris of De Montfort University. The mean prediction errors
of these two samples were higher than normal and they were therefore treated as
unreliable. These samples were removed, as shown in Figure 2.3b.

Figure 2.3a. Outlier detection using Mahalanobis distance. The plot shows D-glucosamine
(sugar) and spectrin as outliers. The mean and the standard deviation of each spectrum
are plotted on the x-axis and y-axis, respectively.


Figure 2.3b. Distribution of protein samples after outliers were taken out.

2.4.2 Centring and Scaling

Intensities of the 84 protein spectra measured in H2O were scaled to prevent spectra
with strong peaks from obscuring spectra with smaller bands that carry equally
significant spectral features.(84) The dataset was divided into the Amide I and Amide II
regions, and each region was scaled separately to preserve the significance of each
contributing spectrum. Normalization, or range scaling, used for this dataset is useful
in the identification of biological samples and for the relationship of the variables in
a spectrum, as this aids in the quantification of the sample.(88) Range scaling is done
by subtracting the minimum value of a spectrum from each of its values and dividing by
the range of the spectrum (the maximum value minus the minimum value); see Section 1.3.2
in Chapter 1 for more information on scaling. In this row operation all sample data are
made compatible with each other and spectral data are scaled to values between 0 and
1.(88) Standardization, which is a column operation also used for scaling, was performed
on the dataset for comparison purposes. Chapter 1, Section 1.3 gives the formulae for
normalization and standardization. Standardization is useful when the precise number of
biological components varies but the relative proportions of the sample structure, as in
the percentage of their secondary structure compositions, can be measured
quantitatively.(111) In the methods below, normalization and Standard Normal Variate
(SNV) were first performed separately and then together to evaluate their effect. The
scaled data are shown in Figure 2.4.
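Range scaling of a single spectrum can be written in a few lines. The Python sketch below is illustrative only; the function name `range_scale` is hypothetical and the spectrum is assumed to be a list of intensities for one region (Amide I or Amide II).

```python
def range_scale(spectrum):
    """Scale one spectrum to the interval [0, 1] (a row operation)."""
    lowest, highest = min(spectrum), max(spectrum)
    if highest == lowest:
        raise ValueError("a flat spectrum cannot be range scaled")
    span = highest - lowest
    return [(value - lowest) / span for value in spectrum]
```

Applied to every row of the spectral matrix, this gives each spectrum a minimum of 0 and a maximum of 1, as in the normalization row of Table 2.1.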

Figure 2.4. The 84 scaled protein spectra (Amide I region) measured in H2O.
[Plot: intensity (0 to 1) against wavenumber (1600 to 1700 cm⁻¹).]

2.4.3 Multiplicative Scatter Correction (MSC)

Both the original and scaled data were further pre-treated with MSC as described in
Section 1.3 of Chapter 1. MSC is a relatively simple processing step that attempts to
account for scaling effects and offset (baseline) effects.(112) This method should
correct the multiplicative effect in every sample to the level of the mean sample.
Judging from the plots below (Figures 2.5a and 2.5b) it may be easy to conclude that
multiplicative scatter correction was effective, because the spectra are tightly grouped
with better separated peaks than those in Figure 2.4, but this will only be confirmed if
regression analyses on these data give lower prediction errors.
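One common way of implementing MSC is to regress each spectrum on the mean spectrum and then remove the fitted offset and slope. The Python sketch below follows that formulation as an assumption; it is not taken from the Matlab code used in this work, and the function name `msc` is hypothetical.

```python
def msc(spectra):
    """Correct each spectrum towards the mean spectrum of the set."""
    n_points = len(spectra[0])
    # Reference: the mean spectrum over all samples.
    reference = [sum(s[j] for s in spectra) / len(spectra)
                 for j in range(n_points)]
    ref_mean = sum(reference) / n_points
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for spectrum in spectra:
        spec_mean = sum(spectrum) / n_points
        # Least-squares fit: spectrum = intercept + slope * reference.
        slope = sum((r - ref_mean) * (v - spec_mean)
                    for r, v in zip(reference, spectrum)) / ref_ss
        intercept = spec_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in spectrum])
    return corrected
```

Spectra that differ from the mean only by an offset and a multiplicative factor collapse onto the mean spectrum after this correction, which is the "pulling towards the mean" seen in the MSC plots.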

Figure 2.5a. MSC pre-processing performed on the 84 raw spectra, showing a reduction in
spectral intensities. Spectra are pulled towards the mean of the raw spectra.
[Plot: intensity (-0.1 to 0.25) against wavenumber (1600 to 1700 cm⁻¹).]

Figure 2.5b. MSC pre-processing performed on normalized spectra. Spectra are pulled
towards the normalized mean sample.
[Plot: intensity (-0.4 to 1.4) against wavenumber (1600 to 1700 cm⁻¹).]



2.4.4 Standard Normal Variate (SNV)

Figures 2.6a and 2.6b below show the result of SNV on the protein samples measured in
H2O. SNV subtracts the mean of each spectrum from each of its values and divides by the
standard deviation of the spectrum, which gives each sample a standard deviation of 1.
Judging from the intensity of the plots, it appears that SNV scaling gave a much tighter
grouping of the data when they were normalized than without normalization; Table 2.1
shows the minimum and maximum intensities of the spectra after scaling with each of
these methods. SNV and MSC plots look very similar in their peak separation and
clustering; their performance in regression analysis will be compared in Chapter 4.
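A minimal Python sketch of SNV for one spectrum is given below. The function name `snv` is hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which estimator was used.

```python
import math


def snv(spectrum):
    """Centre a spectrum on its own mean and scale by its own std."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    # Sample standard deviation of this spectrum (assumed estimator).
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]
```

Unlike MSC, SNV needs no reference spectrum: each row is corrected using only its own mean and standard deviation.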

Figure 2.6a. SNV pre-processing performed on the 84 raw protein spectra.
[Plot: intensity (-2.5 to 2.5) against wavenumber (1600 to 1700 cm⁻¹).]

Figure 2.6b. SNV pre-processing performed on the 84 normalized protein spectra.
[Plot: intensity (-2 to 2.5) against wavenumber (1600 to 1700 cm⁻¹).]

2.4.5 2nd Derivative

Since spectra were recorded on different days and in different labs, which are factors
known to affect spectral response, another way of correcting instrumentation effects was
investigated: the second derivative of all spectra was calculated and the results
compared with those of normalization and standardization. With FTIR data it is not
uncommon to differentiate spectra to the 1st or 2nd derivative as this improves the
separation of unresolved peaks.(113) Second derivative analyses were employed to reduce
any scatter effects and to extract any latent spectroscopic information that may be
hidden in the broadened background. Matlab software provides a derivative function which
was used to determine the 2nd derivatives. The plots in Figure 2.7 show the 2nd
derivatives of the normalized 84 protein samples in H2O.
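For evenly spaced data the second derivative at point w can be approximated by the central difference (y[w+1] - 2 y[w] + y[w-1]) / h², where h is the point spacing. The Python sketch below uses that formula as an assumption; the Matlab function actually used is not specified in the text, and the name `second_derivative` is hypothetical.

```python
def second_derivative(y, spacing=1.0):
    """Central-difference second derivative of evenly spaced data.

    The result has two fewer points than the input because the end
    points have no neighbour on one side.
    """
    return [(y[w + 1] - 2 * y[w] + y[w - 1]) / spacing ** 2
            for w in range(1, len(y) - 1)]
```

Differencing amplifies noise, so in practice second derivatives are often combined with smoothing; no smoothing is included in this sketch.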




Figure 2.7. Second derivative of the normalized ATR-FTIR spectra of the 84 proteins
collected in H2O.
[Plot: d²A/dν² (-0.1 to 0.1) against wavenumber (1620 to 1700 cm⁻¹).]

Table 2.1. Pre-processing methods and the minimum and maximum values of the resulting
spectral intensities.

Pre-processing Method         Min Value    Max Value
Normalization                  0            1
MSC on Raw Spectra            -0.0743       0.2645
MSC on Normalized Spectra     -0.1956       1.4587
SNV on Raw Spectra            -2.3510       2.1815
SNV on Normalized Spectra     -1.7878       2.2367
2nd Derivative                -0.0471       0.0494


2.5 Discussions

Although it is common for pre-processing methods to be used before multivariate

modelling, there is no set standard approach and the results of multivariate analysis are

affected by the methods used. This also means that data pre-processing is often revisited

to ascertain optimal solutions before performing any computational analysis.

The ATR-FTIR technique is highly popular, but it introduces some spectral distortions
that are problematic when comparing ATR and transmission spectra.(114, 115) The ATR
correction does improve the similarity between the transmission and ATR spectra, but the
correction is far from perfect. Corrected ATR-FTIR spectra will be used for both
quantitative and qualitative analysis in this project for compatibility purposes.

One common method of pre-processing used on spectroscopic data is normalization, which
involves setting each observation (spectrum) to have unit total intensity by expressing
each data point as a fraction of the total spectral integral. This is referred to as
normalization to a constant sum.(88) Normalization helps to reduce the variations in
data caused by different scaling and makes apparent the peak intensities that are the
discriminating features. Standardization, which is also commonly used, is performed on
the columns of spectral intensities across all samples, giving each column of the data a
mean of zero and a standard deviation of 1. The weighting of standardization reflects
the data correlation and is performed so that the components of multivariate analyses,
such as Principal Component Analysis (PCA) and Partial Least Squares (PLS), have the
centre of the data as their origin, which results in the use of a minimal number of
components in a model.(116)
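The difference between the row operation (normalization to a constant sum) and the column operation (standardization) can be illustrated with the Python sketch below. The function names are hypothetical, the spectral matrix is assumed to be a list of rows with one spectrum per row, and the sample standard deviation (n - 1) is an assumed choice.

```python
import math


def normalize_to_constant_sum(spectrum):
    """Row operation: express each point as a fraction of the total."""
    total = sum(spectrum)
    if total == 0:
        raise ValueError("spectrum sums to zero")
    return [value / total for value in spectrum]


def standardize_columns(matrix):
    """Column operation: give every column zero mean and unit std."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in matrix)
                      / (n_rows - 1))
            for j in range(n_cols)]
    return [[(row[j] - means[j]) / stds[j] for j in range(n_cols)]
            for row in matrix]
```

Normalization uses only the values within one spectrum, whereas standardization of a column depends on every sample in the dataset, so adding or removing a sample changes the standardized values of all the others.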



From the above plots, MSC operations on the original dataset tightened and reduced the
intensities of the data, while SNV centred the data (to give a covariance) and scaled
them at the same time to give a correlation of the samples in the dataset. Comparing the
SNV plots in Figure 2.6a/b with the plot of the normalized spectra in Figure 2. it is
apparent that SNV gives tighter groupings of the measurements and a better attempt at
peak separation. In fact, SNV has been shown to do better with infrared spectroscopic
data of this sort;(65) however, the normalized plot keeps the spectra closest to their
original form. Again, comparison of the modelling results from these methods will decide
which pre-processing method(s) to use.

Pre-processing methods employed by an analyst depend on the intended purpose, for
example the biological information to be extracted or the variations in the data to be
examined.(116) Whereas different analysts use one method or another, a combination of
the two methods may be useful for data pre-processing to correct multiplicative scatter
effects and aid in the extraction of important features through multivariate analysis.
Normalization also allows for the identification of biological samples and for the
analysis of the factors contributing most to the variations in the data. This can be
done by plotting and studying the PCA loadings, which is done in the next chapter.
Because PCA is used to study structural variations in proteins, normalization may be
used in Chapter 3 unless SNV or MSC give better results. The 2nd derivative helps in
resolving the underlying peaks of the spectra to aid in the quantification of the data,
so this method will be used. Not only does normalization help with biological
identification of samples, the relationship of the variables in the spectra can also be
extracted, thereby helping with quantification of the data. As can be seen in Figure
2.4, normalization (range scaling) also keeps the data close to their original form, as
opposed to SNV and MSC, but SNV and MSC centred the data better than normalization. The
data had already been baseline corrected, which MSC and SNV are good for; however, these
three methods will be tested in Chapter 4, which deals with quantitative analysis, as the
performance of a pre-processing method is not easy to assess prior to modelling.(91)



3 PCA Protein Structure Classification

3.1 Introduction

Fourier Transform Infrared (FTIR) spectral data are multivariate (covering hundreds of

wavenumbers), and collinear,(69, 86) meaning that every data point in a protein

spectrum is highly influenced by the next data point due to overlapping absorption

bands of biomolecules caused by spectral shift.(117, 118) In addition to the spectra of
protein samples being heavily influenced by the covariance between biological
components, the differences between FTIR spectra are often subtle, making it potentially
difficult to distinguish proteins from one another, particularly when they have similar
structures. To be able to extract meaningful information from these spectra,

multivariate analysis can be applied to correctly classify the spectra based on the

differences in their structural content (119) and also to identify the spectral features that

contribute the most to the spectral differentiation.(120, 121) The usefulness of the

identification of protein structural features and their FTIR marker bands is in the

determination of the structure and the quality (aggregation, degradation, impurities, etc.)

of the protein samples, and also for the identification of unknown samples.

The main aim of this chapter is to explore the existing protein classes in terms of their
secondary structure groups and tertiary folds using spectra measured with ATR-FTIR
spectroscopy (spectral measurements and pre-processing are explained in Chapter 2) and
to create a model with which to qualitatively analyse protein samples to determine their
structures; this is made possible by applying Principal Component Analysis (PCA).

PROTEIN STRUCTURE CLASSIFICATION

74

The two widely used protein classification systems are CATH(17) (Class, Architecture,
Topology, Homologous superfamily) and SCOP(18) (Structural Classification of Proteins);
the classification assignment used by both systems is based mainly on X-ray
crystallographic measurements. CATH divides proteins into 4 classes (mainly α, mainly β,
αβ, and few secondary structures), while SCOP divides them into 11 protein classes, of
which 10 were covered by the protein selection in this dataset. These classes are: all
α, all β, α/β, α+β, multi-domain (αβ), membrane and cell surface, small, coiled-coil,
peptides, and designed proteins. Low resolution is the 11th class, but it is not covered
in this study because no protein of this class was investigated in-house. In this
chapter, the PCA results for each protein in our dataset were compared to the
assignments of SCOP, the gold standard for structure comparison, to determine their
class distributions based on their FTIR spectra. SCOP also provides the domain
definitions used in this research for protein structure identification.

The PCA method has been previously described in Section 1.3.1 as a multivariate

statistical technique used to reduce data dimensions and to objectively explore the

data.(67) The fewer variables in the FTIR data after reduction of redundant information

make it easier for the data to be explored graphically and statistically; PCA eliminates

data that are not useful in differentiating the protein FTIR spectra, and classifies them as

noise.(68, 122) We have used pre-processing methods, particularly normalization, as

described in Section 2 to remove some linear correlations in the data before PCA

classification of the FTIR dataset. Without normalization of the spectra, the most

intense spectral bands become dominant.(91, 123, 124)

PCA was applied to the Amide I and Amide II regions of ATR-FTIR protein spectra measured
in H2O to perform data reduction, to look for variations in the data, and to group the
spectra according to their similarities or differences in structure, which can be seen
by analysing the plots of principal component scores and loadings.(125) Protein
structures were therefore identified based on the similarity of the principal component
scores of their IR spectra.

3.2 Dataset used for this method

The Amide I region used for this analysis has a total of 101 data points for each
spectrum, in the range 1700 cm⁻¹ to 1600 cm⁻¹, spaced 1 cm⁻¹ apart. The measured
ATR-FTIR spectra were put into matrix form; each row of the matrix corresponds to a
spectrum and each column is the absorbance at a different wavenumber.(126) The Amide II
region, from 1600 cm⁻¹ to 1480 cm⁻¹, contains 121 data points. In all, there are 84
protein spectra in this dataset, which are listed in Table A2 in Appendix A. The dataset
is summarized below:

Amide I region: 84x101 matrix of soluble Protein Spectra Data in H2O

Amide II region: 84x121 matrix of soluble Protein Spectra Data in H2O
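The matrix dimensions follow from counting both end points of each wavenumber range at 1 cm⁻¹ spacing. The short Python sketch below checks this; the helper name `wavenumber_axis` and the zero-filled placeholder matrices are purely illustrative and contain no measured data.

```python
def wavenumber_axis(start, stop, step=1):
    """Wavenumbers from start down to stop, inclusive, in cm^-1."""
    return list(range(start, stop - 1, -step))


amide_i_axis = wavenumber_axis(1700, 1600)   # 101 points
amide_ii_axis = wavenumber_axis(1600, 1480)  # 121 points
n_proteins = 84

# Placeholder matrices: one row per spectrum, one column per wavenumber.
amide_i_matrix = [[0.0] * len(amide_i_axis) for _ in range(n_proteins)]
amide_ii_matrix = [[0.0] * len(amide_ii_axis) for _ in range(n_proteins)]
```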

3.3 PCA Model

Sample preparation and ATR-FTIR spectral collection for this study are described in
Sections 2.2 and 2.3. Data points in the ATR-FTIR spectra, as described in Chapter 2,
were first normalized and then autoscaled using the Standard Normal Variate (SNV)
approach. This scaling technique centres each spectrum by subtracting the mean value of
the spectrum from each of its values, which gives the spectrum a new mean of 0; SNV then
scales each spectrum by its standard deviation, which gives the spectrum a variance of
1. This process is described under ‘Data Processing’ in Section 1.3.2.2 and in Section
2.4.2.

Bands in the Amide I region mostly originate from C=O stretching vibrations and those
in the Amide II region originate mostly from NH bending vibrations. PCA is a pattern
recognition method which does not require a priori knowledge about the protein
structure; it was applied to the data for the Amide I and Amide II regions of the 84
proteins in the dataset in order to decompose the FTIR data matrix (X) into scores (T)
and loadings (P). The columns of P are the eigenvectors of the calculated covariance
matrix ($X^{T}X$) of X and form the new coordinate axes.(127) The scores, T, are the new
values of the protein samples projected onto each eigenvector. The principal components
(PCs) are the eigenvectors arranged according to the amount of variance they describe
(their non-zero eigenvalues). The sum of the component products, $TP^{T}$, represents
the ATR-FTIR data matrix, X. The PCs produced by PCA are ordered by importance to the
data; that is, PC1 has the highest amount of variation, followed by PC2, etc.(126, 128,
129) Principal Component Analysis is covered in the introduction under ‘Multivariate
Analysis’ (Section 1.3.1).
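The decomposition described above can be illustrated for the first principal component with the Python sketch below. It is a sketch under stated assumptions rather than the Matlab routine used in this work: the name `first_principal_component` is hypothetical, power iteration is used in place of a full eigen-decomposition, and only PC1 is extracted.

```python
import math


def first_principal_component(X, n_iter=500):
    """Return (loading p1, scores t1, fraction of variance explained)."""
    n, d = len(X), len(X[0])
    # Column-centre the data so the origin is the centre of the data.
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # Covariance matrix of the centred data (d x d).
    C = [[sum(Xc[k][i] * Xc[k][j] for k in range(n)) / (n - 1)
          for j in range(d)] for i in range(d)]
    # Power iteration converges to the eigenvector with the largest
    # eigenvalue, provided the start vector is not orthogonal to it.
    p = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        q = [sum(C[i][j] * p[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in q))
        if norm == 0.0:
            raise ValueError("data have no variance along the start vector")
        p = [v / norm for v in q]
    eigenvalue = sum(p[i] * sum(C[i][j] * p[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(C[i][i] for i in range(d))
    # Scores: projection of each centred sample onto the loading vector.
    scores = [sum(row[j] * p[j] for j in range(d)) for row in Xc]
    return p, scores, eigenvalue / total_variance
```

Further components could be obtained by subtracting the outer product of the scores and loading from the centred data and repeating the iteration; a numerical library would normally do this in one call.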

PCA returns the result of the decomposition as a matrix of normalized eigenvectors stored in columns (the loadings, P) and a matrix of scores (T) with one row per sample. The loadings of the principal components highlight the information within the spectra that is relevant for differentiation; loadings with no significance are excluded as noise. The first PC, ($t_1$, $p_1^{T}$), describes the maximum variation in the protein samples; the second principal component is orthogonal to the first and describes the maximum of the variation not described by the first PC, and so on.(127, 130, 131) Each of these components can be expressed in terms of the original variables as:

$$y_1 = b_1 x_1 + b_2 x_2 + \dots + b_p x_p \qquad \text{Eq. (3.1)}$$

where: y represents the principal components, b represents the calculated coefficients, x represents columns of the original measured data and p represents the number of dimensions.(132) The number of PCs needed to describe the variation in the protein dataset was selected from a plot of the PCs against the percentage of variance explained, with PC1 explaining the most. In this analysis, the PCs that explained the highest variance in the data were considered for the study of the secondary structure information of the 84 measured proteins. The chosen PCs were also used in plotting the scores and loadings for the inspection of peaks influential in distinguishing the protein spectra. The scatter plots of the PC scores were compared with the loading plot of the same dataset to view the influence of spectral wavenumbers on the dataset groupings.

3.4 Model Validation

An unsupervised learning method like PCA does not have a reference dataset, as is

found in supervised learning (133). To validate findings from PCA, the dataset can be

split in two before PCA and their results compared for similar samples and groups. If

the scores and the loadings for the two halves are comparable, then the model is valid.

However, in our case, there are also the SCOP structural assignments of these proteins

which can be used to validate results.
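The split-half check described above can be sketched as follows. The example is a minimal illustration on synthetic data, not the validation carried out in this work: the function names, the random split, the single Gaussian-shaped band pattern and the use of the absolute cosine between loading vectors as the measure of comparability are all assumptions made for the sketch.

```python
import numpy as np

def first_loadings(X, n_components):
    """Return the first PCA loading vectors (unit length) of a data matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components].T

def split_half_similarity(X, n_components=1, seed=0):
    """Run PCA on two random halves and compare the matching loadings."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    half = X.shape[0] // 2
    P_a = first_loadings(X[order[:half]], n_components)
    P_b = first_loadings(X[order[half:]], n_components)
    # Loadings are unit vectors, so the dot product is the cosine of the angle
    # between them; the absolute value removes the arbitrary sign of a PC.
    return np.abs(np.sum(P_a * P_b, axis=0))

# Synthetic data: 84 "spectra" driven by one band pattern plus small noise.
rng = np.random.default_rng(1)
wavenumbers = np.arange(1600, 1701)
pattern = (np.exp(-((wavenumbers - 1630) / 8.0) ** 2)
           - np.exp(-((wavenumbers - 1657) / 8.0) ** 2))
X = np.outer(rng.normal(size=84), pattern) + 0.01 * rng.normal(size=(84, 101))
sim = split_half_similarity(X, n_components=1)
```

A similarity close to 1 indicates that both halves recover essentially the same loading, which is the sense in which the two halves are "comparable" in the paragraph above.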




3.5 Results and Discussion

Amide I

Figure 3.1 shows the Principal Component plot of the SNV scaled spectra. The plot for

the Amide I region shows that the first PC has the most data variation of about 74%, the

second PC has 10%, etc., giving a total of 95.39% of the variations in the data from the

top 4 components.

Figure 3.1. PC plot of scaled data for the Amide I region of 84 spectra showing percentage variance of the first 4 principal components. The blue line in the plot indicates the cumulative percentage variance for all 4 PCs. Axes: principal component and variance explained (%).

Table 3.1 summarizes the eigenvalue of each principal component (calculated as the sum of squares of its scores) and the percentage of the total variance that it explains. The eigenvalues, denoted ‘E’, are calculated as

$$E_a = \sum_{i=1}^{j} t_{ia}^{2} \qquad \text{Eq. (3.2)}$$

where: $E_a$ = sum of squares of scores (eigenvalue) for principal component a, $t_{ia}$ = score of sample i on principal component a, and j = the number of samples.

Table 3.1. PCA analysis for the Amide I region listing the eigenvalues (sum of squares), their percentage variance and the cumulative variance.

Amide I Region

PC    E      %Vi     Cumulative %V
1     7.56   74.32   74.32
2     1.07   10.50   84.82
3     0.82    8.05   92.87
4     0.26    2.52   95.39
5     0.15    1.49   96.89
6     0.11    1.07   97.96
7     0.05    0.44   98.41
8     0.03    0.34   98.74
9     0.03    0.32   99.06
10    0.02    0.23   99.29

PC = principal component number

E = eigenvalue

%Vi = eigenvalue presented as a percentage

Cumulative %V = accumulated percentage of V
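As a worked check of the cumulative column, the percentages listed in Table 3.1 can be accumulated directly. The sketch below uses only the tabulated %Vi values; the 95% threshold is an illustrative choice for this example and is not a criterion stated in this chapter.

```python
import numpy as np

# %Vi values copied from Table 3.1 (Amide I region, PCs 1 to 10).
percent_variance = np.array([74.32, 10.50, 8.05, 2.52, 1.49,
                             1.07, 0.44, 0.34, 0.32, 0.23])
cumulative = np.cumsum(percent_variance)

# Smallest number of PCs whose cumulative variance reaches the chosen
# (illustrative) threshold of 95%.
n_pcs_95 = int(np.searchsorted(cumulative, 95.0) + 1)
```

The first four entries of `cumulative` reproduce the 95.39% quoted for the top 4 components. For later PCs, differences of about 0.01 to 0.02 from the tabulated cumulative values are to be expected, because the tabulated percentages are themselves rounded.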

As mentioned above, a score is the amount of latent variable for a particular sample

(protein). Each sample has a score value for every principal component. Proteins

grouped together are similar in properties, so analysis of the loading plot helped to

identify the discriminatory wavenumbers for the structural bands of the protein spectra.

Figure 3.2 shows the Amide I data projection onto the first two principal components.

PC1 separates the proteins’ spectral bands in the Amide I region into α-helices on the

left and β-sheets on the right.


Figure 3.2. PCA scores plot for Amide I region (1600 cm-1 – 1700 cm-1) of ATR-FTIR spectra of 84 proteins in H2O (PC1xPC2). Plot shows the clustering of scores. Blue dots are primarily α proteins, red dots are primarily β proteins, green dots indicate proteins of other types. Axes: First Principal Component and Second Principal Component; points are labelled with protein names.



Figure 3.3. Loading plot of the Amide I (1600 cm-1 – 1700 cm-1) region for spectra of 84 proteins in H2O. This plot shows the clustering of scores. Wavenumbers of spectral signals indicate the discriminating features associated with clustering of scores in Figure 3.2. Axes: loading for PC1 and loading for PC2; labelled wavenumbers: 1610, 1630, 1635, 1644, 1652, 1657, 1658, 1659, 1660, 1674, 1692, 1700.

Figure 3.4. PC1, PC2, and PC3 loadings plot of Amide I (1600 cm-1 – 1700 cm-1) region for spectra of 84 proteins in H2O (PC1xPC2). The wavenumbers listed indicate the discriminating features associated with clustering of scores in Figure 3.2. Both Figure 3.3 and Figure 3.4 were combined in the analysis of the scores plot (Figure 3.2). Axes: wavenumber (cm-1) and intensity; labelled wavenumbers: 1609, 1622, 1630, 1635, 1642, 1652, 1657, 1662, 1674, 1688, 1693.

PC2 shows mostly protein spectra of mixed types (proteins that are not predominantly α-helical or mostly β-sheet) clustered in the middle. The points to the far left or far right of the plot are more important in the clustering of the different groups; the points clustered near (0,0) are not as involved in the segregation of the data. For example, collagen, which lies on the alpha side but closer to the (0,0) origin, is assigned by SCOP as a ‘designed protein’ and does not have any secondary structure assignment by PDB because it does not contain any α-helix or β-sheet. Collagen’s location closer to the middle of the score plot in Figure 3.2 shows that it makes little contribution to the segregation of the structural types. Proteins such as bacteriorhodopsin (PDBID: 2BRD), which contains 77% helix and 21% unordered structure (refer to Table A2 in ‘Appendix A’ for a list of protein structural compositions as defined by DSSP), and myoglobin (PDBID: 1NZ2), with 74% helix and 14% unordered, are positioned farther away from the origin, while proteins with a smaller percentage of helix lie closer to the origin. The β-sheet rich proteins on the right hand side of the plot (shown in red) are positioned the same way; proteins with a lower percentage of β-sheet lie closer to the middle of the plot.

To establish what factors are responsible for the separation of these structural classes, the loading plot (PC1 versus PC2) is employed. The loading plot and the loading spectral plot in Figures 3.3 and 3.4 show the most important features of the dataset; the most dominant wavenumbers for PC1 in the Amide I region (Figure 3.3) are 1630 cm-1 on the far right (positive side of PC1) and 1657 cm-1 on the far left (negative side of PC1). These findings are in agreement with the vibrational modes assigned in the literature to α-helix and β-sheets.(134, 135) There is also a broad peak between 1689 and 1697 cm-1, which may indicate anti-parallel β-sheet, as this has been reported in the literature as registering between these frequencies.(136) The factors are arranged so that when one factor is high for an observation, the other factor is low for the same observation (135); therefore scores with more α-helical structure lie opposite to those scores with more β-sheet structure. The principal components distinguish between observations that have high and low frequency value contributions (numbers above and below zero). The results reported here are in good agreement with literature assignments for the FTIR bands for both α-helix and β-sheets.(50)
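Reading the dominant wavenumbers from a loading vector can be sketched as follows. The loading used here is synthetic: two Gaussian-shaped bands are placed at 1630 cm-1 (positive) and 1657 cm-1 (negative) purely to mimic the situation described above, and the function name `dominant_wavenumbers` is illustrative rather than part of the analysis software.

```python
import numpy as np

def dominant_wavenumbers(wavenumbers, loading, n=1):
    """Return the n most positive and the n most negative loading positions."""
    order = np.argsort(loading)
    most_positive = wavenumbers[order[-n:][::-1]]
    most_negative = wavenumbers[order[:n]]
    return most_positive, most_negative

# Synthetic PC1 loading over the Amide I region (1 cm-1 spacing, assumed).
wavenumbers = np.arange(1600, 1701)
loading = (np.exp(-((wavenumbers - 1630) / 8.0) ** 2)
           - np.exp(-((wavenumbers - 1657) / 8.0) ** 2))

positive, negative = dominant_wavenumbers(wavenumbers, loading)
```

In this synthetic case the most positive loading falls at 1630 cm-1 and the most negative at 1657 cm-1; with measured data, the same operation identifies the wavenumbers at the extremes of the loading plot.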

As mentioned earlier, the score plot (Figure 3.2) shows that PC2 places proteins with approximately equal amounts of α-helix and β-sheet closer to the origin and those with more α-helix or more β-sheet farther away. A loading spectra plot is an alternative way of identifying the most significant peaks, which may not be readily visible from the loading plot alone. PC2 on the loading spectra plot shows a pronounced peak at 1635 cm-1 (in agreement with the spectrum of PC1 in that region for β-sheets) and around 1672 - 1675 cm-1, which is mainly associated with β-turns.(50, 117) PC2 also detected a band near 1653 cm-1 on the farthest negative side of the plot. Scores of PC2 were investigated further as shown in Figure 3.5.


Figure 3.5. PCA scores of the Amide I region of ATR-FTIR spectra of proteins in H2O. Plot highlights the clusters of mixed proteins; αβ proteins (multi-domain) as green dots, α+β proteins (mainly anti-parallel) as red stars, α/β proteins (mainly parallel) as black stars. The small unlabelled blue and red dots are α-helix and β-sheet proteins highlighted in Figure 3.2. Axis label recovered: First Principal Component; points are labelled with protein names.

Additional analysis was made using Figure 3.5, and it was observed that PC2 does separate the proteins assigned to the α+β class (mainly anti-parallel β-sheets, shown with red stars) from the α/β class (mainly parallel β-sheets, represented by black stars), which is located below the origin (below zero). The αβ multi-domain class proteins shown in green are clustered more tightly around the middle of the plot. Although we can see significant spectral features in the loading plot of Figure 3.3 at 1635 cm-1 (upper right) and 1652 cm-1 (lower left), it is the sharp peaks shown in the PC2 loading spectra plot of Figure 3.4 that re-emphasize the contribution of these wavenumbers in the discrimination of the different structural origins of bands within this region. The contribution of the broad band at 1674 cm-1 to any particular protein structure was not completely clear; this band is mainly associated with β-turns.(13)

Figure 3.6. Scaled ATR-FTIR spectra of the Amide I region for ribonuclease S in red (which contains mainly anti-parallel β-sheet), transketolase in blue (mainly parallel β-sheet), and lactic dehydrogenase in green (contains both parallel and anti-parallel β-sheet). Axes: wavenumber (cm-1) and absorbance.

Figure 3.7. PC2 loading spectra plot of the ATR-FTIR Amide I region for 84 proteins in H2O. Axes: wavenumber (cm-1) and intensity; labelled wavenumbers: 1609, 1635, 1652, 1674.

Ribonuclease S, represented in red in the above plot (Figure 3.6), is an anti-parallel β-sheet protein that shows a marker band around the 1635 cm-1 region. It has 33% β-sheet, of which almost all, 31%, is anti-parallel and 1% is mixed-β as assigned by DSSP. Amide I marker bands of anti-parallel β-sheet proteins at ~1612 cm-1, 1640 cm-1 and 1690 cm-1 (weak) were reported by Pelton and McLean (117, 137), and the loading peak frequencies shown in the PC2 plot (Figure 3.7) are similar to those frequencies. Transketolase, in blue, on the other hand has 2% anti-parallel and 11% parallel β-sheet and shows a strong band between 1652 - 1655 cm-1, in agreement with the loading plot; parallel β-sheet frequencies have been reported to span from 1624 cm-1 to 1648 cm-1.(50) To highlight the effectiveness of the loading plot, the FTIR spectrum for lactic dehydrogenase is shown in green; this protein shows band intensity at ~1651 - 1653 cm-1, in agreement with the assignment of bands at these frequencies to parallel β-sheet, and another band at about 1635 cm-1 - 1640 cm-1, around the anti-parallel β-sheet marker;(23) lactic dehydrogenase contains 8% parallel and 11% anti-parallel β-sheet as assigned by DSSP. Literature reports on mixed proteins (including Navea et al.(127) and Dong et al.(136)) have stated that they do not exhibit a clear spectral feature; however, from our dataset a good distinction can be seen in the spectral plot in Figure 3.6, with the anti-parallel and parallel β-sheet structures being clearly separated by their marker bands of 1635 cm-1 and 1652 cm-1, respectively.


Figure 3.8. Biplot (PC scores and loadings) of Amide I region; blue lines show the length of each loading and their directions. Red dots are the protein sample scores. Axes: PC1 and PC2; loadings are labelled with wavenumbers from 1600 to 1700 cm-1.

The biplot, a plot of PCA scores superimposed on the loadings, is another way of viewing both loadings and scores on the same plot; it reveals clusters and aids the interpretation of PCA plots. In this biplot (Figure 3.8), the red dots represent the spectral scores of the protein samples and the blue arrows represent the corresponding loadings. In a biplot, the length of each blue arrow approximates the variance of the feature it represents. The longer lines have higher variance in their data;(92) these long lines are the features with the largest spread in the data. The shorter lines closer to zero indicate that the values of the features are similar. Bands from 1626 to 1634 cm-1 seem to contribute more to the β-sheet rich proteins, with the longest length at 1631 cm-1; bands from 1650 cm-1 to 1660 cm-1 (1657 cm-1 shows more significance) are contributing factors for α-helical rich proteins. As can be seen in the loading plot of Figure 3.3, frequencies between 1670 - 1674 cm-1, with 1674 cm-1 having more emphasis (longer length), contribute strongly to proteins not primarily containing significant quantities of α-helix or β-sheet structure. The biplot does not allow both scores and loadings to be labelled together on the same plot; elastin is labelled as a guide to show how the plots are superimposed (see Figure 3.2 for more score labels).

Amide II

The Amide II PC plot (Figure 3.9) shows, as expected, the first PC as having 64% of the variation in the data. Figure 3.10 shows α-helix on the lower left, β-sheet on the upper right of the figure and mixed proteins clustered in the middle of the plot. On close inspection, it appears that it is PC2, and not PC1, in the loading plot (Figure 3.11) that shows a strong peak correlation for α-helix at 1546 cm-1 and for β-sheet at 1523 cm-1. PC1 shows a spectral region on the right that spans from 1515 - 1526 cm-1 (positive side of PC1) and one on the left from 1555 - 1576 cm-1 (negative side of PC1) of the loading plot. Although PC2 (from Table 3.2) shows a percentage variance of only 17%, the discriminating features of the Amide II PC2 loading are in agreement with literature reports on the FTIR marker bands for α-helix and β-sheet in the Amide II region of FTIR spectra.(138) This can be further viewed in the loading spectra plot (Figure 3.12) below.

Figure 3.9. PC plot of Scaled Amide II region of 84 spectra showing percentage variance

of the first 4 principal components. The blue curve indicates the cumulative variation of all

4 components



Table 3.2. PCA table showing the eigenvalues (sum of squares), their percentage variance and the cumulative variance for the Amide II region.

Amide II Region

PC    E      %Vi     Cumulative %V
1     5.26   64.32   64.32
2     1.37   16.78   81.10
3     0.94   11.46   92.56
4     0.23    2.86   95.42
5     0.16    1.93   97.35
6     0.07    0.91   98.27
7     0.03    0.43   98.69
8     0.02    0.27   98.97
9     0.02    0.23   99.20
10    0.02    0.19   99.39

PC = principal component number

E = sum of squares of scores or eigenvalues per PC

%Vi = eigenvalue presented as a percentage

Cumulative %V = accumulated percentage of V


Figure 3.10. Scores plot for Amide II region (1480 cm-1 – 1600 cm-1) of spectra of proteins in H2O (PC1xPC2). Plot shows the clustering of scores. Blue dots are primarily α-helix rich proteins, red dots are primarily β-sheet rich proteins, and green dots indicate proteins of other structural types. Axis label recovered: First Principal Component; points are labelled with protein names.

Figure 3.11. Loading plot of the Amide II (1480 – 1600 cm-1) region of spectra for 84 proteins in H2O (PC1xPC2). Plot shows the clustering of scores. Wavenumbers in the plot, especially the extreme right and left numbers, indicate the discriminating wavenumbers associated with clustering of scores in Figure 3.10. Axes: loading for PC1 and loading for PC2; labelled wavenumbers: 1486, 1506, 1523, 1541, 1547, 1550, 1551, 1553, 1555, 1567, 1569, 1572, 1583.

Figure 3.12. PC1 and PC2 loadings spectra plot of Amide II (1480 cm-1 – 1600 cm-1) region of spectra of 84 proteins in H2O. Wavenumbers in the plot indicate the discriminating wavenumbers associated with clustering of scores in Figure 3.10. Both Figure 3.11 and Figure 3.12 were combined in the analysis of the score plot. Axes: wavenumber (cm-1) and intensity; labelled wavenumbers: 1488, 1522, 1526, 1546, 1567, 1571.

3.6 Conclusion

ATR-FTIR spectroscopy is a useful tool for the analysis of protein structure. Protein amide vibrations are intrinsic to the backbone of the protein’s secondary structure. The exact band position in the FTIR spectra is determined by the backbone conformation and the hydrogen bonding pattern involved.(139) Another major contributing factor to the sensitivity of the amide vibrations to secondary structure is the transition dipole coupling of the amide groups (the N-H bond) and their relative orientation and distance from one another; this causes each secondary structure to absorb predominantly in a specific spectral region, helping to distinguish α-helix from β-sheet, etc.(140) The α-helix in the Amide I absorption band is known to show up between 1655-1648 cm-1 while β-sheet shows a strong band between 1648-1620 cm-1.(50) However, the assignment of a band to a secondary structure type is not straightforward because the bands largely overlap and a single broad band is usually observed in the Amide I region; the component bands can be determined computationally by methods like curve fitting and multivariate analysis.

The analysis of the distribution of spectral signatures requires a multivariate tool like Principal Component Analysis (PCA).(119) Although protein spectra often look similar, PCA clearly discriminates even structurally similar samples from each other, as was shown in the score plots. In our analysis we have investigated the α+β and α/β SCOP classes, which are usually not considered for protein classification, and the results obtained were significant.

PCA, an unsupervised pattern recognition technique, is a well-known method of

dimension reduction and data exploration. The basic principle of PCA is to reduce the

dimensionality of a data set, while retaining the variations present in the original

predictor variables (141). Due to the reduction in data, it is possible to graphically



explore the data correlation and understand the biological differences between protein

samples. Understanding these structural differences in proteins is important for the

proper classification of their biological differences.(142-144) With the help of methods

like PCA, analyses and visualization of the distribution of domains in a three

dimensional structure can be performed, so as to retrieve the important factors related to

protein function.(19, 143, 145)

The structural information in the protein spectra was retrieved primarily from the Amide I bands, which arise chiefly from the C=O stretching vibrations between 1600 cm-1 and 1700 cm-1. This region is used by spectroscopists for the study of proteins because it contains a strong band whose frequency is correlated to the secondary structure of proteins.(134) The Amide I region has been used extensively for the study of protein secondary structure because it presents the vibrational modes of different secondary structures (α-helix, β-sheets, β-turns, coils etc.).(127) We have shown that ATR-FTIR spectroscopy and PCA can be combined successfully in the classification of protein structure. In order to determine the identity of proteins, the score and the loading plots of PCA were employed, as shown in Figure 3.2. The plot of PC1 versus PC2 revealed clusters of α-helix, β-sheet, α/β and α+β proteins. The loading plot and loading spectra plots indicate the most useful peaks for distinguishing these structural regions; that is, the influence of the wavenumbers on each spectrum makes it possible to identify the different classes and patterns of secondary structures in the protein samples.

It can be shown that PCA loadings of ATR-FTIR spectra provide a basis for the underlying differences between the identified protein structures (119, 146, 147), making it possible to disclose areas of the FTIR spectra that contribute the most to the variance of the data in the Amide I and Amide II regions. The score plots provided by this PCA can be used as input to other classification methods; they can also be used to check the structural grouping of unknown proteins before applying an appropriate model to quantitatively determine their secondary structure compositions.(127, 147-149) Although PCA explores the structural relationships between data, it cannot be used for the quantitative analysis of data; for this, a different method called Partial Least Squares is introduced in Chapter 4.

Table 3.3 summarizes the α-helix and β-sheet bands reported in the literature and how the results from this analysis compare. Except for α/β, the other structure types are well within the wavenumber ranges reported in the literature.

Table 3.3. FTIR marker bands of protein classes in the Amide I region (infrared frequencies in cm-1).

Sec. Struct. Assign   Arrondo, Goni(50)         Pelton, McLean(137)       This Work
α-helix               1655-1648                 1655-1650                 1657
β-sheet               1648-1620; 1693 (weak)    1640-1612; 1685 (weak)    1630
α+β                   1690; 1630                1640-1612; 1690-1670      1635; 1674
α/β                   1648; 1630                1640-1626                 1653



4 PLS Regression for Protein Secondary Structure Determination

4.1 Introduction

High-quality Fourier-Transform Infrared (FTIR) spectra have been used by many spectroscopists for the analysis of protein secondary structure.(42, 150-154) FTIR spectroscopy has become a powerful tool for examining the conformation of proteins in H2O as well as in D2O.(38) FTIR can rapidly obtain the biochemical fingerprint of samples under study and give significant information on their biomolecule contents.(42, 155) Section 1.2 explained the importance of FTIR spectroscopy in more detail. This section will address the quantitative determination of the secondary structural information present in the infrared spectra of proteins. To determine the protein secondary structures from their spectra, multivariate regression methods such as Partial Least Squares (PLS) and Principal Component Regression (PCR) were applied to the dataset of 84 FTIR protein spectra, with 16 proteins also measured in D2O; the spectra were collected and pre-processed as described in the ‘Pre-processing Method’ in Chapter 2.

The Amide I region of the FTIR spectra is normally used in the analysis of protein secondary structure because the frequencies of the Amide I modes, which are due to C=O stretching vibration, are known to correlate closely with a protein’s secondary structure elements;(13, 156) however, the Amide I bands of proteins display extensive overlaps of the underlying component bands of α-helix, β-sheet, β-turns and coils, which are not instrumentally resolvable because they lie too close to each other.(36) Therefore, mathematical methods, such as second derivatives, are often used to enhance and resolve the individual band components corresponding to specific secondary structures.(13) The application of second derivatives and the pre-processing of the protein spectra were described in Chapter 2.

The Amide I band components can be assigned by studying their frequency behaviour in proteins whose secondary structure is known from other techniques such as X-ray crystallography.(157) A calibration model is built from a set of samples whose structural content is known from X-ray crystallography, and the resulting model is used to predict the content of new, unknown samples from their spectra. In this chapter, PLS/PCR multivariate calibration (using many variables simultaneously) was used to quantify new variables ‘y’ from the matrix of the FTIR measured protein spectra X, via a mathematical function of some kind.(89) PLS and PCR address the near multi-collinearity often found in spectral measurements. This phenomenon, explained in Chapter 1, occurs when two or more independent (predictor) variables are correlated and so provide redundant information to the model. FTIR spectra are multivariate and collinear in nature; hence the need for dimension reduction, which is achieved by PLS regression. The capital letter ‘X’ is used to represent the entire matrix of protein spectra while capital ‘Y’ represents a matrix of the dependent set, that is, the entire set of secondary structure motifs from X-ray crystallography. Small letters ‘x’ and ‘y’ are used to represent a single vector from these matrices.

The PCR method is based on the basic concept of principal component analysis (PCA). PCR decomposes the data into loading and score variables that are used in building a model. For PCR, the estimated scores matrix consists of the most dominant principal components of X. These components are linear combinations of the X measurements determined by their ability to account for the variability in X.(158) The first principal component is computed as the linear combination of the original X-variables with the highest possible variance; PCR uses only the X variables for the decomposition, without employing the y variable. PLS regression, on the other hand, is similar to PCR but can give good prediction results with fewer components because the response variable is employed in the decomposition as well.(159) A consequence of this is that the number of components needed for interpreting the information in X (measured spectra) which is related to ‘y’ (reference secondary structure) is smaller for PLS than for PCR. This may, in some cases, lead to simpler interpretation. Using the optimal number of components in each case, however, the two methods often give comparable prediction results.(13) For detailed information on PLS methodology, see Chapter 1, Multivariate Regression section.
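The difference between the two methods can be illustrated with a minimal sketch in NumPy. PCR is written here as a regression on the leading PCA scores, and PLS as the standard NIPALS algorithm for a single response (PLS1). Both are textbook formulations assumed for illustration; the function names, the synthetic data and the coefficients are hypothetical, and the sketch is not the implementation used for the results in this chapter.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal component regression: regress y on the leading PCA scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]     # scores, built from X only
    gamma = (T.T @ yc) / s[:n_components] ** 2     # regression of y on the scores
    beta = Vt[:n_components].T @ gamma             # coefficients in the X space
    return beta, y_mean - x_mean @ beta

def pls1_fit(X, y, n_components):
    """PLS1 by NIPALS: each component is chosen using both X and y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_components))
    P = np.zeros((n_vars, n_components))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = Xk.T @ yk                              # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                                 # scores for this component
        tt = t @ t
        P[:, a] = Xk.T @ t / tt                    # X loadings
        q[a] = (yk @ t) / tt                       # y loading
        W[:, a] = w
        Xk = Xk - np.outer(t, P[:, a])             # deflate X
        yk = yk - q[a] * t                         # deflate y
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta

# Synthetic example: 30 samples, 4 predictors, one response with small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

beta_pcr, b0_pcr = pcr_fit(X, y, n_components=4)
beta_pls, b0_pls = pls1_fit(X, y, n_components=4)
beta_ols = np.linalg.lstsq(X - X.mean(axis=0), y - y.mean(), rcond=None)[0]
```

When all components are retained, both methods reduce to ordinary least squares, which is a convenient check of the sketch. With fewer components the two differ: the PCR scores are chosen to explain the variance of X alone, whereas each PLS component is chosen for its covariance with y.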

D2O has lower absorption around 1643-1650 cm-1 in the Amide I region, which is where water absorbs in the spectra of proteins measured in H2O;(117) however, H2O is preferred as a solvent for the study of proteins because, unlike D2O, it does not modify the structure of proteins.(53) In addition, cell fluid has a high water content, and water is less expensive to work with. The proteins in D2O were analysed for their secondary structure content to compare and evaluate the results produced by the regression method employed.

4.2 Protein dataset used for this method

The dataset of proteins used for this study has been described in Chapter 3, which covered PCA, and in Chapter 2, which explained the pre-processing done for this project. The dataset represents a wide range of helix and sheet fractional content values, as well as other structural contents such as 3₁₀-helix, turns and unordered structure, as described and determined by the Dictionary of Protein Secondary Structure (DSSP).(109) The selection of these protein samples was based on criteria such as crystal structure quality, the distribution of their secondary structures, conformational classes as described by the Structural Classification of Proteins (SCOP) system,(18) and protein solubility and stability. The 84 proteins measured in H2O and the 16 selected proteins measured in D2O are listed in Tables 1a-f and 2a-c in Appendix-B. All spectra were pre-treated by means of intensity normalization and SNV (Standard Normal Variate) as described in Chapter 1, section 1.3.
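As a rough illustration of the SNV step mentioned above, the sketch below centres a single spectrum on its own mean and scales it by its own standard deviation. It is written in Python rather than the MATLAB used for this work; the function name `snv` and the use of the sample (n-1) standard deviation are assumptions made for the example, not details taken from this thesis.

```python
import statistics


def snv(spectrum):
    """Standard Normal Variate: centre one spectrum and scale it to unit spread."""
    mean = statistics.mean(spectrum)
    # Sample standard deviation (n - 1 divisor); this choice is an assumption.
    spread = statistics.stdev(spectrum)
    if spread == 0:
        raise ValueError("a constant spectrum cannot be SNV-scaled")
    return [(value - mean) / spread for value in spectrum]


# Hypothetical three-point "spectrum", for illustration only.
print(snv([1.0, 2.0, 3.0]))
```

Each spectrum is treated independently, so the transform does not depend on which other proteins are in the dataset.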

4.2.1 Data Selection Scheme

Models generated from the PLS/PCR methods were used for the prediction of the structural compositions of new samples; therefore, as many as possible of the biochemical compositions and types of protein structure that could be encountered later should be represented in the calibration set.(160) From the ‘X’ and ‘y’ matrices of the proteins in H2O, 60 spectra were used to develop the partial least squares regression model. The remaining 24 were set aside as the independent test set for assessing predictive accuracy. In addition, 20 samples of the training set were included in the test set for validation accuracy, giving 44 samples in the test set in total. The selection was made with the Kennard-Stone (KS) algorithm,(161) a well-known method for selecting a sample subset for calibration. It chooses objects that are uniformly distributed in the X-matrix (the measured spectra): the first sample assigned to the calibration set is the one closest to the mean of the entire sample set, and each subsequent sample is chosen on the basis of its squared distance to the samples already assigned,(162) with the sample furthest from those already selected being added to the calibration set, and so on. In this way, the variation in the dataset is captured up front. The algorithm was set to select the first 60 samples representing the dataset, with the rest left for validation. The calibration model was constructed and subsequently validated. Figure 4.1 represents the KS selections split into calibration set and test set.
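The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this work: the function name `kennard_stone` is hypothetical, the first sample is taken as the one closest to the mean (as described in the text), and each later pick is the candidate whose smallest squared Euclidean distance to the samples already chosen is largest.

```python
def kennard_stone(X, n_select):
    """Return indices of n_select rows of X picked by a Kennard-Stone style rule."""
    n = len(X)
    if not 0 < n_select <= n:
        raise ValueError("n_select must be between 1 and the number of samples")
    dim = len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(dim)]

    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Start from the sample closest to the mean of all samples.
    first = min(range(n), key=lambda i: sqdist(X[i], mean))
    selected = [first]
    # min_d[i] = squared distance from sample i to its nearest selected sample.
    min_d = [sqdist(X[i], X[first]) for i in range(n)]
    while len(selected) < n_select:
        candidates = [i for i in range(n) if i not in selected]
        nxt = max(candidates, key=lambda i: min_d[i])
        selected.append(nxt)
        for i in range(n):
            min_d[i] = min(min_d[i], sqdist(X[i], X[nxt]))
    return selected


# Hypothetical one-variable "spectra", for illustration only.
demo = [[0.0], [1.0], [2.0], [3.0], [10.0]]
print(kennard_stone(demo, 3))  # -> [3, 4, 0]
```

In the toy data the point nearest the mean (3.0) is chosen first, followed by the two extremes, which is the behaviour that lets the calibration set span the variation in the data.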

Figure 4.1. Schematic representation of samples used in the training set and the test set after application of the Kennard-Stone selection algorithm. Matrix X represents the FTIR spectra while vector ‘y’ represents the percentage fraction of structure determined by DSSP analysis from X-ray crystallographic data. This figure represents the strategy for the 84 protein samples measured in H2O.

[Figure 4.1 labels: Matrix X and Vector y, samples 1 to 84; 40 proteins for calibration set; 20 used in both; 24 proteins to test set; 60 proteins (calibration); 44 proteins (test).]

4.3 PLS Modelling

The PLS model is a mathematical representation of a complicated system that holds within it the information of the contents of that system,(89, 155, 163, 164) in this case, protein secondary structure. Sample preparation and ATR-FTIR spectral collection for this experiment are described in sections 2.2 and 2.3. Data points in the ATR-FTIR spectra, as described in section 3.2, were first normalized before the second derivative was calculated in order to enhance the separation of the underlying bands. As in the PCA section, the Amide I region, whose bands mostly originate from C=O stretching vibrations, and the Amide II region, mostly arising from NH bending vibrations, were used for this analysis. The Amide III region was not recorded in many of the protein spectra used for this study; therefore, this region was not analysed here.

The FTIR spectra were arranged in a matrix of 84 rows, one per measured protein spectrum, and 101 columns, one per spectral wavenumber in the Amide I region; a separate matrix with 124 columns was built for the analysis of the Amide II region. The PLS algorithm was written in MATLAB R2010 using MATLAB's PLS regression function. PLS requires a priori knowledge about the protein structural groups. The fractions of the proteins' secondary structure (SS) motifs have been determined by X-ray crystallography, and files of their atomic features obtained from the Protein Data Bank (PDB)(137, 157, 165) were fed into a DSSP script. The output from DSSP contains the proteins' secondary structure information, which was put into a matrix labelled ‘y’. These SS fractions are listed in Table A2, Appendix A.

PLS regression was applied to determine secondary structure contents using the Amide I region (1700-1600 cm⁻¹) and the Amide II region (1600-1480 cm⁻¹) of the spectra separately, and the Amide I & II regions together, in order to compare the resulting secondary structure compositions with those determined by X-ray crystallography. PLS finds the optimal linear combinations of the variables in the FTIR spectra of the protein samples (X variables); these linear combinations, called components but also known as latent variables (LV),(105) were used to determine the regression equation (2.2). The general equation notation is:

Y = Xb + ε        eq. (4.1)

where Y is the matrix of response (dependent) variables, X is the matrix of predictor variables, b contains the coefficients determined during calibration, and ε is the residual error (the difference between the observed and predicted dependent variables).
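To make the role of the latent variables more concrete, the following Python sketch fits a single-response PLS model with a NIPALS-style loop. It is an illustrative assumption about how such a model can be built, not the MATLAB routine used in this work; the names `pls1_fit` and `pls1_predict` and the small data set are hypothetical.

```python
def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (NIPALS-style) on lists of floats."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                 # centred y
    weights, loadings, q_coefs = [], [], []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:
            break  # nothing left in X that co-varies with y
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        for i in range(n):
            for j in range(m):
                E[i][j] -= t[i] * p[j]
            f[i] -= q * t[i]
        weights.append(w)
        loadings.append(p)
        q_coefs.append(q)
    return x_mean, y_mean, weights, loadings, q_coefs


def pls1_predict(model, x):
    """Predict the response for one new sample x."""
    x_mean, y_mean, weights, loadings, q_coefs = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in zip(weights, loadings, q_coefs):
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat


# Hypothetical data in which y = 1 + 2*x1 + 3*x2 exactly.
X_demo = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 4.0]]
y_demo = [9.0, 8.0, 22.0, 18.0, 23.0]
model = pls1_fit(X_demo, y_demo, n_components=2)
print(pls1_predict(model, [6.0, 1.0]))  # close to 16.0
```

With as many components as X has independent columns, the PLS fit coincides with ordinary least squares, which is why the toy example recovers the exact linear relationship; with fewer components the model is a regularised approximation.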

4.4 Results and Discussions

A plot of the percentage of explained variance in y against the number of components, as seen in Figure 4.2, was obtained. The diagnostics from this plot were used to choose the model; that is, for each analysed region of the spectra, the minimum number of components that gave rise to the maximum explained variance was chosen. PLS extracts components in order of their relevance to the structure type in question.(127, 152) From Figure 4.2 and Figure 4.3 for α-helix, we see that after 10 components the gap between points becomes smaller and the curve begins to flatten. Any value between 6 and 10 could have been chosen for the model; validation of the model was done later to address overfitting. Ten components were selected for the analysis of α-helix from the Amide I region, as this gave the maximum explained variance together with a low Root Mean Square Error of Calibration (RMSEC).



Figure 4.2. Explained Y variance of α-helix for Amide I region; explained variance is

about 96% for 10 components.

Figure 4.3. Plot shows Root Mean Square Error (RMSE) vs. number of components for α-

helix of the Amide I region. 10 components gave low root mean square error and was

chosen to fit the data.

[Figure 4.2 plot: y-axis "% Variance Explained in y" (0-100), x-axis "Number of PLS components" (0-15). Figure 4.3 plot: y-axis "Root Mean Square Error" (0.01-0.05), x-axis number of components (0-15).]



Figure 4.4 and Figure 4.5 also show that 10 components for the analysis of β-sheet content from the Amide I region gave the maximum explained variance at a low Root Mean Square Error of Calibration (RMSEC) of 0.020. Beyond this number of components, the decrease in the RMSEC is negligible. It is assumed that this number will fit new samples; however, a cross-validation was later performed to avoid adding unnecessary PLS components to the model. Looking at the RMSEC plot, anywhere between 7 and 10 components would be suitable for providing a good estimate.

Figure 4.4. Explained Y variance of β-sheet for the Amide I region is about 95 % for 10

components.

[Figure 4.4 plot: y-axis "% Variance Explained in y" (40-100), x-axis ticks 0-15.]

Figure 4.5. Root Mean Square Error for β-sheet analysis for the Amide I region.

[Figure 4.5 plot: y-axis "Root Mean Square Error" (0-0.025), x-axis ticks 0-15.]

4.4.1 Model Validation

PLS is sensitive to overfitting; that is, the RMSEC will continue to decrease as additional PLS factors (components) are added to the model, whereas the Root Mean Square Error of Prediction (RMSEP) will eventually begin to increase because of overfitting.(99) For this reason, cross-validation of the model is carried out to ensure that the results obtained do not arise from random noise. The test samples left out of the calibration set were used as the validation set. The predicted ‘y’ values of these new samples were compared with the true ‘y’ values, and the predictive ability of the model was then calculated.

4.4.2 Cross-Validation

Models were cross-validated and an optimal number of latent variables, indicated by a minimum root mean square error of cross-validation (RMSECV), was selected to avoid overfitting.(100-102) For PLS, this error measurement, which is the standard deviation of the unexplained variance of the cross-validation, is a much better criterion than RMSEC because RMSEC does not indicate when the model is overfitted.

Overfitting occurs when too many components are selected to fit the model, as this can lead to much redundancy in the ‘x’ variables, resulting in a higher estimation error due to data dependency. Cross-validation (CV) is used to choose the optimal number of components for the model and to check for overfitting by not reusing the same data for calibration and for estimating the prediction error.(166) With “Leave One Out” cross-validation, one of the n samples at a time is left out, calibration is performed on the remaining n-1 samples, and the error is estimated on the sample left out. The deleted sample is then put back for calibration and another sample is taken out.(100) The samples left out are thus used as test samples. In this work, a “CV” option was given to the MATLAB PLS function with 10 folds (segments) specified. “CV” divides the randomly ordered data into 10 segments, and each segment is left out one at a time. Plots of the validation results are used to visualize the number of components that gives the lowest Root Mean Square Error of Cross-Validation (RMSECV). Figure 4.6 shows that the optimum number of components, 4, was enough to fit the spectra of new samples.

The RMSECV (see eq. 4.2) must be interpreted in a slightly different way to the Root Mean Square Error of Prediction (RMSEP). Note, in particular, that it is not an error estimate of an actual predictor with already computed regression coefficients. Cross-validation provides, instead, an estimate of the average prediction error of calibration equations based on N-1 samples drawn from the actual population of samples.

\[ \mathrm{RMSECV} = \sqrt{\frac{\sum_{i=1}^{n}\left(y^{\mathrm{pred}}_{\mathrm{cv},i} - y^{\mathrm{obs}}_{i}\right)^{2}}{n-1}} \qquad \text{eq. (4.2)} \]

where n is the number of proteins in the dataset, y^pred_cv,i is the cross-validated predicted value for protein i, and y^obs_i is the corresponding observed value from X-ray crystallography.
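The cross-validation error of eq. 4.2 can be sketched generically in Python as below. This is a minimal illustration, not the MATLAB "CV" option itself: the function name `rmsecv`, the `fit` and `predict` callables, and the deterministic interleaved segments (rather than random ones) are all assumptions, and a trivial mean-predicting model stands in for the calibration routine.

```python
import math


def rmsecv(X, y, fit, predict, k):
    """Root mean square error of k-fold cross-validation (n - 1 divisor, as in eq. 4.2)."""
    n = len(y)
    squared_error = 0.0
    for fold in range(k):
        # Interleaved segments: sample i belongs to segment i % k.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            squared_error += (predict(model, X[i]) - y[i]) ** 2
    return math.sqrt(squared_error / (n - 1))


# Placeholder model: always predicts the mean of the training responses.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


y_demo = [1.0, 2.0, 3.0, 4.0]
X_demo = [[0.0]] * 4  # predictors are ignored by the placeholder model
print(rmsecv(X_demo, y_demo, fit_mean, predict_mean, k=4))  # k = n is leave-one-out
```

Repeating the calculation for models with 1, 2, 3, ... components and plotting the results gives an RMSECV curve of the kind used to choose the number of components.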

Figure 4.6. CV cross-validation: only 4 PLS components are required to fit the α-helix model for proteins in H2O for the Amide I region.

[Figure 4.6 plot: y-axis "RMSECV" (0.02-0.16), x-axis ticks 0-20.]

4.4.3 Independent Test Set

Cross-validation is an internal test of the calibration model; therefore, it is always useful to test a model against an independent test set. For a model that has already been established, the prediction ability, and not the fitting ability, indicates how well the model can predict new samples that it has not seen.(167, 168) The best way to perform this validation is by using the 44 samples in the test set not previously used. By predicting the ‘y’ values of the samples in the test set and then comparing the results with the true values, an estimate of the predictive ability of the model is obtained. Validation testing shows that the secondary structure content of 44 proteins is predicted with only 3 components and an RMSEP of 0.034. The Root Mean Square Error of Prediction (RMSEP) of the independent test set is obtained to judge the performance of the calibration model.(99, 169-172) The RMSEP results obtained from this PLS model are shown in Table 4.1 below.

4.4.4 Coefficient of Determination (R2)

The fitted response values are then computed by the PLS model and an R2 value is generated. The R2 value is a measure of how well the regression line fits the measured data points.(173) An R2 of 1 denotes that the regression line fits the data perfectly, and an R2 of 0 means that the model explains none of the variance in the data. This is measured from the ratio of the residual sum of squares (SSR) to the total sum of squares (SST). Eq. 4.3 denotes the formula for R2.

\[ R^{2} = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\sum_{i=1}^{n}\left(y_{i} - y_{\mathrm{fit},i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i} - y_{\mathrm{mean}}\right)^{2}} \qquad \text{eq. (4.3)} \]

Residual values can also be viewed with a residual plot of the estimated and true ‘y’ values. Figure 4.7 shows the residual plot of α-helix in the Amide I region. The residuals appear to be randomly scattered around zero, which shows that the model fits the data well. From the residual plot it can be seen that there are no obvious outliers. The majority of the residuals for the samples in the calibration set are within ±0.05, except for a very few which were still below 0.2. The R2 and RMSEP values were used to evaluate the prediction accuracy for these samples.(153)
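As a small worked example of the R2 calculation, the Python sketch below computes one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the four observed and fitted fractions are hypothetical values chosen for illustration, not results from this study.

```python
def r_squared(y_obs, y_fit):
    """Coefficient of determination: 1 - SSR / SST."""
    mean = sum(y_obs) / len(y_obs)
    ssr = sum((o - f) ** 2 for o, f in zip(y_obs, y_fit))  # residual sum of squares
    sst = sum((o - mean) ** 2 for o in y_obs)              # total sum of squares
    if sst == 0:
        raise ValueError("R2 is undefined when all observed values are equal")
    return 1.0 - ssr / sst


# Hypothetical observed and fitted structure fractions.
observed = [0.1, 0.3, 0.5, 0.7]
fitted = [0.15, 0.25, 0.5, 0.7]
print(round(r_squared(observed, fitted), 3))  # -> 0.975
```

A fit that reproduces every observed value gives 1, while a fit that always returns the mean of the observed values gives 0.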

Figure 4.7. Residual plot of the 60 samples in the Amide I region. Error levels between the fit and the individual data are shown for each sample. Note: The residuals are not sorted by protein but by the order in which they were analysed.

[Figure 4.7 plot: y-axis "Residual" (-0.15 to 0.2), x-axis "Observation" (0-60); labelled points: albumin (sheep), glucose oxidase (Aspergillus niger), aprotinin (bovine lung), thaumatin (T. daniellii).]

4.4.5 PCR

The major difference between PLS and PCR is that PCR uses only the X variables to construct its components, without employing the y variable, although the y response variable is needed for the calculation of the residual. A comparison of PLS and PCR was made to see which method works best for the analysis of protein secondary structure.

For our calibration data, the PLS and PCR cross-validation results were plotted for comparison. This plot is shown in Figure 4.8. The plot suggests that as few as 3 PLS components may be adequate to interpret the information in the predictor (X) which is related to the response (y), but 5 PCR components were required. PCR constructs components to explain the data in the calibration set without much regard to the reference set values (y), while PLS takes the reference set values into consideration while fitting; as a result, PCR requires more components to fit the data. As the number of components increases, the two methods often give similar prediction results, or PCR may do better. As seen in Figure 4.8, it appears that the PLS cross-validation (PLS-CV) technique is more realistic for testing the predictive ability. This is not always the case for all secondary structure results in this analysis; see Section 4.4.8 and Tables 4.1a/b.

Figure 4.8. Performance of CV cross-validation: 4 PLS components are optimal for PLS-CV while 5 components are optimal for PCR-CV (Amide I region of FTIR spectra).

[Figure 4.8 plot: y-axis "RMSECV" (0.01-0.06), x-axis "Number of CV components" (1-11); legend: PLS-crossvalidation, PCR-crossvalidation.]

4.4.6 PLS with 2nd Derivative

Second derivative spectra did better with fewer components in the calibration of the entire spectra (combined Amide I and II regions). The plot of observed vs. fitted response in Figure 4.9 for proteins in H2O solution shows that the regression line fits the data very well for α-helix and β-sheet from bands in the Amide I region. Datasets computed from the second derivatives of the FTIR spectra were used for secondary structure analysis with a PLS calibration. The R2 and RMSECV values for each calibration were used in choosing the best-fit model in the spectral range for α-helix, β-sheet, parallel β-sheet, anti-parallel β-sheet, β-turns and “other”. These results can be seen in the highlighted rows of Table 4.1.

The PLS analysis of the 2nd derivative of the Amide I data used 10 components for the prediction of α-helix content, giving an R2 of 96% with a low RMSEC of 0.040; after cross-validation, this gave an RMSECV value of 0.185 with only 4 components. This is an ideal number of PLS components for the final model because cross-validation chooses the optimal number of components that avoids overfitting of the model. This number of components was identified from the low values of the estimated RMSECV and was used to predict α-helix in the test set, giving an RMSEP of 0.120 with an R2 of 71%. The fit of the data in the test set for α-helix prediction from the Amide I region is plotted in Figure 4.10, and the predicted result for each individual sample can be seen in Tables B1-B6 in Appendix B. Figures 4.11 and 4.12 show the fitted vs. observed plots for the β-sheet calibration and test sets from the Amide I region of the FTIR protein spectra using the same method.



Figure 4.9. The plot of the calibration set for the analysis of α-helix content from the Amide I region: this plot shows PLS fitted vs response (X-ray crystallography values) using the FTIR spectra of 60 proteins in H2O for calibration. The red line of fit going through the origin emphasises the positive correlation in the data.

Figure 4.10. Test set plot for analysis of α-helix from the Amide I data: PLS fitted vs response on 44 proteins in H2O used for testing the model. The model shows the relationship between the crystal structure values and the calculated values of α-helix from FTIR spectra for these proteins.

[Figure 4.9 plot: y-axis "predicted values" (-0.2 to 1.2), x-axis "α-helix train set values" (0-1). Figure 4.10 plot: y-axis "predicted values" (-0.2 to 1.2), x-axis "α-helix test set values" (0-1).]

Figure 4.11. Plot shows PLS prediction for β-sheet vs. X-ray crystallography values used in the training of the Amide I region. The training set had FTIR spectra of 60 proteins in H2O. The red line of fit going through the origin emphasises the positive correlation in the data.

Figure 4.12. Test set plot of Amide I β-sheet: PLS fitted vs response on 44 proteins in H2O used for testing the model.

[Figure 4.11 plot: y-axis "predicted values" (-0.2 to 1.2), x-axis "β-sheet train set values" (0-1). Figure 4.12 plot: y-axis "predicted values" (0-1), x-axis "β-sheet test set values" (0-1).]




Calibration of each of these secondary structural contents was done for the Amide I and Amide II regions separately and for the combined Amide I and II regions, as shown in Table 4.1. Cross-validation R2 results of 0.88 (α-helix), 0.94 (β-sheet), 0.44 (parallel β-sheet), 0.93 (antiparallel β-sheet), 0.44 (mixed β-sheet), 0.53 (3₁₀-helix), 0.60 (turns), and 0.62 (other) for the Amide I region indicate the fraction of explained variance for each predicted structure. The RMSECV must be interpreted differently, as it is an estimate of the average internal prediction error, whereas RMSEC is an error estimate of the calibration. Note that the RMSECV values for α-helix (0.185) and β-sheet (0.111) are higher than those for parallel β-sheet (0.052), 3₁₀-helix (0.037), and turns (0.051). This is not because the predictions of the latter are better, but rather because, in our samples, the variations in parallel β-sheet, 3₁₀-helix and turns are small, with an average of 10% 3₁₀-helix content and 3% parallel β-sheet content for the measured proteins. To assess the predictive accuracy of this statistical analysis, a score, ζ, has been calculated and presented alongside the RMSECV.

4.4.7 Zeta scores ζ

The relationship between the RMSE of cross-validation and the standard deviation of the reference (crystal) structure(s) is calculated as the ratio of the RMSECV (δ) of the FTIR structure to the standard deviation (σ) of the crystal structure,(174) that is, the standard deviation of the protein secondary structure motif in the reference set; therefore,

ζ = δFTIR / σy        eq. (4.4)

This score compares the width of the distribution of the reference structure with the RMSECV and is used to compare prediction accuracy, as it accounts for the natural variation of the crystal structure data relative to the calculated structure obtained from the FTIR/PLS model.(175, 176) A ζ value of one indicates that the secondary structure prediction is as accurate as what would be found from a random guess of the average value.(174, 175) A value less than one indicates that the FTIR prediction values of that structure are better than the value from X-ray crystallography for that structure, and a value around one means both methods give answers that are comparable. A value much higher than one indicates that the performance of the crystal structure data is better.(175, 177) Sometimes the formula in eq. 4.4 is inverted so that the standard deviation of the crystal structure is divided by the calculated RMSE of the FTIR structure. In either case, if the calculated value is less than 1 the numerator is the better predictor, but if the value is greater than 1 the denominator is the better predictor. Using equation 4.4, the ζ scores for α-helix, β-sheet, and parallel β-sheet from the Amide I data are comparable to those from their crystal structures. The ζ values for data in the Amide I & II regions together also gave good predictions for the same secondary structures. The two scores, RMSECV and ζ, should both be considered because a secondary structure with few samples in the data space could give a low RMSE, and ζ would indicate this case.
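The ζ score of eq. 4.4 is a simple ratio, and the short Python sketch below reproduces two of the α-helix values quoted in Table 4.1a from their RMSECV and standard deviation entries. The function name `zeta_score` is hypothetical, and the inputs are the rounded table values, so small differences in the third decimal place from other table entries can be expected.

```python
def zeta_score(rmsecv, reference_std):
    """zeta = RMSECV of the FTIR model / standard deviation of the reference values."""
    if reference_std <= 0:
        raise ValueError("the reference standard deviation must be positive")
    return rmsecv / reference_std


# alpha-helix, Amide I region, values as quoted in Table 4.1a.
print(round(zeta_score(0.104, 0.238), 3))  # PCR, Norm + SNV        -> 0.437
print(round(zeta_score(0.181, 0.228), 3))  # PLS, Norm + 2nd deriv  -> 0.794
```

Because the denominator is the spread of the reference values, a structure type with very little variation across the dataset can show a small RMSECV and still have a ζ above one.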

4.4.8 PCR/PLS comparison

In Figure 4.8, it was seen that it took 5 PCR components to give a low RMSECV. As the object of this comparison is to show the difference in fitting between PLS and PCR with similar numbers of components, it was done using normalized SNV data alone. Normalization of spectral intensity and the use of Standard Normal Variate (SNV) have been introduced and explained in the Multivariate Analysis section of Chapter 1. The use of ‘Normalized + SNV’ for this comparison also showcases the difference in performance between this type of pre-processing and the ‘Normalized + 2nd Derivative’ method (Table 4.1). For α-helix, PCR produced a lower RMSECV (0.104) and ζ score (0.437) than PLS (SNV), which gave a cross-validation RMSECV of 0.143 and a ζ score of 0.603. PLS with 2nd derivative gave an RMSECV of 0.181 and a ζ score of 0.794; however, its calibration and its prediction with the independent test set gave better values than PCR and PLS (SNV) (see Table 4.1a for α-helix). The α-helix ζ score was derived by dividing the RMSECV by the standard deviation of the α-helix motif of the crystal structures used in the reference set; this is done for all secondary structures. ζ is explained in section 4.4.7.


Method  Pre-processing    Region        Comp#  R2    RMSEC  ζ-PLS  CV comp#  R2 (CV)  RMSECV  ζ-CV   R2 (pred)  RMSEP  STDDEV

α-Helix
PCR     Norm + SNV Data   Amide I       10     0.73  0.122  0.513  5         -        0.104   0.437  0.70       0.123  0.238
PLS     Norm + SNV Data   Amide I       10     0.88  0.081  0.340  4         0.73     0.143   0.603  0.69       0.126  0.238
PLS     Norm + 2nd Deriv  Amide I       10     0.96  0.040  0.175  4         0.85     0.181   0.794  0.71       0.120  0.228
PLS     Norm + 2nd Deriv  Amide II      15     0.98  0.031  0.136  4         0.65     0.222   0.974  0.63       0.164  0.228
PLS     Norm + 2nd Deriv  Amide I & II  10     0.98  0.028  0.124  5         0.91     0.176   0.773  0.67       0.132  0.228

β-Sheet
PCR     Norm + SNV Data   Amide I       10     0.78  0.072  0.461  5         -        0.014   0.090  0.59       0.088  0.156
PLS     Norm + SNV Data   Amide I       10     0.90  0.047  0.301  3         0.77     0.088   0.567  0.69       0.077  0.156
PLS     Norm + 2nd Deriv  Amide I       10     0.95  0.020  0.128  9         0.94     0.109   0.699  0.78       0.070  0.156
PLS     Norm + 2nd Deriv  Amide II      14     0.96  0.031  0.196  4         0.68     0.179   1.148  0.53       0.101  0.156
PLS     Norm + 2nd Deriv  Amide I & II  10     0.99  0.016  0.103  7         0.96     0.100   0.640  0.77       0.071  0.156

Parallel β-Sheet
PCR     Norm + SNV Data   Amide I       10     0.05  0.035  0.919  6         -        0.006   0.157  0.01       0.039  0.038
PLS     Norm + SNV Data   Amide I       15     0.73  0.018  0.472  3         0.73     0.061   1.588  0.03       0.034  0.038
PLS     Norm + 2nd Deriv  Amide I       15     0.96  0.007  0.184  3         0.44     0.050   1.302  0.08       0.027  0.038
PLS     Norm + 2nd Deriv  Amide II      11     0.92  0.092  2.414  3         0.43     0.045   1.187  0.08       0.031  0.038
PLS     Norm + 2nd Deriv  Amide I & II  10     0.96  0.008  0.209  3         0.49     0.049   1.276  0.16       0.030  0.038

Anti-Parallel β-Sheet
PCR     Norm + SNV Data   Amide I       10     0.74  0.077  0.509  5         -        0.005   0.035  0.58       0.085  0.152
PLS     Norm + SNV Data   Amide I       12     0.91  0.045  0.296  3         0.73     0.094   0.621  0.65       0.077  0.152
PLS     Norm + 2nd Deriv  Amide I       10     0.94  0.035  0.230  8         0.94     0.109   0.719  0.78       0.063  0.152
PLS     Norm + 2nd Deriv  Amide II      15     0.97  0.024  0.159  2         0.37     0.173   1.137  0.23       0.123  0.152
PLS     Norm + 2nd Deriv  Amide I & II  10     0.99  0.016  0.106  8         0.97     0.101   0.664  0.75       0.067  0.152

Table 4.1a. Prediction results of PLS models of calibration and validation sets for Amide I, Amide II, and Amide I & II. Results using Normalized + 2nd derivative pre-processed data are shown as grey highlights (rows 3, 4 & 5). Results for the PCR and PLS methods on ‘Normalized + SNV’ data are also shown in the first and second rows. For all methods, 60 protein samples were used for calibration and for internal cross-validation, and 44 samples were used for the independent test set.

Method  Pre-processing    Region        Comp#  R2    RMSEC  ζ-PLS  CV comp#  R2 (CV)  RMSECV  ζ-CV   R2 (pred)  RMSEP  STDDEV

Mixed β-Sheet
PCR     Norm + SNV Data   Amide I       10     0.22  0.012  0.876  5         -        0.000   0.009  0.14       0.013  0.014
PLS     Norm + SNV Data   Amide I       17     0.79  0.006  0.441  3         0.18     0.013   0.955  0.16       0.013  0.014
PLS     Norm + 2nd Deriv  Amide I       15     0.91  0.004  0.294  9         0.79     0.024   1.765  0.38       0.013  0.014
PLS     Norm + 2nd Deriv  Amide II      15     0.94  0.003  0.243  3         0.41     0.016   1.157  0.32       0.015  0.014
PLS     Norm + 2nd Deriv  Amide I & II  11     0.95  0.003  0.218  3         0.50     0.014   1.050  0.27       0.014  0.014

3₁₀-helix
PCR     Norm + SNV Data   Amide I       10     0.12  0.030  0.929  5         -        0.001   0.024  0.01       0.029  0.032
PLS     Norm + SNV Data   Amide I       15     0.60  0.020  0.629  2         0.09     0.033   1.034  0.05       0.030  0.032
PLS     Norm + 2nd Deriv  Amide I       15     0.92  0.008  0.252  3         0.53     0.037   1.176  0.26       0.029  0.032
PLS     Norm + 2nd Deriv  Amide II      11     0.90  0.010  0.308  2         0.27     0.037   1.155  0.28       0.015  0.032
PLS     Norm + 2nd Deriv  Amide I & II  10     0.95  0.007  0.212  3         0.41     0.038   1.203  0.16       0.026  0.032

β-turns
PCR     Norm + SNV Data   Amide I       13     0.15  0.039  0.914  4         -        0.001   0.034  0.03       0.044  0.042
PLS     Norm + SNV Data   Amide I       15     0.60  0.027  0.630  3         0.13     0.046   1.085  0.08       0.045  0.042
PLS     Norm + 2nd Deriv  Amide I       15     0.94  0.010  0.237  4         0.60     0.051   1.219  0.23       0.043  0.042
PLS     Norm + 2nd Deriv  Amide II      12     0.91  0.013  0.304  4         0.45     0.051   1.202  0.05       0.042  0.042
PLS     Norm + 2nd Deriv  Amide I & II  10     0.97  0.007  0.175  4         0.65     0.053   1.259  0.12       0.041  0.042

Other
PCR     Norm + SNV Data   Amide I       10     0.49  0.076  0.530  3         -        0.006   0.043  0.18       0.142  0.144
PLS     Norm + SNV Data   Amide I       15     0.88  0.037  0.260  3         0.47     0.097   0.674  0.20       0.140  0.144
PLS     Norm + 2nd Deriv  Amide I       10     0.90  0.045  0.313  4         0.62     0.157   1.096  0.65       0.082  0.144
PLS     Norm + 2nd Deriv  Amide II      12     0.94  0.034  0.239  1         0.26     0.164   1.141  0.63       0.066  0.144
PLS     Norm + 2nd Deriv  Amide I & II  10     0.97  0.024  0.167  4         0.72     0.198   1.377  0.69       0.082  0.144

R2 = correlation coefficient of the model for that structure. RMSEC = calculated standard error of the structure distribution in the calibration set. RMSECV = calculated standard error of the structure distribution from the cross-validation of the calibration set. RMSEP = calculated standard error of the structure distribution in the independent test set. ζ = RMSE/STDDEV (Standard Deviation).

Table 4.1b. Prediction results of PLS models of calibration and validation sets for Amide I, Amide II, and Amide I & II. Results using Normalized + 2nd derivative pre-processed data are shown as grey highlights (rows 3, 4 & 5). Results for the PCR and PLS methods on ‘Normalized + SNV’ data are also shown in the first and second rows. For all methods, 60 protein samples were used for calibration and for internal cross-validation, and 44 samples were used for the independent test set. (Continued from Table 4.1a.)


4.4.9 Results for Proteins in D2O

Protein spectra show strong bands in the Amide I and Amide II regions, around 1645 cm⁻¹ and 1550 cm⁻¹, respectively. H2O is often a preferable solvent to D2O because H2O retains the native properties of the protein, while D2O changes the protein properties somewhat compared with the native structure; in addition, amide hydrogens in the protein can readily exchange with deuterium.(13, 50, 53, 178) However, protein absorption peaks in the Amide I region (1600-1700 cm⁻¹) overlap the water absorption band in the same region (around 1643 cm⁻¹), making it difficult to extract information from the spectrum of a protein in H2O. One solution is to use D2O as a solvent because it does not absorb in the part of the Amide I region where water absorbs.(13, 50) However, the use of D2O can cause problems; whereas the Amide I region is easily observed, the Amide II region can be affected by the D2O solvent absorption, which then poses issues for conformational studies.(50) With the advent of better IR instruments and software, spectra of proteins in H2O are now considered to be suitable for secondary structure analysis.

For this project, spectra of selected proteins collected in both solvents were used to investigate and compare protein structure in the Amide I region. The same techniques and validation steps applied to the proteins dissolved in H2O were also applied to 16 selected proteins in D2O, in order to analyse the effect of the solvent on secondary structure determination using the same method (FTIR and multivariate modelling).

Overall, the predicted values of α-helix and β-sheet for the 16 proteins measured in both H2O and D2O are similar. However, D2O did remarkably better for the α-helix prediction of myoglobin, with a value of 0.70, which is close to the reference value (0.74), while the same protein in H2O gave a value of 0.57. The RMSEP values for α-helix in H2O and D2O are 0.066 and 0.069, respectively, while the RMSEP values for β-sheet are 0.044 and 0.045; H2O is therefore marginally in the lead. Table 4.2 shows good results for α-helix and β-sheet in the Amide I region for both H2O and D2O; the accuracy of the multivariate results for proteins in H2O makes it unnecessary to carry out spectroscopic studies in D2O. Results for other structures are listed in Appendix B, Tables B1-B6.
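The summary statistics quoted above can be checked directly from the α-helix columns of Table 4.2. The sketch below is illustrative only: it assumes RMSEP is the root of the mean squared prediction error over the 16 proteins and that R2 is computed as 1 − SSE/SST; the thesis does not spell out either formula at this point, and the function names are hypothetical. Because the table values are rounded to two decimals, the results agree with the reported figures only approximately.

```python
import math

# α-helix fractions transcribed from Table 4.2 (16 proteins, in table order).
observed = [0.28, 0.08, 0.00, 0.41, 0.67, 0.32, 0.74, 0.29,
            0.26, 0.27, 0.10, 0.20, 0.19, 0.27, 0.28, 0.07]
pred_h2o = [0.20, 0.10, 0.05, 0.52, 0.67, 0.41, 0.57, 0.29,
            0.27, 0.29, 0.12, 0.18, 0.20, 0.29, 0.29, -0.01]
pred_d2o = [0.31, 0.05, 0.05, 0.44, 0.58, 0.43, 0.70, 0.17,
            0.21, 0.25, 0.16, 0.10, 0.30, 0.30, 0.25, 0.13]


def rmsep(y_true, y_pred):
    """Root mean square error of prediction (assumed definition)."""
    sse = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sse / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination as 1 - SSE/SST (assumed definition)."""
    mean = sum(y_true) / len(y_true)
    sse = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - sse / sst


for label, pred in (("H2O", pred_h2o), ("D2O", pred_d2o)):
    print(label, round(rmsep(observed, pred), 3), round(r_squared(observed, pred), 2))
```

The same two functions can be applied to the β-sheet columns of the table.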

Table 4.2. Comparison of PLS results for the secondary structure of proteins in H2O and D2O. The percentage secondary structure for each protein derived from X-ray crystallography served as a reference.

                                            α-Helix                          β-Sheet
PDB ID  Protein name                        Observed   Predicted  Predicted  Observed   Predicted  Predicted
                                            (X-ray)    (in H2O)   (in D2O)   (X-ray)    (in H2O)   (in D2O)
3CWL    alpha-1-antitrypsin                 0.28       0.20       0.31       0.31       0.33       0.32
3M1K    carbonic anhydrase (human)          0.08       0.10       0.05       0.29       0.35       0.33
1YPH    chymotrypsin (bovine)               0.00       0.05       0.05       0.35       0.35       0.36
1HRC    cytochrome (horse heart)            0.41       0.52       0.44       0.00       0.02       0.04
2QSP    hemoglobin (bovine)                 0.67       0.67       0.58       0.00       0.00       0.02
2LYZ    lysozyme (chicken egg)              0.32       0.41       0.43       0.06       0.07       0.04
1NZ2    myoglobin (horse muscle)            0.74       0.57       0.70       0.00       0.03       0.00
3DLW    native-anti-chymotrypsin            0.29       0.29       0.17       0.36       0.24       0.32
1T1F    native form of anti-thrombin        0.26       0.27       0.21       0.27       0.34       0.38
2OAY    native form of c1 inhibitor         0.27       0.29       0.25       0.32       0.28       0.26
1YX9    pepsin (pepsin gastric mucosa)      0.10       0.12       0.16       0.42       0.41       0.33
1RBX    ribonuclease A (bovine pancreas)    0.20       0.18       0.10       0.30       0.29       0.31
3DJV    ribonuclease S (bovine pancreas)    0.19       0.20       0.30       0.33       0.32       0.33
1ATH    split form of anti-thrombin         0.27       0.29       0.30       0.32       0.33       0.30
1LQ8    split form of c1 inhibitor          0.28       0.29       0.25       0.35       0.28       0.32
2TGA    trypsinogen (bovine pancreas)       0.07       -0.01      0.13       0.32       0.35       0.34
R2                                                     0.88       0.87                  0.90       0.90
RMSEP                                                  0.066      0.069                 0.044      0.045

PLS and other quantitative analysis methods have been used to quantify protein secondary structure in H2O and D2O.(157) Amide I band assignments for α-helices and β-sheets in both H2O and D2O have been published by Kong and Yu,(13) and Dousseau and Pezolet (41) reported on protein structural content in both solvents. Their Partial Least Squares analysis of 13 proteins in H2O and D2O included several proteins that were also used in this study (chymotrypsin, cytochrome c, lysozyme, myoglobin, ribonuclease A and anti-trypsin). Overall, their D2O results were poorer than their H2O results, especially for the prediction of α-helix and for myoglobin in particular, which is not the case in this study. In this study, the predictions of secondary structure did not differ greatly between the two solvents; therefore, protein spectra were collected in the solvent that retains the native-like structure of the proteins, namely water.

4.5 Other PLS Applications

Interval Partial Least Squares (iPLS) is a variant of PLS that performs regression analysis on sub-intervals of the protein spectra. Local models are built on the interval(s) chosen on the basis of RMSECV performance. Secondary structure models developed from sub-intervals may give better predictions than models based on the whole spectral range. One advantage of iPLS is that spectral regions that introduce noise can be avoided.(157) In addition, knowing the predictive ability of sub-regions of FTIR spectra could lead to instruments that reduce production cost by employing only a few significant spectral regions.(157)

The regions of importance in iPLS are not always contained within one or two intervals, so the spectral range must be split in several different ways to find the optimal spectral model. This time-consuming effort was one of the reasons why iPLS was not pursued further; in addition, significant spectral regions had already been identified by PCA.

The methods used and the results obtained when iPLS was applied to our dataset are presented below.

4.5.1 iPLS Method

FTIR spectra for iPLS were collected following the procedure described in Chapter 2. The Kennard-Stone (KS) algorithm was used to split the spectra into calibration and test sets, as was done for the traditional PLS analysis in Section 4.2.1. For iPLS, the spectra were normalized and split into smaller frequency regions of roughly the same size. Norgaard's iPLS_ToolBox for Matlab (179) was used to implement this work. The software allows the matrix of entire spectra to be split into several wavenumber intervals, as shown in Table 4.3; iPLS is performed on the entire spectra (global) and on the intervals (local) simultaneously. The interval with the lowest error indicates the region of greatest importance in terms of structural information.
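The interval boundaries listed in Table 4.3 can be reproduced with a few lines of code. The following is a minimal sketch, not the toolbox's implementation: it assumes 101 variables at 1 cm-1 spacing running from 1700 down to 1600 cm-1, and it assumes that when the variables do not divide evenly the leftover variables go to the first intervals, which is the pattern seen in Table 4.3. The function name `split_into_intervals` is hypothetical.

```python
def split_into_intervals(n_variables, n_intervals, first_wavenumber, step=-1):
    """Split variables 1..n_variables into near-equal contiguous intervals."""
    base, extra = divmod(n_variables, n_intervals)
    intervals = []
    start = 1
    for k in range(n_intervals):
        # The first `extra` intervals receive one additional variable.
        size = base + (1 if k < extra else 0)
        end = start + size - 1
        intervals.append({
            "interval": k + 1,
            "start_var": start,
            "end_var": end,
            "start_wn": first_wavenumber + (start - 1) * step,
            "end_wn": first_wavenumber + (end - 1) * step,
            "n_vars": size,
        })
        start = end + 1
    return intervals


amide_i = split_into_intervals(101, 7, 1700)
for row in amide_i:
    print(row)
```

Under these assumptions, interval 5 spans variables 60 to 73 (1641 to 1628 cm-1) and interval 6 spans variables 74 to 87 (1627 to 1614 cm-1), matching Table 4.3.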

For each type of secondary structure, 7 spectral intervals were initially assigned for building the model. PLS models were calculated for each of the sub-intervals using up to 5 latent variables, with 'leave 5% out' cross-validation across all sub-intervals. A plot of the Root Mean Square Error of Cross-Validation (RMSECV) against the number of latent variables (PLS components) is shown in Figure 4.13. The minimum RMSECV occurs at component number 2; therefore, 2 components were again chosen to model all sub-intervals.
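The cross-validation bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the toolbox procedure: the 5% held-out groups are taken as contiguous blocks (the thesis does not say how samples were grouped), and the RMSECV values passed to `best_component_count` are placeholder numbers chosen for the example, not values read from Figure 4.13. Both function names are hypothetical.

```python
def leave_fraction_out_folds(n_samples, fraction=0.05):
    """Yield (train, test) index lists, holding out `fraction` of samples per fold."""
    fold_size = max(1, round(n_samples * fraction))
    folds = []
    for start in range(0, n_samples, fold_size):
        test_idx = list(range(start, min(start + fold_size, n_samples)))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        folds.append((train_idx, test_idx))
    return folds


def best_component_count(rmsecv_per_component):
    """Return the 1-based number of components with the lowest RMSECV."""
    best = min(range(len(rmsecv_per_component)),
               key=rmsecv_per_component.__getitem__)
    return best + 1


# 60 calibration proteins with 5% left out per fold gives 20 folds of 3 samples.
folds = leave_fraction_out_folds(60, 0.05)
print(len(folds), len(folds[0][1]))

# Placeholder RMSECV values for 1..5 components (illustrative only).
example_rmsecv = [0.20, 0.15, 0.16, 0.17, 0.18]
print(best_component_count(example_rmsecv))
```

In a full analysis, each fold's training indices would be used to fit a PLS model and the held-out predictions pooled to obtain the RMSECV for each number of components.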



Figure 4.13. Plot of RMSECV vs. PLS component for the global model. Component

number 2 has the lowest RMSECV.

The RMSECV was re-calculated for each sub-interval and for the full spectrum. The results of this variable selection are shown in Figures 4.14a and 4.14b for α-helix and β-sheet, respectively; the columns represent the individual spectral regions (the intervals), and the numbers in italics denote the optimal number of components for each interval.(180) For α-helix, interval number 5 (1641–1628 cm-1) has the lowest RMSECV of all the intervals. This local model can be compared against the global model, which covers the entire Amide I spectral region. Although interval number 5 performs worse than the global model (the dotted horizontal line across the plot indicates the global RMSECV), its RMSECV is lower than that of the other iPLS spectral intervals. A calibration model was therefore developed from interval number 5, which covers the range 1641 cm-1 to 1628 cm-1, using the 3 components indicated for that interval (see Table 4.3 for the exact wavenumbers of each interval). Figure 4.15a shows the spectral region used for iPLS calibration of α-helix content.

For β-sheet, 3 components were used to calculate the RMSECV for the 7 spectral intervals. Interval number 6 had an RMSECV lower than that of the other iPLS spectral intervals and also lower than that of the global model. A calibration model was developed from interval number 6, which covers 1627 cm-1 to 1614 cm-1, using the 3 components indicated for that interval. This β-sheet spectral region deviates from the β-sheet band assignments reported in the literature. Figure 4.15b depicts the spectral region used for iPLS calibration of β-sheet.

Figure 4.14a. Each bar in the plot is a spectral interval (interval numbers 1–7). Interval number 5 corresponds to the range 1641–1628 cm-1 for α-helix. The dotted line represents the RMSECV of the global iPLS model. The numbers in italics in each interval bar signify the number of iPLS components used in fitting each local model. See Table 4.3 for the intervals and their wavenumber ranges. The curve in the background is a spectrum of the Amide I region.

[Plot for Figure 4.14a: RMSECV (y-axis, 0–0.2) vs. iPLS interval number (x-axis, 1–7). Components per interval, in order: 1, 4, 2, 3, 3, 3, 1.]



Figure 4.14b. The plot shows interval number 6 (6th bar), corresponding to the range 1627–1614 cm-1 for β-sheet (see Table 4.3 for the beginning and ending wavenumbers of each interval). The dotted line represents the RMSECV of the global iPLS model. The numbers in italics in each interval bar signify the number of iPLS components used in fitting each local model. The curve in the background is a spectrum of the Amide I region.

[Plot for Figure 4.14b: RMSECV (y-axis, 0–0.1) vs. iPLS interval number (x-axis, 1–7). Components per interval, in order: 1, 4, 2, 5, 3, 3, 3.]

Table 4.3. Intervals and their corresponding frequencies; interval number 8, which covers the whole Amide I region, was used for the iPLS global model.

Interval  Start variable  End variable  Start wavenumber  End wavenumber  Number of variables
1         1               15            1700              1686            15
2         16              30            1685              1671            15
3         31              45            1670              1656            15
4         46              59            1655              1642            14
5         60              73            1641              1628            14
6         74              87            1627              1614            14
7         88              101           1613              1600            14
8         1               101           1700              1600            101



Figure 4.15a. Plot of the normalized Amide I spectral region showing the region (grey bar) used for iPLS analysis of α-helix structure. The grey bar contains the most variation for α-helix structure. This region (1641–1628 cm-1) corresponds to interval number 5 in Figure 4.14a.

Figure 4.15b. Plot of the normalized Amide I spectral region showing the most significant region (grey bar) for β-sheet structure. The grey bar contains the most variation for β-sheet structure. This region (1627–1614 cm-1) corresponds to interval number 6 in Figure 4.14b.

[Plots for Figures 4.15a and 4.15b: normalized intensities (y-axis, 0–1) vs. wavenumber in cm-1 (x-axis, 1600–1700).]




The α-helix and β-sheet secondary structural motifs of proteins have well-defined hydrogen bonding patterns, which make them relatively compact; as a result, these structures are better defined than looser spatial arrangements such as turns and coils.(157) This is evident in the iPLS results obtained in this work. For each structural motif, the selected interval gave better predictions than all the other intervals that were tried.

4.6.1 α-helix

Figure 4.16a shows the α-helix fraction predicted by the global model from the Amide I region against the reference fraction derived from X-ray diffraction for the same structure. The RMSECV of 0.144 is a little better than that of the traditional PLS model (0.181), but the iPLS R2 for cross-validation (0.80) is lower than the value of 0.85 found for the traditional PLS model. This difference is not large enough to be a deciding factor in which method to use. The RMSECV of the traditional PLS model may be higher because that model was fitted with 4 components, whereas the iPLS model was fitted with 3.

For each prediction of secondary structure content produced for these intervals, the prediction accuracy was assessed by comparing the iPLS prediction with the reference X-ray-derived PDB data for that secondary structure. The plot in Figure 4.16b shows the line of fit for α-helix from the interval number 5 model (1641–1628 cm-1) with 3 latent variables (components). This model gave an RMSECV of 0.149 and an R2 of 0.78, which is below the performance of the global model. Finding a way to divide a matrix of spectra to give optimal iPLS results is a daunting task and required many trials; sometimes two intervals may need to be merged into one to obtain accurate results from the structural analysis.

Figure 4.16a. Prediction line of fit for α-helix from the iPLS global model in the Amide I region (1700–1600 cm-1). 3 iPLS components were used for building the model. The numbers in the plot represent the different proteins in the dataset, which are listed in Table 4.4.

[Plot for Figure 4.16a: Predicted (FTIR/iPLS) vs. Experimental (X-ray crystallography: α-helix), points labelled 1–60; R2 = 0.8026, RMSECV = 0.1442.]

Figure 4.16b. Prediction line of fit for α-helix from the iPLS local model in the Amide I region (1641–1628 cm-1). 3 PLS components were used for building the model. The numbers in the plot are the different proteins in the dataset, which are named in Table 4.4.

[Plot for Figure 4.16b: Predicted (FTIR/iPLS) vs. Experimental (X-ray crystallography: α-helix), points labelled 1–60; R2 = 0.7823, RMSECV = 0.1495.]



4.6.2 β-sheet

Figure 4.17a shows the β-sheet fraction predicted by the global iPLS model from the Amide I region against the X-ray reference fraction from the PDB for the same structure. The RMSECV for this global model is 0.092 with an R2 of 0.81, while the RMSECV from the traditional PLS model for β-sheet was 0.050 with an R2 of 0.94. The global iPLS model for β-sheet therefore performed much more poorly than the traditional PLS model, but this could be due to the number of components used (3 for the iPLS global model vs. 9 for traditional PLS). The plot in Figure 4.17b shows the line of fit for β-sheet from the model of interval number 6 (1627–1614 cm-1) with 3 latent variables (components). This model gave an RMSECV of 0.087 and an R2 of 0.83, which is slightly better than the performance of the iPLS global model. Again, this could be due to the number of components: 3 were used both for the local model with 14 variables and for the global model with 101 variables (see Table 4.3).

Figure 4.17a. Prediction line of fit for β-sheet from the iPLS global model in the Amide I region (1700–1600 cm-1). 3 iPLS components were used for building the model. The numbers in the plot are the different proteins in the dataset, which are named in Table 4.4.

[Plot for Figure 4.17a: Predicted (FTIR/iPLS) vs. Experimental (X-ray crystallography: β-sheet), points labelled 1–60; R2 = 0.8052, RMSECV = 0.0924.]



Figure 4.17b. Prediction line of fit for β-sheet from the iPLS local model in the Amide I region (1627–1614 cm-1). 3 iPLS components were used for building the model from the 6th spectral interval. The numbers in the plot are the different proteins in the dataset, which are named in Table 4.4.

[Plot for Figure 4.17b: Predicted (FTIR/iPLS) vs. Experimental (X-ray crystallography: β-sheet), points labelled 1–60; R2 = 0.8281, RMSECV = 0.0868.]

4.7 Validation Set

The Kennard-Stone algorithm was used to split the protein samples into training and test sets. Although the number of proteins in the independent test set was the same for both PLS and iPLS, the two methods differed in some respects. The proteins in the training sets were not completely identical: several proteins in the calibration set of the PLS method were in the test set of the iPLS method. This is because, for iPLS, the selection of the calibration and test sets was based on intervals (regions of the spectra) rather than on the full spectra. The number of components used for the test-set prediction models was based on the cross-validation error of the calibration models. The iPLS validation results for the 44 proteins in the test set were as good as those of the traditional PLS validation, which is very promising for the iPLS method. This comparison is shown in Table 4.5. Figure 4.18a shows that the model for α-helix fits the data, with a Root Mean Square Error of Prediction (RMSEP) of 0.135 and an R2 of 0.79. The β-sheet model also fits the data well, with an RMSEP of 0.089 and an R2 of 0.79 (Figure 4.18b). These results are comparable to the traditional PLS results. For the PLS calibration, cross-validation and independent test set results, see Tables 4.1a and 4.1b.
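The Kennard-Stone split used for the training and test sets can be sketched in a few lines. This is a generic textbook form of the algorithm, given only as an illustration: it assumes Euclidean distances between samples, and the function name `kennard_stone` and the toy one-dimensional data are hypothetical. It is not the routine used in the thesis.

```python
import math


def kennard_stone(samples, n_select):
    """Pick n_select (>= 2) samples that cover the data space uniformly.

    Starts from the two most distant samples, then repeatedly adds the
    sample whose nearest already-selected neighbour is farthest away.
    Returns (selected_indices, remaining_indices).
    """
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in (first, second)]
    while len(selected) < n_select and remaining:
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining


# Toy example: five one-dimensional "spectra"; choose three for calibration.
toy = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
calibration, test = kennard_stone(toy, 3)
print(calibration, test)
```

Because the distances depend on which variables are included, running the selection on a spectral interval rather than on the full spectrum can assign a protein to a different set, which is consistent with the differences between the PLS and iPLS sets noted above.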

Figure 4.18a. Plot of the local model line of fit for the prediction of α-helix in Amide I (1641–1628 cm-1). 3 PLS components for the 5th spectral interval were used for the prediction of proteins in the test set. The numbers in the plot are the different proteins in the dataset, which are named in Table 4.4.

[Plot for Figure 4.18a: Predicted (FTIR/iPLS) vs. Experimental (X-ray crystallography), test-set points labelled 1–45; R2 = 0.7893, RMSEP = 0.1349.]

Figure 4.18b. Plot of the local model line of fit for the prediction of β-sheet in Amide I (1627–1614 cm-1). 3 PLS components for the 6th spectral interval were used for the prediction of proteins in the test set. The numbers in the plot are the different proteins in the dataset, which are named in Table 4.4.

[Plot for Figure 4.18b: Predicted (FTIR/iPLS) vs. Experimental (X-ray crystallography: β-sheet), test-set points labelled 1–45; R2 = 0.7927, RMSEP = 0.0894.]



Table 4.4. The list of proteins used for the calibration and test sets in iPLS. The proteins are numbered to match the numbers in the line-of-fit plots for both calibration and validation of the iPLS models. Table A2 in Appendix A lists the full name and source (organism) of these proteins.

Num  Proteins in Calibration Set   Num  Proteins in Calibration Set   Num  Proteins in Test Set         Num  Proteins in Test Set
1    annexinV                      31   esterase (R. oryzae)          1    pyruvate                     23   lectin
2    elastin                       32   calmodulin                    2    transketolase                24   lipase (wheat)
3    porin                         33   fibrinogen                    3    carbonic anhydrase           25   fetuin
4    trypsin                       34   trypsin-chymotrypsin          4    pepsin                       26   IgG (bovine)
5    alpha-1-antitrypsin           35   invertase                     5    alcohol-dehydrog             27   trypsin (hog)
6    myoglobin                     36   IgG (human)                   6    reaction centre R. Sphae.    28   alpha-crystallin
7    avidin                        37   lipase (C. rugosa)            7    thaumatin                    29   superoxide-dismutase
8    phosvitin                     38   albumin (bovine)              8    lysozyme (chicken egg)       30   esterase (B. subtilis)
9    proteinase                    39   d_amino acid dehydr           9    ovalbumin                    31   nt-anti-chymotryp
10   tryptophan synthase           40   pyruvate                      10   melittin                     32   native-anti-thrombin
11   alpha casein                  41   transketolase                 11   alpha_lactalbumin            33   lactic dehydrog
12   hexokinase                    42   carbonic anhydrase            12   staphylokinase-B             34   split-c1 inhibitor
13   l-lactic dehydrogenase        43   pepsin                        13   choline oxidase              35   rhodopsin bleached
14   alpha amylase_type2           44   alcohol-dehydrog              14   ribonuclease B               36   split-anti-chymotryp
15   insulin                       45   reaction centre R. Sphae.     15   albumin (human)              37   conalbumin
16   chymotrypsin                  46   thaumatin                     16   trypsin (bovine)             38   anti-thrombin
17   sact1 DIFF                    47   lysozyme (chicken egg)        17   ferritin                     39   ribonuclease A
18   subtilisin                    48   ovalbumin                     18   peroxidase                   40   trypsinogen
19   rhodopsin unbleached          49   melittin                      19   galactose                    41   albumin (sheep)
20   photosystem II                50   alpha_lactalbumin             20   glucose oxidase              42   collagen
21   concanavalin                  51   staphylokinase-B              21   hemoglobin                   43   split-ovalbumin
22   l-glutathion                  52   choline oxidase               22   cytochrome                   44   native-c1 inhibitor
23   aprotinin                     53   ribonuclease B
24   purple membrane               54   albumin (human)
25   β-lactoglobulin               55   trypsin (bovine)
26   ubiquitin                     56   ferritin
27   pig citrate synthase          57   peroxidase
28   ribonuclease S                58   galactose
29   urease                        59   glucose oxidase
30   papain                        60   hemoglobin



Table 4.5. Results of the iPLS models for both global and local regression analyses. Results of the traditional PLS analysis of the Amide I region (1600–1700 cm-1) are listed for comparison. R2 (CV) and RMSECV refer to the calibration set (60 proteins); R2 (pred) and RMSEP refer to the validation set (44 proteins).

Structure       Method  Spectral region (cm-1)   Interval number  CV comp #  R2 (CV)  RMSECV  R2 (pred)  RMSEP
α-Helix         PLS     1700-1600                Full spectrum    4          0.85     0.181   0.71       0.120
                iPLS    1700-1600                Global model     3          0.80     0.144   --         --
                iPLS    1641-1628                5                3          0.78     0.149   0.79       0.135
β-Sheet         PLS     1700-1600                Full spectrum    9          0.94     0.109   0.78       0.070
                iPLS    1627-1614                6                3          0.83     0.087   0.79       0.089
Parallel-β      PLS     1700-1600                Full spectrum    3          0.44     0.050   -0.08      0.027
                iPLS    1700-1600                Global model     2          -0.05    0.038   --         --
                iPLS    1627-1614                6                2          0.03     0.036   0.08       0.035
AntiParallel-β  PLS     1700-1600                Full spectrum    8          0.94     0.109   0.78       0.063
                iPLS    1627-1614                6                3          0.78     0.091   0.7        0.095
Mixed-β         PLS     1700-1600                Full spectrum    9          0.79     0.024   0.38       0.013
                iPLS    1637-1624                5                2          0.37     0.015   0.41       0.016
310-Helix       PLS     1700-1600                Full spectrum    3          0.53     0.037   0.26       0.029
                iPLS    1700-1600                Global model     3          -0.2     0.035   --         --
                iPLS    1649-1634, 1657-1644     3, 4             1          0.01     0.031   -0.03      0.03
β-Turns         PLS     1700-1600                Full spectrum    4          0.60     0.051   0.23       0.043
                iPLS    1687-1674                2                1          0.14     0.043   0.11       0.041
Coils           PLS     1700-1600                Full spectrum    4          0.62     0.157   0.65       0.082
                iPLS    1677-1664                3                5          0.54     0.099   0.42       0.136
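The local and global iPLS models can be compared side by side for the structures where both RMSECV values are available. The sketch below uses only values given in this chapter: the α-helix, parallel-β and 310-helix entries come from Table 4.5, and the β-sheet global value (0.092) and local value (0.087) are those quoted in Section 4.6.2. The dictionary layout and variable names are illustrative choices.

```python
# RMSECV of the iPLS global model vs. the selected local interval model.
rmsecv = {
    "α-Helix": {"global": 0.144, "local": 0.149},
    "β-Sheet": {"global": 0.092, "local": 0.087},
    "Parallel-β": {"global": 0.038, "local": 0.036},
    "310-Helix": {"global": 0.035, "local": 0.031},
}

# Structures for which the local interval model has the lower RMSECV.
local_wins = [name for name, v in rmsecv.items() if v["local"] < v["global"]]

for name, v in rmsecv.items():
    diff = v["local"] - v["global"]
    print(f"{name}: local - global = {diff:+.3f}")
print("Local model better for:", local_wins)
```

For these four structures the local model has the lower RMSECV in every case except α-helix.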



4.8 Conclusion

FTIR spectroscopy combined with PLS gives excellent quantitative results for protein structural analysis. Tables 4.1a and 4.1b show good prediction of the proteins' structural motifs (α-helix, β-sheet and others) compared with the values given by X-ray crystallography. On average, our regression analyses for the calibration set generated predictions to within 1–5% accuracy for all secondary structure content (α-helix, β-sheet, 310-helix, turns and coils) in the Amide I region. The independent test set results were very good as well. The results obtained for proteins in H2O and D2O were comparable to the X-ray crystallography reference data, although in this analysis the results for proteins in H2O (Table 4.2) were slightly better than those for proteins in D2O. FTIR absorption spectra contain a great deal of information that can be utilised for the classification and quantitative determination of proteins, and FTIR is a widely applicable technique; the use of Partial Least Squares (PLS) on FTIR spectra has been demonstrated to be an effective method for reliable quantitative structural analysis of proteins.(157) The Amide II region also gave results comparable to those of the Amide I region, especially for predictions of helix content. The Amide II region is known to be less sensitive to protein secondary structure,(181) and some literature reports on the analysis of secondary structure in this region describe less accurate results, so a good prediction from this region was impressive. Tables 4.1a and 4.1b show that Amide I and II combined gave more accurate results. In addition to the elucidation of 310-helix, which has previously been attempted only with a linear regression model by Goormaghtigh, Ruysschaert and Raussens, without meaningful results as they clearly stated,(151) we have also been able to differentiate between parallel and anti-parallel β-sheet, and turns, using our method. This is not always done by spectral analysts because parallel β-sheets are less common and turns are less structured and difficult to determine.



Identifying the critical regions of a spectrum for any given secondary structure motif can reduce production cost and analysis time and improve models.(182) iPLS local models of the Amide I FTIR protein bands showed better results than the global models for all structural types except α-helix, where the RMSECV of the global model (0.144) is a little lower than that of the local model (0.149). Given the number of components used and the lower RMSECV, the iPLS global model came out a little more accurate than the full-spectrum traditional PLS model; however, the RMSEP of the full-spectrum PLS model was better than that of the iPLS local model for most structural types, meaning that the traditional PLS model did better at predicting new samples. For quantitative analysis, iPLS is a valuable tool, especially for identifying the spectral regions that are most significant for each structural motif. However, the trial and error involved in identifying these intervals, coupled with the fact that full-spectrum PLS analysis does better at quantifying proteins in the independent test set, discouraged further use of the iPLS method in this project. A priori knowledge of the spectral regions for each secondary structure may guide spectral variable selection for iPLS; that being said, a look at other quantitative methods was valuable.

Given the satisfactory performance of multivariate analysis tools such as PLS on FTIR protein spectra, and the effectiveness of PLS in handling collinearity,(104) it can be confirmed that the combination of the two is a powerful tool for determining protein secondary structure from spectra, including less common structural types such as 310-helix.



5 Multivariate analysis of pH effect on α-lactalbumin

5.1 Introduction

In this chapter, the PLS model developed for determining protein secondary structure is tested for its ability to quantitatively determine the structure of a denatured protein. Bovine α-lactalbumin (α-LA) is the protein of choice because its denatured state is the best characterized folding intermediate of globular proteins and has been studied by different spectroscopic techniques,(183-185) but there is no comprehensive report of its percentage secondary structure at different pH values.

Backbone hydrogen bonds and side chain hydrophobic interactions are the most important interactions that make up a protein's secondary and tertiary structure;(186) these structures can be disrupted by denaturing conditions. Proteins typically need to be in their native conformations to function properly; therefore, the analysis of their secondary and tertiary structures is important for characterizing protein stability and function.(106) Structural changes can affect protein stability, e.g. in drug formulation, where such changes could cause protein aggregation and affect drug efficacy;(187-189) therefore, the higher order structure (secondary and tertiary structures) must be compared to a reference standard.(106, 190) Structural changes can also result from misfolding of the protein during biogenesis, causing loss of function and diseases such as Alzheimer's disease and systemic amyloidoses.(191, 192)

It has been established that proteins in their native states carry out specific biological functions; understanding what distinguishes a protein's partially folded states from the fully folded protein is critical for understanding protein folding and protein design.(193, 194) These unfolded states can be induced by changing the experimental conditions, such as lowering the pH of the protein solution.(195) Protein folding intermediates typically show a well-preserved secondary structure and a reasonably compact globular structure that, however, lacks the side chain packing found in native structures.(196) This folding state, which has a loose or absent tertiary structure and a native-like secondary structure, is known as the molten globule.(197, 198) The molten globule state has been confirmed to occur in living cells and has been implicated in diseases such as Alzheimer's and Type II diabetes.(197, 199, 200)

As mentioned earlier, the molten globule state of α-lactalbumin (α-LA) is the best characterized folding intermediate of globular proteins. α-LA, a calcium-binding protein found in mammalian milk, has 124 residues, with a native structure divided into two domains;(193) see Figure 5.1. One of the domains (the α-domain) is mostly helical in conformation, while the β-domain is mostly β-sheet.(183) Protein folding and unfolding pathways have been studied by CD and real-time NMR;(201) these studies have probed the structural changes of the pH-dependent transition from the native state at pH 7.0 to the acidic state (A-state) at pH 2.0.(183, 202) However, quantitative structural information on the different unfolding stages has not been well covered.

We have already established, in Chapter 4, that FTIR together with Multivariate Analysis (MVA) methods such as PLS can quantitatively determine the secondary structure percentage of proteins in solution. Multivariate data analysis is likewise a useful tool for extracting quantitative and qualitative information from the spectra of denatured samples.(196)

In this chapter, MVA methods such as Partial Least Squares (PLS) and Principal Component Analysis (PCA) are applied to bovine α-lactalbumin to characterize the conformational changes in secondary structure at each unfolding transition as a function of pH. Troullier et al.(201) have reported that FTIR can differentiate between native and non-native secondary structure present in folding states. We will investigate and compare the secondary structure results for both the native and non-native states of α-LA samples measured with FTIR spectroscopy.

Figure 5.1. The diagram on the left shows the native structure of α-LA (PDB ID: 1HFZ); the intermediate unfolded state is shown in the middle and the unfolded state on the right. The PDB-reported structure of bovine α-LA (1HFZ) is 43% helical structure, 9% β-sheet and 48% other. The Ca2+ ions are shown in black, the blue ribbon represents α-helix, the green ribbon represents 310-helix and the green arrows represent β-sheet.


Bovine α-LA was bought from Sigma-Aldrich and was used without further purification. Solutions of 50 mg/mL α-LA were prepared in deionized H2O; as noted by D'Antonio et al.,(106) protein concentrations higher than 25 mg/mL provide great precision in IR spectra. Protein unfolding states can be generated by modifying the experimental conditions, for example by changing pH, temperature or pressure.(203, 204) In this experiment, the pH was adjusted with dilute HCl to induce conformational changes between pH 7.0 and pH 2.0. Lowering the pH by ~0.5 pH units for each new preparation of α-LA solution gave a total of eleven α-LA samples.

5.3 ATR-FTIR Spectroscopy

All sample spectra were acquired with ATR-FTIR at 32 scans and a resolution of 4 cm-1 for both background and sample measurements, as described in Chapter 2. The overlapping bands in the Amide I region were subjected to second derivative analysis using a Savitzky-Golay algorithm with 5 degrees; this reveals the underlying absorption changes.(201) Second derivatives were also calculated for the Amide II and Amide III regions to provide a consistent basis for analysis.
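A second-derivative Savitzky-Golay filter can be written compactly for a fixed window. The sketch below is only an illustration and rests on an assumption: the text gives the setting as "5 degrees" without stating the window length or polynomial order, so a 5-point window with a quadratic fit is assumed here, for which the standard convolution weights are (2, -1, -2, -1, 2)/7. The function name is hypothetical, and the two points at each end of the spectrum are simply dropped.

```python
def savgol_second_derivative(y, spacing=1.0):
    """Second derivative via a 5-point quadratic Savitzky-Golay filter.

    Returns len(y) - 4 values; edge points without a full window are omitted.
    """
    weights = (2.0, -1.0, -2.0, -1.0, 2.0)  # to be divided by 7 * spacing**2
    result = []
    for i in range(2, len(y) - 2):
        acc = sum(w * y[i + k] for w, k in zip(weights, range(-2, 3)))
        result.append(acc / (7.0 * spacing ** 2))
    return result


# Check on a parabola, whose second derivative is exactly 2 everywhere.
band = [(x - 5.0) ** 2 for x in range(11)]
d2 = savgol_second_derivative(band)
print(d2)
```

With real spectra, `spacing` would be the wavenumber step between adjacent data points, and band maxima in the absorbance spectrum appear as minima in the second derivative.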

5.4 PLS and PCA Modelling Technique

The second derivative spectra of all 11 α-LA samples in the Amide I, Amide II and Amide III regions were fed into the PLS application. The 11 α-LA samples were used as an independent test set; the calibration set of 75 spectra covering the Amide I, II and III regions was extracted from the database used in Chapters 3 and 4. For data processing, in-house Matlab (R2013a) routines for PLS and PCA were developed. An internal validation (cross-validation) was carried out on all spectra, and the number of latent variables chosen from cross-validation was used for the prediction set. Figure 5.2 shows the spectra of α-LA at the different stages of unfolding (arising from changes in pH) across the Amide I, Amide II and Amide III regions. The unfolding of the protein structure can be monitored through changes in the frequencies and intensities of the well-resolved peaks in the FTIR spectra.(205)
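The calibration, cross-validation and prediction workflow described above can be sketched in Python as follows. This is an illustrative sketch rather than the Matlab code developed for this work: it assumes one single-response PLS model per structural fraction, leave-one-out cross-validation (the text specifies only "cross-validation"), and simulated spectra in place of the 75 calibration and 11 test spectra. All function and variable names are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                    # direction of maximum covariance with y
        w = w / np.linalg.norm(w)
        t = Xr @ w                       # scores (latent variable)
        p = Xr.T @ t / (t @ t)           # X loadings
        c = (yr @ t) / (t @ t)           # y loading
        Xr = Xr - np.outer(t, p)         # deflate X and y
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)   # regression vector
    return x_mean, y_mean, beta

def pls1_predict(model, X):
    x_mean, y_mean, beta = model
    return (X - x_mean) @ beta + y_mean

def choose_latent_variables(X, y, max_components):
    """Leave-one-out cross-validation; returns (best k, RMSECV for each k)."""
    n = len(y)
    rmsecv = []
    for k in range(1, max_components + 1):
        errors = []
        for i in range(n):
            keep = np.arange(n) != i
            model = pls1_fit(X[keep], y[keep], k)
            errors.append(pls1_predict(model, X[i:i + 1])[0] - y[i])
        rmsecv.append(float(np.sqrt(np.mean(np.square(errors)))))
    return int(np.argmin(rmsecv)) + 1, rmsecv

# Simulated data (hypothetical): two bands whose heights follow the fractions.
rng = np.random.default_rng(0)
grid = np.linspace(1600.0, 1700.0, 60)
helix_band = np.exp(-((grid - 1656.0) / 10.0) ** 2)
sheet_band = np.exp(-((grid - 1630.0) / 10.0) ** 2)

def simulate(n):
    helix = rng.uniform(0.1, 0.6, n)
    sheet = rng.uniform(0.0, 0.4, n)
    X = np.outer(helix, helix_band) + np.outer(sheet, sheet_band)
    return X + rng.normal(0.0, 0.01, X.shape), helix

X_cal, y_cal = simulate(75)      # calibration set
X_test, y_test = simulate(11)    # independent test set
n_lv, rmsecv = choose_latent_variables(X_cal, y_cal, max_components=6)
y_pred = pls1_predict(pls1_fit(X_cal, y_cal, n_lv), X_test)
rmsep = float(np.sqrt(np.mean((y_pred - y_test) ** 2)))
```

The number of latent variables is fixed from the calibration data alone and then reused unchanged for the test spectra, so the test set plays no part in model selection.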



Figure 5.2. Normalized FTIR spectra of α-LA from pH 7.0 to pH 2.0 covering the spectral region 1900 - 1000 cm-1.

PCA is designed to handle multivariate data, such as FTIR spectra, and does well in

extracting contributing spectral wavenumbers related to structural conformation; this

has already been established in Chapter 3 (PCA Protein Structure Classification) and the

fundamentals of PCA were explained in Chapter 1.

The measured pH-adjusted α-LA spectra were normalized and arranged in matrix form. Each column of the matrix holds the absorbance intensity at one wavenumber and each row holds the spectrum measured at one pH. PCA was then applied to the matrix of α-LA spectra to determine the structural changes as affected by pH. Since one of the roles of PCA is to reduce the dimensionality of the data, thereby eliminating noise, we looked at the number of principal components (PCs) needed to describe the data, because this tells us how much of the variance in the data is accounted for by each component; normally, close to 95 - 100% of the variation is captured in


the first four components, of which the first two may hold 80 - 90%.(206) This means that the information in the data can be studied by looking at the first and second principal components. We also plotted the PCA component scores (the transformed variable values) and the loadings spectra of the principal components, as these help to show how the samples are related to each other.
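The matrix layout and the PCA outputs described above (explained variance, scores and loadings) can be sketched in Python as follows. This is not the Matlab routine used in this work. The sketch assumes that "normalized" means scaling each spectrum to a maximum of 1 and that the matrix is mean-centred before decomposition; the two-band spectra are simulated and the names are hypothetical.

```python
import numpy as np

def pca(spectra, n_components=2):
    """PCA by singular value decomposition of the mean-centred matrix.

    Rows are spectra (one per pH), columns are wavenumbers."""
    centered = spectra - spectra.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)          # variance fraction per PC
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components]                 # one loading spectrum per PC
    return scores, loadings, explained

# Simulated spectra (hypothetical): a native band at 1656 cm-1 is gradually
# replaced by an acid-state band at 1715 cm-1 as the pH falls from 7.0 to 2.0.
wavenumbers = np.arange(1000.0, 1901.0, 1.0)
ph_values = np.arange(7.0, 1.9, -0.5)            # 11 samples: 7.0, 6.5, ..., 2.0
fraction_acid = np.linspace(0.0, 1.0, len(ph_values))
native = np.exp(-((wavenumbers - 1656.0) / 15.0) ** 2)
acid = np.exp(-((wavenumbers - 1715.0) / 15.0) ** 2)
spectra = np.outer(1.0 - fraction_acid, native) + np.outer(fraction_acid, acid)

# Assumed normalization: scale each row so that its maximum intensity is 1.
spectra = spectra / spectra.max(axis=1, keepdims=True)

scores, loadings, explained = pca(spectra, n_components=2)
cumulative = np.cumsum(explained)                # variance held by first k PCs
```

Plotting the first column of `scores` against the second gives a score plot, and plotting each row of `loadings` against `wavenumbers` gives the loading spectra; `cumulative` shows how many components are needed to reach a chosen share of the variance.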


5.5.1 Inspection of the Amide I Spectral Plots

A visual inspection of the normalized spectra (Figure 5.3) shows that the spectra of bovine α-LA at pH 7.0 to 3.5 had the usual Amide I peak at about 1656 cm-1. The intensity of this peak dropped significantly at pH 3.0, while the intensity increased around 1715 cm-1 with the right shoulder shifted to 1740 cm-1. This suggests a decrease in the amount of secondary structure at pH 3.0. In the pH 4.0 spectrum, the right shoulder shifted from 1700 cm-1 to 1750 cm-1 and a slight peak was beginning to develop around 1720 cm-1.

Figure 5.3. The normalized Amide I region of α-LA shows the changes and shifts in spectral peaks as pH decreases from 4.0 to 2.0.
[Plot: normalized intensity against wavenumber (cm-1), 1600 - 1800 cm-1; traces for pH 2.0 to pH 7.0 in steps of 0.5.]



A shift of the Amide I peak to around 1715 cm-1 was seen at pH 2.5 and pH 2.0. This visual inspection of the spectra explains the quantitative secondary structure results: the decrease in β-sheet and the gain in α-helix at pH 3.0, followed at pH 2.5 and pH 2.0 by an increase in β-sheet as α-helix declined (see section 5.5.2 for quantitative results). Figures 5.4a/b and 5.5b/c show the spectra and second derivatives at pH 7.0, pH 2.5 and pH 2.0, and the shift in peaks from 1657 to 1715 cm-1. The plots clearly demonstrate that the structure of the protein changes as the pH drops. The second derivative band at pH 7.0 is in line with the peak of the normalized spectrum, which has a major contribution from α-helix at 1657 cm-1, and the same correspondence holds for the pronounced band around 1715 cm-1 at pH 2.0. The peak at 1630 cm-1, a β-sheet band, is almost non-existent at pH 3.5 and pH 3.0 but becomes more pronounced at pH 2.0; the PLS analysis results in Table 5.1 reflect this.

Figure 5.4a. Plot of the second derivative of the FTIR spectrum of α-LA in the native state at pH 7.0. A peak at 1657-1658 cm-1 is clearly visible. The second derivatives for pH 6.5 and 6.0 (not shown) are very similar.
[Plot: d2A/dv2 against wavenumber (cm-1), 1600 - 1700 cm-1; trace for pH 7.0.]

Figure 5.4b. Plot of the normalized spectra of α-LA at pH 6.0 - 7.0, showing the changes in intensity and frequency at about 1657 cm-1. The spectra at pH 6.5 and pH 6.0 are similar to the spectrum at pH 7.0, which is in the native state.

The spikes seen in the second derivative plot in Figure 5.5a were not due to noise in the spectrum of α-LA at pH 3.0; they are therefore attributed to the secondary structural changes occurring at pH 3.0. The pronounced Amide I peak at 1656 cm-1 seen in Figure 5.5c is also visible in the second derivative spectra at pH 3.5 and 3.0 (Figure 5.5a); again, the noise-like effect seen in Figure 5.5a is due to the conformational changes that were occurring. At pH 2.5 and pH 2.0, peaks have appeared near 1720 cm-1; these can be seen in the plot of Figure 5.5c and in the second derivative plot of Figure 5.5b.

[Figure 5.4b plot: normalized intensity against wavenumber (cm-1), 1600 - 1800 cm-1; traces for pH 6.0, 6.5 and 7.0.]



Figure 5.5a. Well-resolved second derivative spectra of α-LA at pH 3.0 and pH 3.5; the second derivative spectrum shows a peak at about 1657 cm-1. The green spectrum shows changes in intensity.
[Plot: d2A/dv2 against wavenumber (cm-1), 1600 - 1800 cm-1; traces for pH 3.0 and pH 3.5.]

Figure 5.5b. Well-resolved second derivative spectrum of α-LA at pH 2.0. The second derivative spectrum shows a peak at about 1720 cm-1.
[Plot: d2A/dv2 against wavenumber (cm-1), 1600 - 1800 cm-1; traces for pH 2.0 and pH 2.5.]

Figure 5.5c. Plot of the normalized spectra of α-LA at pH 2.0-3.5 and the shifts in the

intensities and frequencies.

5.5.2 Amide I PLS Results

We first investigated changes in the Amide I band because this band contains information on the peptide backbone vibrational frequencies, which depend on hydrogen bonding; the major information on structural changes is therefore contained in this band.(201) First, the secondary structure of bovine α-LA was assessed by comparing the PLS result for the pH 7.0 sample with the native reference values from DSSP. This was done to validate the use of this method in the study of the non-native states of α-LA. From the values presented in Table 5.1, we can see that the structural values at pH 7.0 (α-helix = 0.37, β-sheet = 0.08, 310-helix = 0.10, Turns = 0.17, Other = 0.27) are similar to the reference values at pH 8.0 (α-helix = 0.31, β-sheet = 0.07, 310-helix = 0.14, Turns = 0.18, Other = 0.31). There is about -0.06 difference in α-helix and




-0.01 difference in β-sheet between the two methods (FTIR and X-ray crystallography). The differences for 310-helix, turns and 'other' between the two methods are between 0.01 and 0.04, which is remarkable. The DSSP script was used to calculate the secondary structure assignments of α-lactalbumin from the X-ray crystal 3D structure entry in the PDB. However, the PDB combines α-helix and 310-helix into a single helical motif, and as such the PDB value for bovine α-LA is 43% helical and 9% β-sheet. A combination of the α-helix and 310-helix values from the FTIR results (45%) presented in Table 5.1 is in close agreement with the X-ray crystallography values. This separation of secondary structure values, achieved here with FTIR spectra, cannot be obtained as accurately with other methods such as CD because with CD, β-sheet, turns and disordered structure are hard to separate.(207)
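The differences quoted above can be reproduced directly from the values given in the text. The short Python sketch below takes each residual as the DSSP reference value minus the PLS value at pH 7.0; the dictionary names are hypothetical and the numbers are those quoted for Table 5.1.

```python
# Secondary structure proportions quoted in the text for the Amide I region.
dssp_reference = {"alpha-helix": 0.31, "beta-sheet": 0.07,
                  "3-10 helix": 0.14, "turns": 0.18, "other": 0.31}
pls_ph_7 = {"alpha-helix": 0.37, "beta-sheet": 0.08,
            "3-10 helix": 0.10, "turns": 0.17, "other": 0.27}

# Residual = reference (X-ray/DSSP) value minus predicted (FTIR/PLS) value,
# rounded to the two decimal places used in the tables.
residuals = {name: round(dssp_reference[name] - pls_ph_7[name], 2)
             for name in dssp_reference}

largest_abs_residual = max(abs(value) for value in residuals.values())
```

The α-helix and β-sheet residuals obtained this way match the residual columns of Table 5.1 for pH 7.0, and the remaining three fall between 0.01 and 0.04.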

The PLS analysis for the Amide I region also shows that the sample at pH 6.5 has a native-like structure (α-helix = 0.34, β-sheet = 0.08, 310-helix = 0.12, Turns = 0.16, Other = 0.32), which is close to the DSSP reference structure of α-LA. At these pH values (pH 7.0 and pH 6.5), there was no visible formation of non-native structures compared to the native reference protein.

Given the above results, PLS was applied to the FTIR spectra of α-LA at the other pH values. The PLS results show that there were changes in the structure of α-lactalbumin as the pH was lowered. There was an increase in α-helix content for pH 6.0 through pH 3.0 (as shown in Table 5.1), especially at pH 3.0 (α-helix = 0.44, β-sheet = 0.04, 310-helix = 0.15, Turns = 0.18, Other = 0.25). β-sheet was significantly lower at pH 3.0 and then increased significantly at pH 2.5 and pH 2.0 while α-helix decreased; a decrease in disordered structure (Other) was also noticeable at pH 3.0 (see Table 5.1). It can also be seen that there was an increase in α-helix and a decrease in β-sheet at pH 5.5, but not as significant as that at pH 3.0. Overall, from the Amide I spectra measured in this study, there were some changes in the α-helix content between pH 6.0 and 3.5, while β-sheet increased slowly and was lower at pH 3.0 before increasing drastically; this agrees well with the findings of Kim and Baum,(184) who stated that the denaturation profile of α-LA occurred in the pH range of 3.0 to 5.0. The figures remained stable between pH 4.0 and pH 5.0. The proportions of the secondary structure of the analysed α-LA are presented in Table 5.1 and Figure 5.6. These secondary structural changes at pH 2.5 and pH 2.0 are also evident in Figure 5.5c and in the FTIR spectra of Figure 5.3.

Table 5.1. PLS prediction results for the secondary structure content of the Amide I region of α-lactalbumin at different pH levels using normalized and 2nd derivative pre-processed data. Structural values are presented as proportions; the DSSP reference values were calculated from the 1HFZ (pH 8.0) file using the DSSP script.
Par β-Sheet denotes parallel β-sheet; Anti-par β-Sheet denotes anti-parallel β-sheet. Residual A and Residual B are the α-helix and β-sheet residuals (DSSP value minus PLS value).

Amide I
pH    α-Helix   Residual  β-Sheet   Residual  Par       Anti-par  Mixed     310-Helix Turns     Other
                A                   B         β-Sheet   β-Sheet   β-Sheet
2.0   0.30      0.01      0.15      -0.08     0.01      0.08      0.00      0.14      0.14      0.29
2.5   0.31      0.00      0.13      -0.06     0.01      0.06      0.00      0.14      0.15      0.30
3.0   0.44      -0.13     0.04      0.03      0.02      -0.04     0.00      0.15      0.18      0.25
3.5   0.41      -0.10     0.10      -0.03     0.02      0.09      0.00      0.08      0.15      0.29
4.0   0.38      -0.07     0.10      -0.03     0.02      0.08      0.00      0.09      0.17      0.30
4.5   0.39      -0.08     0.10      -0.03     0.01      0.10      0.00      0.10      0.16      0.28
5.0   0.38      -0.07     0.11      -0.04     0.01      0.10      0.00      0.10      0.16      0.28
5.5   0.42      -0.11     0.07      0.00      0.01      0.06      0.00      0.10      0.17      0.27
6.0   0.41      -0.10     0.08      -0.01     0.01      0.05      0.00      0.09      0.17      0.29
6.5   0.34      -0.03     0.08      -0.01     0.00      0.05      0.00      0.12      0.16      0.32
7.0   0.37      -0.06     0.08      -0.01     0.01      0.08      0.00      0.10      0.17      0.27
DSSP
8.0   0.31                0.07                0.00      0.07      0.00      0.14      0.18      0.31



Figure 5.6. Changes in the secondary structure of α-LA as a function of pH, based on PLS analysis. The horizontal axis represents the pH values and the vertical axis represents the secondary structure proportions. Below pH 3.0, a gain in β-sheet and a loss in α-helix are seen. The error bars were calculated from the PLS residual for each spectrum (the observed X-ray value minus the calculated FTIR value).

5.5.3 Inspection of the Amide II Spectral Plots

In the Amide II region we found the main peak of the native form of α-LA at 1546 cm-1, as depicted in Figure 5.7, and a peak of the A-state (pH 2.0) at 1521 cm-1. Although there appeared to be structural changes in the spectral plot, the change in intensities that was seen in the Amide I region was not observed in the Amide II region; however, a shift in frequencies was noticeable, especially for pH 2.5 and pH 2.0. The spectrum at pH 2.5 developed a noticeable structural change but maintained a peak at 1546 cm-1.



Figure 5.7. Plot of the normalized spectra of α-LA at pH 2.0 - 7.0. The plot shows peak shifts and structural changes, especially for pH 2.5 and pH 2.0.

5.5.4 Amide II PLS Results

For the Amide II results, the PLS result for pH 7.0 was again compared with the known structural values from X-ray crystallography (PDB code 1HFZ) calculated by DSSP, to validate the method as fit to study the other pH values in this region (1480-1575 cm-1). The secondary structure figures for pH 7.0 (α-helix = 0.36, β-sheet = 0.10, 310-helix = 0.11, Turns = 0.16, disorder = 0.30) are very similar to the DSSP values (α-helix = 0.31, β-sheet = 0.07, 310-helix = 0.14, Turns = 0.18, disorder = 0.31), with residuals of -0.05 and -0.03 for α-helix and β-sheet respectively. Although there were structural fluctuations from one pH to another, a noticeable structural similarity was observed between pH 4.5 (α-helix = 0.45, β-sheet = 0.05, 310-helix = 0.11, Turns = 0.16, Other = 0.25) and pH 5.0 (α-helix = 0.45, β-sheet = 0.05, 310-helix = 0.11,




Turns = 0.16, disorder = 0.25); a small decrease in disordered structure was also observed at these pH levels (pH 4.5 and pH 5.0). Again, comparing these values against those for the known structure from DSSP (α-helix = 0.31, β-sheet = 0.07, 310-helix = 0.14, Turns = 0.18, disorder = 0.31), we see that the α-helix values at these pH levels were 0.14 higher than the α-helix value of the reference protein. The β-sheet value was reduced at pH 5.0 and pH 4.5, but it increased drastically below pH 3.5 and had more than doubled at pH 2.0, while a reduction in α-helix was seen at these acidic states. These results are listed in Table 5.2.

Table 5.2. PLS prediction results for the secondary structure of the Amide II region of α-LA at different pH levels using normalized and 2nd derivative pre-processed data. Structural values are presented as proportions.
Par β-Sheet denotes parallel β-sheet; Anti-par β-Sheet denotes anti-parallel β-sheet. The two residual columns (unlabelled in the original header) are the α-helix and β-sheet residuals (DSSP value minus PLS value), as in Table 5.1.

Amide II
pH    α-Helix   Residual  β-Sheet   Residual  Par       Anti-par  Mixed     310-Helix Turns     Other
                A                   B         β-Sheet   β-Sheet   β-Sheet
2.0   0.25      0.06      0.19      -0.12     0.01      0.16      0.00      0.11      0.18      0.32
2.5   0.32      -0.01     0.13      -0.06     0.00      0.12      0.00      0.08      0.16      0.32
3.0   0.40      -0.09     0.10      -0.03     0.02      0.07      0.00      0.08      0.14      0.28
3.5   0.42      -0.11     0.07      0.00      0.02      0.05      0.00      0.09      0.16      0.29
4.0   0.37      -0.06     0.12      -0.05     0.00      0.10      0.01      0.10      0.15      0.27
4.5   0.45      -0.14     0.05      0.02      0.02      0.02      0.00      0.11      0.16      0.25
5.0   0.45      -0.14     0.05      0.02      0.01      0.03      0.00      0.11      0.16      0.25
5.5   0.43      -0.12     0.06      0.01      0.02      0.04      0.00      0.11      0.16      0.26
6.0   0.33      -0.02     0.12      -0.05     0.00      0.10      0.00      0.10      0.14      0.33
6.5   0.26      0.05      0.14      -0.07     0.02      0.13      0.01      0.12      0.13      0.32
7.0   0.36      -0.05     0.10      -0.03     0.01      0.08      0.01      0.11      0.16      0.30
DSSP
8.0   0.31                0.07                0.00      0.07      0.00      0.14      0.18      0.31

5.5.5 Inspection of the Amide III Spectral Plots

Denaturation of the protein produced some alteration in the Amide III band of the α-lactalbumin spectra, as shown in Figure 5.8 and Figure 5.9. It can be observed that the peaks of the



spectra at pH 3.5 through pH 2.0 registered at 1278 cm-1, but pH 3.5 and pH 3.0 were beginning to show alterations at the shoulders. These subtle changes were picked up by the PLS quantitative results in Table 5.3 for α-helix and β-sheet. Amide III bands in the 1300, 1260 and 1235 cm-1 regions have been assigned to α-helix, disordered and β-sheet structures respectively.(208) The band normally seen at 1250 cm-1 for the Amide III peak appeared at 1278 cm-1; this means that the pH-dependent structural changes are not local to the Amide I region alone but also affect the Amide III region. Changes in peak values at about the same positions in the Amide III region have also been reported by Anderle and Mendelsohn(209) in their work on denatured proteins.

Figure 5.8. Plot of the effect of pH 3.5 (intermediate) to pH 2.0 on the normalized Amide III spectral region of α-LA. The plot shows slight changes at these pH levels. The different spectral colours represent different pH levels.

Amide III bands for pH 4.0 - pH 5.5 had their main peak at 1250 cm-1. In addition to the major peak at 1250 cm-1, which correlates to the β-sheet region of Amide III, pH 5.0 and pH 5.5 had significant peaks of reduced intensity at 1290 cm-1 and 1300 cm-1




respectively; these positions are in agreement with the β-turn and α-helix regions for Amide III (α-helix: 1330-1295 cm-1, β-sheets: 1250-1220 cm-1, β-turns: 1295-1270 cm-1, random coils: 1270-1250 cm-1) reported by Cai and Singh(39) in their study of infrared bands for secondary structure. The spectra at these pH values were not as smooth as the spectrum of native-state α-LA (pH 7.0); this is because of the many spectral features present in the Amide III region.
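The band assignment used in this section amounts to a lookup of each peak position in the Amide III windows quoted from Cai and Singh. A minimal Python sketch of that lookup is given below; the function name and the treatment of the shared window edges are assumptions made here, not part of the cited work.

```python
# Amide III windows (cm-1) as quoted in the text from Cai and Singh.
AMIDE_III_WINDOWS = [
    ("beta-sheet", 1220.0, 1250.0),
    ("random coil", 1250.0, 1270.0),
    ("beta-turn", 1270.0, 1295.0),
    ("alpha-helix", 1295.0, 1330.0),
]

def assign_amide_iii(wavenumber):
    """Return the structure whose window contains the wavenumber, else None.

    Assumption: adjacent windows share an edge, so each window is taken as
    open at its lower edge and closed at its upper edge; the lowest window
    also includes its lower edge.
    """
    for index, (name, low, high) in enumerate(AMIDE_III_WINDOWS):
        above_low = wavenumber >= low if index == 0 else wavenumber > low
        if above_low and wavenumber <= high:
            return name
    return None

# Peak positions mentioned in this section.
peaks = [1250, 1256, 1278, 1290, 1300]
assignments = {peak: assign_amide_iii(peak) for peak in peaks}
```

With this edge convention the 1250 cm-1 peak falls in the β-sheet window, as stated in the text, while 1278 and 1290 cm-1 fall in the β-turn window and 1300 cm-1 in the α-helix window.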

Figure 5.9. Plot of the Amide III region showing the peaks for pH 5.5 - pH 4.0 at 1250 cm-1 and for pH 4.0 at 1279 cm-1.

At pH 7.0 and 6.5, the spectral peak was at 1256 cm-1, as shown by the plot of Figure 5.10, in good agreement with the literature report for the Amide III region peak.(210) At pH 6.0 a peak had developed at 1240 cm-1, and this makes the transition to the 1250 cm-1 peak seen at pH 5.5. The Amide III PLS results obtained in this experiment are close to the X-ray crystallography values for α-lactalbumin.

[Figure 5.9 plot: normalized intensity against wavenumber (cm-1), 1200 - 1350 cm-1; traces for pH 4.0, 4.5, 5.0 and 5.5.]

Figure 5.10. Plot of the Amide III region showing the peak for pH 7.0 (native protein) at 1256 cm-1.

5.5.6 Amide III PLS Results

The Amide I and Amide II regions of FTIR spectra are the two regions most commonly used in the determination of protein secondary structure. The Amide III region (1200-1350 cm-1) is often neglected due to its relative weakness, yet this region offers some advantages for IR study because the range of vibrational frequencies for secondary structure in the Amide III region is somewhat greater than in Amide I or Amide II.(209) Moreover, the water vapour absorption overlap found in the Amide I region is not an issue in the Amide III region. PLS results for the native and unfolding spectra in this study are given in Table 5.3. At pH 7.0, pH 6.5 and especially at pH 6.0, the value of α-helix increased compared to its DSSP value (see Table 5.3). β-sheet increased tremendously at pH 5.0, 3.0, 2.5 and pH 2.0, while α-helix at these pH values was reduced from the values seen at pH 6.0, pH 6.5 and pH 7.0. β-sheet contributions decreased at pH 4.0 and pH 3.5; an increase in β-sheet formation was seen as the pH reached the A-state (pH 2.0). The spectrum at pH 7.0 has the closest values to those of the reference protein,




showing that it is possible to determine the secondary structures of proteins from the

Amide III region of the native protein spectrum (see Table 5.3).

Table 5.3. PLS prediction results for the secondary structure content of the Amide III region of α-lactalbumin at different pH levels using normalized and 2nd derivative pre-processed data. Structural values are presented as proportions.
Par β-Sheet denotes parallel β-sheet; Anti-par β-Sheet denotes anti-parallel β-sheet.

Amide III
pH    α-Helix   β-Sheet   Par       Anti-par  Mixed     310-Helix Turns     Other
                          β-Sheet   β-Sheet   β-Sheet
2.0   0.34      0.19      0.03      0.15      0.00      0.09      0.13      0.29
2.5   0.34      0.19      0.03      0.15      0.00      0.09      0.14      0.29
3.0   0.23      0.23      0.00      0.21      0.00      0.09      0.12      0.32
3.5   0.36      0.18      0.02      0.15      0.00      0.10      0.14      0.29
4.0   0.36      0.15      0.00      0.13      0.01      0.14      0.16      0.30
4.5   0.28      0.17      0.00      0.17      0.00      0.10      0.16      0.31
5.0   0.25      0.20      0.02      0.17      0.00      0.09      0.16      0.28
5.5   0.34      0.12      0.01      0.12      0.01      0.13      0.18      0.24
6.0   0.54      -0.11     -0.01     -0.02     -0.01     0.13      0.21      0.17
6.5   0.48      0.06      -0.02     0.07      -0.01     0.13      0.18      0.20
7.0   0.40      0.08      0.01      0.08      0.01      0.14      0.17      0.26
DSSP
8.0   0.31      0.07      0.00      0.07      0.00      0.14      0.18      0.31

5.5.7 PCA of α-Lactalbumin Spectra

PCA did well in describing the changes in the secondary structure of α-LA during the unfolding process; the PCA was performed on the Amide I through Amide III regions. PC1 captured close to 95% of the total Amide I absorbance and the PC2 component captured about 5%, while PC3 captured 3% of the remaining variance not described by PC1 and PC2. This means that the variation in the data is well distributed and that noise distortion was not an issue.

The principal component score plot of Figure 5.11 shows the classification of the α-LA spectra based on pH. From this plot it can be seen that between pH 7.0 and pH 4.5, α-LA is close to its native structure, as depicted on the left of PC1, while the acidic lower-pH structures lie on the right of the plot. PC2 places the spectra with the most



structural shifts at the top (pH 4.0 to pH 3.0). The score plot shows a gradual structural shift from pH 5.5 to pH 4.5 and then a fast shift to pH 3.5, pH 3.0 and pH 2.5 before slowing down; this suggests a conformational change in the process of unfolding.(211) This also supports the stable Amide I intermediate values obtained from PLS for pH 5.0 through pH 4.0 (see Table 5.1). A clear three-state transition can be seen in the PCA score plot of Figure 5.11: the native (N) state on the left of PC1, the unfolded stable (U) state on the right of PC1 and the intermediate (I) state at the top, clearly separated by PC2.

Figure 5.11. Score plot of α-LA at different pH on the first and second principal components. The score plot shows the classification of α-LA spectra based on the pH levels. PC1 has the more native-state spectra on the left and the acidic-state spectra on the right. PC2 separates the intermediates at the top.

The loadings plots presented in Figure 5.12 and Figure 5.13 enabled us to classify the absorbance changes into ranges attributed to the three states. Using the PCA approach, it was possible to detect the changes that are directly correlated with the Amide I, Amide II and Amide III regions of the FTIR spectra.(212) These are indicated by the three positive PC1 peaks around 1656 cm-1, 1545 cm-1 and 1280 cm-1 respectively. The




values are in good agreement with literature reports, where assignments for the three

bands for different proteins are presented.(210, 212, 213)

Figure 5.12. Loading plot showing the most significant wavenumbers that contribute to the spectral discrimination seen in the score plot in Figure 5.11.
[Plot: loadings for PC2 against loadings for PC1; labelled wavenumbers 1201, 1279, 1397, 1479, 1545, 1581, 1651 and 1713 cm-1.]

Figure 5.13. Loading spectra of PC1 and PC2 showing the most significant wavenumbers that contribute to the spectral discrimination seen in the score plot in Figure 5.11. Both plots of Figure 5.12 and 5.13 were combined for the analysis, but can be used separately.
[Plot: loading intensity against wavenumber (cm-1), 1200 - 1800 cm-1, for PC1 and PC2; labelled wavenumbers include 1212, 1221, 1272, 1280, 1321, 1324, 1369, 1388, 1397, 1418, 1443, 1490, 1545, 1581, 1650, 1656, 1693, 1713, 1716 and 1768 cm-1.]

As can be seen in Figure 5.13, PC1 showed 1651 cm-1 from the Amide I region and 1546 cm-1 from the Amide II region as significant marker bands with negative intensities; these marker bands contribute strongly to pH 7.0 down to pH 4.0. The band at 1713 cm-1 on the positive side is significant for pH 3.0 through to pH 2.0. PC2 has also picked up these bands, at 1656 cm-1 for Amide I, 1546 cm-1 for Amide II and 1709 cm-1 respectively. The peaks at 1693 and 1709 cm-1 are assigned to antiparallel β-sheets normally seen with aggregated protein.(214) These can also be seen in the Amide II normalized spectral plots shown in Figure 5.7 and 5.8.

Another report gives the secondary structure frequency windows for Amide III as: α-helix, 1328-1289 cm-1; unordered, 1288-1256 cm-1; and β-sheets, 1255-1224 cm-1, as stated by Fen-Nifu et al.(210) in their study of Amide III FTIR spectra. PC1 of the loading plot picked up 1280 cm-1, while PC2 highlighted 1273 cm-1 as the most significant peak for the denatured spectra of pH 3.5 to pH 2.0. These wavenumbers fall within the unordered frequency window listed above. Peaks seen at 1335 cm-1 (PC1) and 1335 cm-1 (PC2) in Figure 5.13 are attributed to α-helix structure based on comparison with literature values.(210)

5.5.8 Comparison with other methods

The comparison here is not with other quantitative analysis methods, since this level of secondary structure detail for denatured lactalbumin has not been previously reported. The comparison is instead made with other experimental methods. Arai and Kuwajima,(215) in their CD and NMR study of the formation of the molten globule in bovine α-LA, held the same view as Alexandrescu et al.,(216) that α-LA has a similar amount of secondary structure in its intermediate molten globule state as in the native state; however, these studies did not report the exact amount of each secondary structure motif at each stage of the transition. The analytical tools developed in this project now allow such quantitative information to be obtained.

Paci et al.(24) described the molten globule state of α-LA as being stable under acidic conditions (pH 2). In their CD study of the partially unfolded state of α-LA, they found that α-helix is the best preserved secondary structure, as did this study. The Amide I result from this study found that α-helix and 310-helix retained values similar to their structural values from X-ray crystallography, but β-sheet increased tremendously in the acidic states. Rosner and Redfield also found, in their NMR study of α-LA at both pH 2.0 and pH 7.0, that helical structure was more resistant to unfolding at pH 2.0 than at pH 7.0.(217)

Bramaud et al.(218) stated that α-LA calcium binding is pH dependent, especially below pH 4.0; its isoelectric point is between pH 4.2 and pH 4.5, where the hydrophobicity increases (leading to lower solubility).(219) Lefevre and Subirade(220) described this process as the breakage of intramolecular bonds of the native structure with the exposure of the initially hidden hydrophobic core, which later causes intramolecular antiparallel β-sheet formation in an attempt by the protein molecules to hide their hydrophobic groups from water; this condition is often seen with the denaturation of proteins. This could explain why, in this study, there is a decrease in β-sheet at pH 3.0, where the calcium-free (apo) form is exhibited. α-LA has been shown to acquire an apo-like conformation at around pH 4.(202, 219) It is also important to point out that the proportion of disordered structure at pH 3.0 was reduced compared to that at pH 4.0. There was, however, a stable structure from pH 5.0 to pH 4.0.



Nozaka et al.(221) reported a molten globule from denaturant-induced unfolding of α-LA at near-neutral pH as well as at acidic pH. From the multivariate analysis results for the Amide I region shown in Table 5.1, the secondary structure values at pH 6.5 are the closest to the DSSP native reference values; the values at pH 2.5 and pH 2.0 cannot be considered native-like because of the raised β-sheet values.

5.6 Conclusion

FTIR spectra of α-lactalbumin treated with HCl allowed us to study the pH-dependent denatured states of α-LA; this study was done by means of Multivariate Analysis (MVA). The results have shown that MVA is a suitable tool for both qualitative and quantitative analysis of protein unfolding.

Exploratory analysis and supervised classification, using Principal Components Analysis (PCA) and Partial Least Squares (PLS), revealed that FTIR spectroscopy is able to correctly discriminate between native-like and non-native states of proteins. PCA, a form of MVA, was able to extract the wavenumber contributions for the different transitional states of α-LA unfolding, and PLS captured, quantitatively, the values of the structural conformations of the protein during the unfolding process. FTIR spectroscopy, together with MVA, is increasingly becoming an important method to determine the secondary structure of proteins.(157) Although CD spectroscopy has been used extensively in the study of denatured proteins, some motifs such as β-sheet, β-turns and coils overlap and have relatively small signals, and the CD signal is dominated by the α-helix component. This makes interpretation difficult, and CD is therefore prone to error.(201, 222)



It can be seen from the spectral plots presented in Figures 5.4, 5.6 and 5.7 that there were peak and shoulder shifts and intensity shifts from one pH value to another. These losses and gains of secondary structure show that there was an intermediate during unfolding.(196, 223) A three-state transition (native N, intermediate I, unfolded U) can be seen in these plots and in the PLS results of Tables 5.1, 5.2 and 5.3. Whereas α-helix values fluctuated a little, perhaps being sensitive to the denaturing solvent, β-sheet values decreased significantly at pH 3.0 (for Amide I); from all indications, there was a change in tertiary structure on unfolding at these pH levels.

The secondary structure values of α-LA in the native-like state predicted by PLS showed strong agreement with the secondary structure values from X-ray crystallography. This chapter also describes, in Table 5.4, the identification of the Amide I, Amide II and Amide III bands of native and denatured α-lactalbumin obtained from our qualitative analysis.

Among all the spectral regions arising from amide bond modes, the Amide I and Amide III bands have been reported to be the most sensitive to variations in secondary structure folding.(39) The Amide I results from the PLS quantitative analysis in Table 5.1 were more meaningful because they distinguished the three-state transition better than the Amide II and Amide III results.

Overall, the results presented in this chapter provide strong evidence that the percentage of each of these secondary structures can be derived at each pH level, and that α-LA undergoes structural changes as the pH is reduced towards the acidic states. PCA shows the spectroscopic marker bands that differentiate between native and non-native proteins. The PCA of the FTIR spectra showed that the three-state transition of α-LA is the result of pH-dependent conformational changes, and PCA was able to detect the structural changes of this pH-dependent transition.

There is evidence that several amyloid diseases are caused by prefibrillar intermediates; FTIR spectroscopy is a very powerful tool which can be applied to dilute proteins as well as to aggregated proteins to reveal their molecular details.(186) FTIR measurements were sensitive to the changes associated with the formation of the A-state of α-LA,(224) and FTIR spectral classification and elucidation of protein secondary structure by Multivariate Analysis (MVA) can be used to monitor protein folding/unfolding.



6 Limit of Quantitation of Protein Structure Using PLS

6.1 Introduction

Limit of Quantitation (LoQ) is a term used to describe the lowest concentration of a

sample at which a reliable quantitative result can be measured or detected by an

analytical procedure.(225) This is not to be confused with Limit of Detection (LoD) in

which the measurement focuses on the lowest quantity of a substance that can be

distinguished from the absence of that substance or blank. In the former (LoQ), the

focus is on the quantification method which we have established in this project, to be

Partial Least Squares (PLS); in the latter (LoD), the focus is on the instrumentation used

for the measurement.

Using Multivariate Regression (MVR) methods like Partial Least Squares (PLS)

together with FTIR is becoming a more accepted way of analysing protein secondary

structure in native and denatured states. The goal of this study is to investigate the

potential use of FTIR with MVR method for the elucidation of secondary structure from

low protein concentrations. Some pharmaceutical proteins are formulated at relatively

high concentrations, greater than 25 mg/mL, when using H2O as a solvent; this provides

good precision and reproducible, quantifiable results;(106) however, 1 – 5 mg/mL has

been reported as the low limit for ATR-FTIR with D2O as a buffer.(226) Many

analytical techniques are not able to deal with very high protein concentrations (50

mg/mL), whereas in FTIR the higher concentrations are actually beneficial to the

analysis.(227) However, for forensic studies where detection of substances in low

concentrations is important(228) (e.g. drug test in urine sample), methods with a low

Limit of Quantitation (LoQ) are needed; for example, illicit drug use has made it


necessary to develop a fast and reliable method for identifying and quantifying

substances in low concentrations.(229)

Six different proteins in H2O solution were measured from a high concentration of

200 mg/mL down to a low concentration of 0.19 mg/mL and were analysed for their

secondary structure components using the PLS tools described in Chapter 4. These

proteins were chosen for their availability and their well-characterized structural and

spectral properties. Three were all-α proteins: BSA (PDB ID: 3V03), which helps to maintain

the osmotic pressure and pH of blood;(230-232) fibrinogen (PDB ID: 3GHG), which is essential

for blood clot formation;(233) and lysozyme (PDB ID: 2LYZ), which can break down bacterial

cell walls and is used for food preservation.(234, 235) Three were all-β proteins: IgG

(PDB ID: 3AGV), which is used for biologic therapies;(236) β-lactoglobulin (PDB ID: 1BEB), a

major protein component of mammalian milk and whey;(237) and ubiquitin (PDB ID: 2ZCB),

which plays a vital role in protein degradation.(238) These proteins, especially

lysozyme and BSA, are used for laboratory testing of different techniques and standards.

ATR-FTIR measurements of the lowest concentrations for these protein samples were

conducted to test the method used for quantification; therefore, measurements of each

sample were collected until the spectrum of the sample was difficult to detect on the

instrument screen. The objective of this exercise was to test the ability of PLS to

determine protein structure from sample spectra at low concentration and to determine

the lowest concentration at which these results were still reliable; hence the test for LoQ.




Bovine serum albumin (BSA), β-lactoglobulin, fibrinogen, IgG (Human), lysozyme

(hen egg white), and ubiquitin were obtained from Sigma-Aldrich, and were used

without further purification. Each protein was dissolved in H2O to a concentration of

200 mg/mL, and each stock solution was sequentially diluted by 50% until the spectral

intensities of the samples were too weak to measure further. Spectra were therefore

acquired at 200 mg/mL, 100 mg/mL, 50 mg/mL, 25 mg/mL, 12.50 mg/mL, 6.25 mg/mL,

3.13 mg/mL, 1.56 mg/mL, 0.78 mg/mL, 0.39 mg/mL and 0.19 mg/mL, giving a

total of 11 concentration samples per protein.
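
The dilution scheme can be reproduced with a short calculation. The following Python sketch is illustrative only: the function name `dilution_series` and its parameters are hypothetical and are not part of the procedure used in this work. It halves the 200 mg/mL stock repeatedly to give the 11 concentrations; the exact values 3.125 mg/mL and 0.1953125 mg/mL are the ones quoted in the text as 3.13 mg/mL and 0.19 mg/mL.

```python
def dilution_series(stock_mg_per_ml, n_samples, factor=0.5):
    """Return n_samples concentrations, starting at the stock and
    multiplying by `factor` (0.5 = a 50% dilution) at each step."""
    concentrations = []
    current = stock_mg_per_ml
    for _ in range(n_samples):
        concentrations.append(current)
        current *= factor
    return concentrations


# 11 samples per protein, starting from the 200 mg/mL stock solution
series = dilution_series(200.0, 11)
for concentration in series:
    print(f"{concentration:.4f} mg/mL")
```

With six proteins, the same series gives the 66 samples analysed in this chapter.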

6.3 FTIR Spectroscopy

All sample spectra were acquired with ATR-FTIR at a resolution of 4 cm-1, collected

over the range from 900 cm-1 to 4000 cm-1; water subtraction and baseline correction

were done as described in Chapter 2 for sample measurements. Chapter 2 (section 2.3)

also covers the exact instrument and spectral measurements used for all proteins in this

thesis. Several of the protein spectra at concentrations below 0.19 mg/mL were very

weak, very noisy and sometimes bands could not be detected; therefore, in the interests

of consistency, no spectra acquired at concentrations below 0.19 mg/mL were used.

There were 11 spectra per protein, giving 66 samples for the six measured

proteins shown in Figure 6.1. Water absorption was subtracted as described in Chapter

2 and the spectra were subjected to second-derivative processing using the Savitzky-Golay

algorithm with 5 degrees to resolve spectral bands in the Amide I region, which was used

for this analysis. The quantitative analysis in Chapter 4 showed that the second derivative

can be used to effectively enhance the analysis of the secondary structure of proteins from

their FTIR spectra.
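
The second-derivative step can be illustrated with a minimal sketch. The window width and polynomial order used here (a 5-point window with a quadratic local fit, for which the Savitzky-Golay second-derivative weights are (2, -1, -2, -1, 2)/7) are assumptions made for illustration; the text specifies only the Savitzky-Golay algorithm "with 5 degrees", and this is not the software used in this work. The function name and the synthetic parabola are hypothetical.

```python
# Savitzky-Golay second-derivative weights for a 5-point window and a
# quadratic (or cubic) local polynomial. ASSUMPTION: the thesis does not
# state the window width or polynomial order in these terms.
SG5_SECOND_DERIV = (2.0, -1.0, -2.0, -1.0, 2.0)
SG5_NORM = 7.0


def second_derivative(absorbance, spacing):
    """Second derivative of evenly spaced data.

    `spacing` is the wavenumber step in cm-1. The two points at each
    end have no full window and are returned as None.
    """
    n = len(absorbance)
    result = [None] * n
    for i in range(2, n - 2):
        window = absorbance[i - 2:i + 3]
        weighted = sum(w * a for w, a in zip(SG5_SECOND_DERIV, window))
        result[i] = weighted / (SG5_NORM * spacing ** 2)
    return result


# Synthetic check: a parabola A = (v - 1610)^2 has d2A/dv2 = 2 everywhere.
wavenumbers = [1600.0 + 2.0 * k for k in range(11)]
absorbance = [(v - 1610.0) ** 2 for v in wavenumbers]
d2 = second_derivative(absorbance, spacing=2.0)
```

In a real Amide I spectrum, absorption maxima appear as minima (negative peaks) in the second derivative, which is why overlapping bands become easier to separate.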



Figure 6.1. The raw spectra of six proteins (β-Lactoglobulin, BSA, fibrinogen, IgG, lysozyme,

and ubiquitin) acquired at 200, 100, 50, 25, 12.5, 6.25, 3.13, 1.56, 0.78, 0.39, 0.19 mg/mL.

6.4 PLS Model

As mentioned in section 6.3, the Amide I spectral region was used to investigate the

secondary structure of proteins from these spectra at different concentrations.

The Amide I regions of the measured ATR-FTIR spectra were normalized and subjected

to a second derivative before PLS analysis. The 66 samples were used as

an independent test set in ‘matrix X’ for this analysis. The fraction of each secondary

structure motif for each of the reference proteins in ‘matrix Y’ was derived from X-ray

data using the DSSP script. The already established models from the previous analysis

described in Chapter 4, which contain 84 protein spectra in the calibration set, were

used for structural determination at the different protein concentrations. This model

satisfies the condition described by eq. (1.3) in Chapter 1. The PLS models were

validated, as specified in Chapter 1, using 10 folds (segments). Again, this

algorithm aims to maximize the association between the spectra of the proteins

in matrix X and their DSSP values in matrix Y while minimizing E (error). The

principles underlying the PLS technique were described in Chapter 1.
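
The idea of relating a matrix X of spectra to a response in Y through a small number of latent variables can be sketched in a few lines. The code below is a minimal, assumed implementation of PLS1 (a single response, for example the α-helix fraction) using the NIPALS algorithm; it is not the software or the calibrated model used in this thesis, and the function names `pls1_fit` and `pls1_predict` as well as the toy data are hypothetical.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with NIPALS (illustrative sketch)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X
    f = [value - y_mean for value in y]                        # centred y
    weights, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y
        w = [_dot([E[i][j] for i in range(n)], f) for j in range(m)]
        norm = _dot(w, w) ** 0.5
        w = [value / norm for value in w]
        t = [_dot(row, w) for row in E]                        # scores
        tt = _dot(t, t)
        p = [_dot([E[i][j] for i in range(n)], t) / tt for j in range(m)]
        q = _dot(f, t) / tt
        # Deflate X and y so the next component explains what is left
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        weights.append(w)
        loadings.append(p)
        y_loadings.append(q)
    return {"x_mean": x_mean, "y_mean": y_mean,
            "W": weights, "P": loadings, "Q": y_loadings}


def pls1_predict(model, x):
    """Predict the response for one new (uncentred) sample."""
    e = [value - mean for value, mean in zip(x, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p, q in zip(model["W"], model["P"], model["Q"]):
        t = _dot(e, w)
        y_hat += q * t
        e = [ev - t * pv for ev, pv in zip(e, p)]
    return y_hat


# Toy data (not spectra): y depends linearly on two variables.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
y = [2.0 * a - b + 1.0 for a, b in X]
model = pls1_fit(X, y, n_components=2)
prediction = pls1_predict(model, [0.0, 0.0])
```

With real spectra the number of components is chosen by cross-validation rather than fixed in advance, because adding components improves the fit to the calibration set while risking poorer prediction of new samples.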

6.5 Results and discussion

High quality FTIR spectra of proteins are usually obtained using high protein

concentrations. Dong, et al. (136) reported that a protein concentration higher than

12 mg/mL is required to obtain good quality infrared spectra, and D’Antonio, et

al.(106) also reported that concentrations of 25 mg/mL or higher provide good precision

and reproducibility. In Figure 6.1, it can be seen that the quality of the spectrum

decreases as the concentration decreases, as expected, and this, as Dong, et al. argued in

their study, has made FTIR spectroscopy less useful for the study of proteins at

low concentration. It also makes it difficult to compare spectra measured by FTIR and

circular dichroism. With Multivariate Regression (MVR) analysis such as

PLS, normalization allows the contributions of weaker bands to be as significant as those

of stronger bands. Spectral data points were scaled to values between 0 and

1 for all spectra, making them compatible with each other and useful for multivariate

analysis.(88)
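
The effect of scaling each spectrum to the range 0 to 1 can be shown with a small sketch. The min-max formula below is an assumption about how the scaling was carried out, and the function name and the two five-point "bands" are hypothetical; the point is only that a weak band and a strong band of the same shape become identical after scaling.

```python
def min_max_scale(spectrum):
    """Scale a list of intensities linearly so that min -> 0 and max -> 1."""
    lowest, highest = min(spectrum), max(spectrum)
    if highest == lowest:
        raise ValueError("a flat spectrum cannot be scaled to the range 0-1")
    return [(value - lowest) / (highest - lowest) for value in spectrum]


# Hypothetical band shapes: the second is 100 times more intense than the first.
weak_band = [0.001, 0.004, 0.010, 0.006, 0.002]
strong_band = [0.10, 0.40, 1.00, 0.60, 0.20]

scaled_weak = min_max_scale(weak_band)
scaled_strong = min_max_scale(strong_band)
```

Scaling removes the intensity difference but not the noise, so a noisy low-concentration spectrum remains noisy after normalization.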

Proteins used for the analysis of secondary structure in the previous work performed in

this research and presented in the previous chapters were collected at a uniform

concentration of 50 mg/mL; this was to ensure high quality spectra. However,

pre-processing methods such as normalization (as seen in Figure 6.2 of the Amide I region)

and second derivative ensure that, even at low concentration, PLS can extract the latent

variables that explain the structural variations in the dataset.(239)

Figure 6.2. Spectra of the 66 protein samples in the Amide I region at concentrations between

200 mg/mL and 0.19 mg/mL, scaled for PLS analysis.

Figure 6.3 shows the scaled spectra of lysozyme acquired at concentrations of 200

mg/mL, 50 mg/mL and 0.19 mg/mL. While all three spectra show peaks at about 1654 cm-1,

there are noticeable differences between the spectra; the spectrum at the lowest concentration

(0.19 mg/mL) shows peaks at 1610 cm-1 and 1680 cm-1 and seems to have more

structural variations than the spectra at 50 mg/mL and 200 mg/mL. The spectrum at

200 mg/mL appears broader than the other two spectra in the plot.

[Plot: normalized intensity (0 – 1) versus wavenumber (cm-1) over the Amide I region, 1600 – 1700 cm-1.]



Figure 6.3. Comparison of normalized spectra of lysozyme at concentrations of 0.19 mg/mL,

50 mg/mL and 200 mg/mL in the Amide I region, with the major peak at 1654 cm-1.

The results of the secondary structure analysis for lysozyme in Tables 6.1 and 6.2 reflect

these observations; the 0.19 mg/mL results of 0.33 for α-helix and 0.07 for β-sheet were

closer to the reference values of 32% α-helix and 6% β-sheet than the 200 mg/mL results

of 0.37 for α-helix and 0.02 for β-sheet.

To further demonstrate the quality of the spectra obtained at the low concentration of

0.19 mg/mL, the second derivative spectra of IgG at 200 mg/mL, 50 mg/mL and

0.19 mg/mL were plotted and compared to their normalized spectra; this can be seen in

Figure 6.4a and Figure 6.4b. The peaks of the second derivative spectra of the three

different concentrations are identical to the peaks of the normalized spectra. The

spectrum measured at 200 mg/mL, in either plot, shows a peak at 1630 cm-1 (β-sheet

band)(136) while the spectra at 50 mg/mL and 0.19 mg/mL show peaks at 1640 cm-1. To

determine if there is any difference in the secondary structure contents of the proteins

based on the different concentrations, the PLS results for these proteins were reviewed;

they are listed in Tables 6.1 – 6.5. The PLS prediction for IgG shows that the values at

200 mg/mL for α-helix and β-sheet are in better agreement with the reference values than

those at 50 mg/mL and 0.19 mg/mL. Unlike β-lactoglobulin or lysozyme, IgG undoubtedly

does better with the measurement at 200 mg/mL.

Figure 6.4a. The second derivative spectra of the Amide I region of IgG in aqueous solution at

200 mg/mL, 50 mg/mL and 0.19 mg/mL collected with FTIR spectroscopy

[Figure 6.4a plot: second derivative (d2A/dν2, ×10-3) versus wavenumber (cm-1) over 1600 – 1700 cm-1, with traces for IgG at 0.19, 50.00 and 200.00 mg/mL.]



Figure 6.4b. The normalized spectra of the Amide I region of IgG in aqueous solution at 200

mg/mL, 50 mg/mL and 0.19 mg/mL collected with FTIR spectroscopy.

The PLS results for the secondary structure contents of the six proteins used in this

analysis are listed in Tables 6.1 - 6.5. The bar graphs have been colour coded for ease

of review. The left-hand black bar for each protein is the reference value from

DSSP analysis of X-ray crystallographic data for that protein, and the white bar with a

black outline (50 mg/mL) is the concentration used for the prediction model.

IgG gave the most inconsistent results for the measured spectra. Gorga, et al.(207)

reported that the secondary structure analysis of proteins with low amounts of α-helix

such as IgG can be difficult to determine with CD spectra; however, secondary structure

composition of IgG FTIR spectra from this analysis compared to the corresponding

DSSP-derived reference set values was within ±0.05 accuracy (see Tables 6.1 – 6.5 for

secondary structure predictions). The predicted value of α-helix for IgG at 0.19 mg/mL

(α-helix: 0.03, β-sheet: 0.40) was low but the result at 0.39 mg/mL (α-helix: 0.07, β-

sheet: 0.39) obtained from this analysis were in agreement with Circular Dichroism

(CD) results for α-helix (8%) and β-sheet (41.3%) provided by Tetin et al.;(240)

although the concentration used for the CD measurement was not stated, CD spectra

are normally collected at 0.1 - 0.2 mg/mL. β-lactoglobulin results were in reasonable

agreement with X-ray crystallography values for all structural types for concentrations

between 0.19 and 1.56 mg/mL (see Tables 6.1 – 6.4 for values). Ubiquitin, at higher

concentrations (50 mg/mL, 100 mg/mL, 200 mg/mL), had poorer α-helix results (0.20,

0.19 and 0.24, respectively) compared to the reference value of 0.16. Lysozyme, with a

major peak at 1654 cm-1 due to the presence of α-helix, gave results similar to the

X-ray crystallography structure (within ±0.05 accuracy) for all concentrations.

Lysozyme results at 0.19 mg/mL (α-helix: 0.33, β-sheet: 0.07), compared with the X-ray

crystallography values used for this study (α-helix: 0.32, β-sheet: 0.06), were also

comparable to the results of Chen, et al.(241) for α-helix and β-sheet content from

CD (α = 0.37, β = 0.11), with X-ray crystallography values of α = 0.41 and β = 0.16.

Fibrinogen maintained accurate results at higher concentrations (3.13 mg/mL to

200 mg/mL) for the prediction of α-helix and β-sheet but also had very good results at

the lower concentrations. These results can be viewed in Tables 6.1 – 6.5. BSA also had

very good results at lower protein concentrations from 0.19 mg/mL up to 6.25 mg/mL.

It appears that the all-α proteins gave consistent results for the secondary structure

predictions across the board except for the prediction of β-sheet content. Overall,

results at lower concentrations for FTIR are consistent with their reference

protein values from X-ray crystallography and also with CD results for the compared

proteins given above.(241, 242) The results shown in Tables 6.1 – 6.5 are not

concentration dependent, and this is quite evident with 310-helix (Table 6.3), where

proteins at lower and higher concentrations sometimes had more accurate results than

those at the concentrations in the middle; therefore, the randomness seen in the data

values may be due to noise or systematic error caused by interference during spectral

measurement.



α-helix (PDB = reference value; remaining columns are protein concentrations in mg/mL)
Protein Name                     PDB    0.19    0.39    0.78    1.56    3.13    6.25   12.50   25.00   50.00  100.00  200.00
β-Lactoglobulin                 0.10    0.10    0.10    0.10    0.11    0.11    0.12    0.07    0.11    0.12    0.10    0.08
BSA                             0.70    0.69    0.71    0.72    0.71    0.72    0.69    0.74    0.64    0.67    0.71    0.70
Fibrinogen (bovine plasma)      0.70    0.75    0.73    0.70    0.66    0.66    0.64    0.67    0.64    0.71    0.70    0.71
IgG (Human)                     0.08    0.03    0.07    0.11    0.08    0.12    0.03    0.04    0.12    0.06    0.10    0.08
Lysozyme (Chicken egg)          0.32    0.33    0.31    0.31    0.30    0.34    0.31    0.36    0.36    0.28    0.32    0.37
Ubiquitin (bovine)              0.16    0.17    0.15    0.20    0.16    0.14    0.11    0.16    0.14    0.20    0.19    0.24

[Bar chart: α-helix content of each protein for the PDB value and each concentration; vertical axis 0.00 – 0.80.]

Table 6.1. PLS results of α-helix content of 6 proteins. FTIR spectra were used for this analysis. Each protein spectrum was collected at concentrations ranging from 0.19 mg/mL to 200 mg/mL.



β-Sheet (PDB = reference value; remaining columns are protein concentrations in mg/mL)
Protein Name                     PDB    0.19    0.39    0.78    1.56    3.13    6.25   12.50   25.00   50.00  100.00  200.00
β-Lactoglobulin                 0.42    0.40    0.43    0.44    0.40    0.42    0.41    0.39    0.40    0.40    0.39    0.38
BSA                             0.00    0.00    0.01   -0.03    0.01   -0.01   -0.07   -0.01    0.04    0.05    0.06    0.06
Fibrinogen (bovine plasma)      0.02    0.07    0.07   -0.07    0.07    0.02    0.05    0.03    0.03    0.02    0.03    0.02
IgG (Human)                     0.37    0.40    0.39    0.40    0.33    0.35    0.37    0.39    0.40    0.41    0.39    0.38
Lysozyme (Chicken egg)          0.06    0.07    0.08    0.08    0.05    0.05    0.03    0.06    0.07    0.06    0.02    0.02
Ubiquitin (bovine)              0.31    0.33    0.33    0.30    0.32    0.32    0.26    0.30    0.28    0.31    0.28    0.21

[Bar chart: β-sheet content of each protein for the PDB value and each concentration; vertical axis -0.10 – 0.50.]

Table 6.2. PLS results of β-sheet content of 6 proteins. FTIR spectra were used for this analysis. Each protein spectrum was collected at concentrations ranging from 0.19 mg/mL to 200 mg/mL.



310-helix (PDB = reference value; remaining columns are protein concentrations in mg/mL)
Protein Name                     PDB    0.19    0.39    0.78    1.56    3.13    6.25   12.50   25.00   50.00  100.00  200.00
β-Lactoglobulin                 0.06    0.06    0.06    0.06    0.06    0.06    0.05    0.05    0.05    0.05    0.05    0.04
BSA                             0.04    0.04    0.04    0.05    0.04    0.03    0.03    0.04    0.05    0.05    0.05    0.05
Fibrinogen (bovine plasma)      0.02    0.03    0.02    0.01    0.03    0.03    0.02    0.02    0.01    0.01    0.04    0.03
IgG (Human)                     0.04    0.05    0.05    0.05    0.03    0.05    0.05    0.05    0.05    0.05    0.05    0.04
Lysozyme (Chicken egg)          0.10    0.10    0.10    0.10    0.09    0.11    0.10    0.09    0.10    0.09    0.09    0.09
Ubiquitin (bovine)              0.08    0.07    0.08    0.08    0.08    0.09    0.09    0.07    0.07    0.08    0.07    0.06

[Bar chart: 310-helix content of each protein for the PDB value and each concentration; vertical axis 0.00 – 0.12.]

Table 6.3. PLS results of 310-helix content of 6 proteins. FTIR spectra were used for this analysis. Each protein spectrum was collected at concentrations ranging from 0.19 mg/mL to 200 mg/mL.



Turns (PDB = reference value; remaining columns are protein concentrations in mg/mL)
Protein Name                     PDB    0.19    0.39    0.78    1.56    3.13    6.25   12.50   25.00   50.00  100.00  200.00
β-Lactoglobulin                 0.12    0.12    0.12    0.12    0.12    0.12    0.12    0.10    0.10    0.10    0.10    0.10
BSA                             0.09    0.09    0.09    0.09    0.08    0.09    0.10    0.07    0.08    0.10    0.10    0.11
Fibrinogen (bovine plasma)      0.06    0.06    0.06    0.06    0.06    0.05    0.06    0.06    0.06    0.06    0.06    0.08
IgG (Human)                     0.08    0.09    0.10    0.10    0.09    0.08    0.07    0.09    0.10    0.08    0.10    0.10
Lysozyme (Chicken egg)          0.24    0.24    0.24    0.24    0.23    0.24    0.24    0.24    0.24    0.24    0.24    0.24
Ubiquitin (bovine)              0.16    0.15    0.16    0.16    0.16    0.16    0.17    0.16    0.16    0.16    0.16    0.14

[Bar chart: turns content of each protein for the PDB value and each concentration; vertical axis 0 – 0.3.]

Table 6.4. PLS results of Turns content of 6 proteins. FTIR spectra were used for this analysis. Each protein spectrum was collected at concentrations ranging from 0.19 mg/mL to 200 mg/mL.



Other (PDB = reference value; remaining columns are protein concentrations in mg/mL)
Protein Name                     PDB    0.19    0.39    0.78    1.56    3.13    6.25   12.50   25.00   50.00  100.00  200.00
β-Lactoglobulin                 0.29    0.29    0.29    0.29    0.31    0.29    0.30    0.35    0.36    0.36    0.35    0.35
BSA                             0.16    0.16    0.15    0.16    0.15    0.17    0.15    0.15    0.17    0.16    0.17    0.16
Fibrinogen (bovine plasma)      0.21    0.22    0.21    0.20    0.21    0.20    0.21    0.23    0.21    0.21    0.22    0.21
IgG (Human)                     0.42    0.43    0.35    0.36    0.42    0.40    0.40    0.42    0.36    0.43    0.35    0.35
Lysozyme (Chicken egg)          0.28    0.27    0.28    0.28    0.28    0.28    0.29    0.28    0.29    0.28    0.27    0.27
Ubiquitin (bovine)              0.28    0.29    0.29    0.28    0.28    0.29    0.25    0.28    0.28    0.29    0.27    0.26

[Bar chart: disordered content of each protein for the PDB value and each concentration; vertical axis 0 – 0.5.]

Table 6.5. PLS results of Disordered content of 6 proteins. FTIR spectra were used for this analysis. Each protein spectrum was collected at concentrations ranging from 0.19 mg/mL to 200 mg/mL.

Table 6.6 lists the R2 and overall root-mean-square error of prediction (δ); the estimated δ

values are less than 0.04 for the 66 protein samples (11 different concentrations per protein).

This is a good prediction error for the secondary structure analysis of the proteins in the

test set. Figures 6.5 and 6.6 show the line of fit from the analysis.

Table 6.6. R2 and RMSE (δ) of PLS analysis of secondary structure contents determined from FTIR spectra in the Amide I region.

        α-helix   β-Sheet   310-helix   Turns   Disordered
R2      0.98      0.96      0.88        0.97    0.88
δ       0.039     0.031     0.009       0.010   0.026
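
The two statistics in Table 6.6 can be computed as sketched below. This is an illustration only: R2 is taken here to be the coefficient of determination, which is an assumption (it may instead have been computed as a squared correlation coefficient), and the function names are hypothetical. The example uses only the six α-helix values at 0.19 mg/mL from Table 6.1, not the full set of 66 samples summarised in Table 6.6, so the numbers it produces are not expected to match that table.

```python
import math


def rmse(reference, predicted):
    """Root-mean-square error between reference and predicted fractions."""
    squared = [(p - r) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(reference, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# α-helix fractions from Table 6.1, in the order β-lactoglobulin, BSA,
# fibrinogen, IgG, lysozyme, ubiquitin.
pdb_reference = [0.10, 0.70, 0.70, 0.08, 0.32, 0.16]
predicted_0_19 = [0.10, 0.69, 0.75, 0.03, 0.33, 0.17]   # 0.19 mg/mL column

error = rmse(pdb_reference, predicted_0_19)
fit = r_squared(pdb_reference, predicted_0_19)
print(f"RMSE = {error:.3f}, R2 = {fit:.3f}")
```

Pooling all 11 concentrations for the six proteins in the same way would give one RMSE and one R2 per structural type, which is the form in which the results are summarised in Table 6.6.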

Figure 6.5. Plot of predicted α-helix content from FTIR spectra at concentrations between 0.19

mg/mL and 200 mg/mL in comparison to the content measured from X-ray crystallography. The

Amide I region (1600 – 1700 cm-1) was used for the PLS analysis.

[Figure 6.5 plot: predicted α-helix versus α-helix test set (reference) values, with one marker series per protein.]



Figure 6.6. Plot of predicted β-sheet content from FTIR spectra at concentrations between

0.19 mg/mL and 200 mg/mL versus the content measured from X-ray crystallography. The

Amide I region (1600 – 1700 cm-1) was used for the PLS analysis.

6.6 Conclusion

The results presented in this work show the potential usefulness of ATR-FTIR with PLS

in determining protein secondary structure from low concentrations of samples. It is

important to analyse methods for their performance and limitations in order to ensure

that they are fit for purpose (225, 228); in this case, using multivariate regression

analysis on decreased sample concentrations to determine the method’s usefulness.

It was difficult to acquire protein measurements below 0.2 mg/mL for some of the

proteins used in this study because the signal-to-noise ratio decreases as the protein

concentration decreases, and at such low concentrations the spectra are unusable;

therefore, results are focused on protein concentrations between 0.19 mg/mL and 200

mg/mL. To acquire good spectra by the conventional FTIR method, protein concentrations

above 10 mg/mL are normally used for proteins in H2O solutions due to the strong

absorption of the water molecule in the Amide I region.(13, 136) However, technical

advances in FTIR spectroscopy have made it possible to routinely acquire protein

spectra from aqueous solutions at concentrations reportedly as low as 0.5 mg/mL with

a good signal-to-noise ratio,(136) therefore making it possible to compare FTIR results to

CD, which is a well-established technique for determining the secondary structure of

proteins in solution. In addition, CD spectroscopy is limited by the strong overlap of

β-sheet, β-turns and unordered structure.(207)

Results from this study gave an RMSE of less than 0.04 for α-helix and β-sheet

prediction for samples collected at concentrations between 0.19 mg/mL and 200

mg/mL. These results are more accurate than the CD prediction results with RMSE of

0.06 – 0.1 for both α-helix and β-sheet (sample concentrations collected at 0.1 – 2.0

mg/mL).(136, 175, 177, 243, 244)

Again, this study was not done to determine the limit of detection of protein spectra by

ATR-FTIR, rather, it was done to determine how well multivariate regression analysis

methods like PLS can determine secondary structures from spectra of low

concentrations. The results derived from this study show that the combination of

second derivative and PLS analysis of FTIR spectra of proteins has made it possible to

resolve peak intensities and statistically derive their secondary structure conformations

even at low concentrations.


7 Conclusion

This thesis has demonstrated that spectroscopic techniques can be used in conjunction

with multivariate analytical tools to achieve very good results for the structural

analysis of proteins in solution. Even though X-ray crystallography and NMR are used

to solve the three-dimensional structures of proteins, they have their limitations: protein

crystallography is time consuming and many proteins do not readily crystallize, while

NMR is limited by protein size and is also time consuming.(33, 210, 245,

246) Fourier transform infrared (FT-IR) spectroscopy is a rapid spectroscopic

technique that can give fingerprints of protein structures in solution, solid or gas phase.

It can be used to reduce the sample analysis period significantly due to its rapid spectral

measurement.(36, 121) However, FTIR spectra suffer from interfering background

signals that must be reduced or removed before any data analysis.(247, 248) Many

methods are used to remove background effects from the spectra; these include some

of the methods used in this thesis: baseline correction, Multiplicative Scatter

Correction (MSC), Standard Normal Variate (SNV), and second derivative. There is

also a need to analyse the underlying biological differences between the identified FTIR

spectral regions and also to determine their secondary structures.(69, 249) Methods that

have been used to determine protein structure include band deconvolution and neural

networks (ANN) but their models are difficult to interpret.(250) MVA has proved to be

an appropriate tool to carry out the quantification of protein secondary structure from

their infrared spectra because of its simplicity and ability to reduce the data to a lower

dimension and extract the relevant information.(251, 252)



Myra Kinalwa(177) used similar methods on Raman spectra and achieved very good

results especially with the Amide I, II and III regions combined; however, the size of

the protein dataset was too small to validate the model with an independent test set.

Raman spectroscopy, a technique based on inelastic scattering of monochromatic light,

is making its mark as another vibrational spectroscopic method that has been

successfully used in the determination of the secondary structure of protein.(137, 253)

FTIR and Raman techniques provide complementary information; while Raman

spectroscopy may not have the H2O absorption in the Amide I region found in FTIR, its

signal to noise is inferior to FTIR absorption spectra and the quantitative results from

FTIR method are better.(254, 255)

In the first results chapter (Chapter 3), FTIR spectra were obtained for proteins dissolved in H2O.

The absorbance spectra of predominantly α-helical, β-sheet and mixture of α and β

proteins were measured and analysed. This analysis made use of the strength of

Principal Component Analysis (PCA), an MVA classification method, on ATR-FTIR

spectra for classification of proteins into their natural groupings of All α, All β, α/β,

α+β, as defined by SCOP. SCOP (release 1.75) currently has 110800 protein

domains classified into 1195 folds; this means there are 92.72 domains per fold on average.

Performing data classification on the SCOP folds would need a large amount of data;(256, 257)

therefore the analysis was limited to the protein class structure.

The variations in the samples used for this analysis were described by 4 PCs, which

explained 95.39% of the total variation in the data. The protein spectra classifications

showed strong groupings of proteins of the α and β classes, with those of mixed types

falling in neither group. The PCA results obtained using FTIR spectra were comparable to,

or even better than, the CD methods currently used for protein classification. ATR-FTIR

absorption spectra contain a great deal of information that can be utilized for

classification and identity determination of proteins and other biologic systems; for

example, qualitative analysis tools like PCA can be used in the identification of

biomarkers of several diseases like cancer, Alzheimer’s and other autoimmune disorders

due to changes in the secondary structure of their biomolecules.(258, 259)

In the second results chapter (Chapter 4), ATR-FTIR spectroscopy in combination with partial

least squares (PLS), a multivariate regression method, was used to calibrate and

quantify different proteins in H2O; the calibration dataset covered the Amide I, II and III

(1700 – 1200 cm-1) regions of the protein spectra. The ATR-FTIR spectra showed

very good predictive value for the calibration (to within 0.01 to 0.05 RMSE) and the

independent test set prediction (from 0.01 – 0.12 RMSEP). Spectral features linked to

α-helix and β-sheet had excellent structural determination; structural features like 310

helix, turns and random coils were also elucidated with good results as listed in Chapter

4 and Appendix B of this thesis.

A version of PLS called interval PLS (iPLS) uses subintervals of the full spectrum.

Local models are built with intervals of the spectral region; the performance of the local

models is compared to that of the full spectrum model. By using intervals of the

spectrum, regions specific to each secondary structure motif can be identified. Using

only the significant spectral regions can reduce spectral measurement time;

however, selecting the number of intervals that yields an optimum result for each

motif can be time-consuming. The iPLS result in this analysis, although useful, was not

as good as that of the traditional PLS model.
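
The interval idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in this thesis: it uses a plain single-response PLS (PLS1, NIPALS algorithm), equal-width intervals, and leave-one-out cross-validation to score each local model against the full-spectrum model. The toy "spectra" and structure fractions are synthetic and deliberately constructed so that only the middle interval carries the signal; all function names are hypothetical.

```python
import math


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (NIPALS) on mean-centred data."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    components = []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # no covariance left to model
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict the response for one spectrum by replaying the deflation steps."""
    x_mean, y_mean, components = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * lj for ej, lj in zip(e, load)]
    return y_hat


def rmsecv(X, y, n_components):
    """Leave-one-out root mean square error of cross-validation."""
    squared = 0.0
    for i in range(len(X)):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_components)
        squared += (pls1_predict(model, X[i]) - y[i]) ** 2
    return math.sqrt(squared / len(X))


def ipls(X, y, n_intervals, n_components):
    """Return the full-spectrum RMSECV and one RMSECV per equal-width interval."""
    p = len(X[0])
    bounds = [(k * p // n_intervals, (k + 1) * p // n_intervals)
              for k in range(n_intervals)]
    local = [((lo, hi), rmsecv([row[lo:hi] for row in X], y, n_components))
             for lo, hi in bounds]
    return rmsecv(X, y, n_components), local


# Synthetic data: only channels 4-7 depend on the (hypothetical) structure fraction.
fractions = [0.10, 0.25, 0.40, 0.55, 0.30, 0.65, 0.15, 0.50]
profile = [0.2, 1.0, 0.8, 0.3]
spectra = []
for i, frac in enumerate(fractions):
    left = [math.sin(2.3 * i + 0.7 * j) for j in range(4)]   # unrelated variation
    middle = [frac * b for b in profile]                     # informative band
    right = [math.cos(1.9 * i + 1.1 * j) for j in range(4)]  # unrelated variation
    spectra.append(left + middle + right)

full_error, interval_errors = ipls(spectra, fractions, n_intervals=3, n_components=1)
best_interval, best_error = min(interval_errors, key=lambda item: item[1])
```

In this constructed example the local model on the informative interval outperforms the full-spectrum model, which is the situation iPLS is designed to detect. It says nothing about the real data, where the full-spectrum PLS model gave the better result.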



Many researchers have focused their analysis on α-helix and β-sheet contents because these

structural features are more abundant than 3₁₀-helix and other structural types.(150) With

the use of the second derivative, underlying spectral peaks can be resolved and their

positions identified, thereby aiding the multivariate regression analysis of different structural

motifs. One of the uses of statistical quantitative analysis in the biomedical field is in

pharmaceuticals, to confirm their structural conformation from manufacturing to

distribution, including over their shelf lives.(64, 69) Multivariate regression analysis can

also be combined with X-ray crystallography or NMR spectroscopy to derive information

from their protein measurements, owing to the reliability and accuracy of the statistical

method.(260) Three key advantages of MVA are the small sample size needed to train

the model, the method’s ability to handle multivariate data, and its tolerance of missing data in the

spectra;(261) however, PLS has a known weakness of being sensitive to the number of

components used in the model: increasing the number of components or descriptors in

the model improves its fit to the calibration data but decreases its ability to fit new data well.(262) Zeng and

Li(263) pointed out that conventional PLS cannot handle continuous data streams because all of the data

are loaded into memory. An incremental PLS for continuous data streams should help in

the analysis of large data sets and real-time data.
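
The way a second derivative resolves overlapping bands can be shown with a small Python sketch. It is an illustration under stated assumptions only: two synthetic Gaussian bands of equal height, placed at 1632 and 1654 cm⁻¹ with a width chosen so that they merge into a single peak, and a plain three-point finite difference in place of the smoothed derivative filters (for example Savitzky-Golay) usually applied to measured spectra. Band positions, widths and function names are hypothetical.

```python
import math


def gaussian_band(x, centre, width):
    """Unit-height Gaussian band; width is the standard deviation in cm^-1."""
    return math.exp(-0.5 * ((x - centre) / width) ** 2)


def second_derivative(values, step):
    """Three-point central-difference second derivative at interior points."""
    return [(values[i - 1] - 2.0 * values[i] + values[i + 1]) / step ** 2
            for i in range(1, len(values) - 1)]


def local_minima(values):
    """Indices of strict local minima of a sequence."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]


# Synthetic amide I region on a 1 cm^-1 grid with two overlapping bands.
wavenumbers = list(range(1600, 1701))
spectrum = [gaussian_band(x, 1632, 12) + gaussian_band(x, 1654, 12)
            for x in wavenumbers]

# Maxima of the raw spectrum (minima of the negated spectrum).
raw_maxima = [wavenumbers[i] for i in local_minima([-a for a in spectrum])]

# Bands appear as negative minima in the second-derivative spectrum.
d2 = second_derivative(spectrum, 1.0)
d2_axis = wavenumbers[1:-1]
band_positions = [d2_axis[i] for i in local_minima(d2) if d2[i] < 0.0]
```

The raw synthetic spectrum shows a single maximum between the two bands, whereas the second derivative shows two negative minima close to the true band centres. On measured spectra the derivative also amplifies noise, which is why smoothing is normally applied first.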

Chapter 5 illustrated the strength of using FTIR to distinguish between native and

unfolding proteins between pH 2.0 and pH 7.0 in increments of 0.5 pH units. The PLS

calibration model established in Chapter 4 was used to elucidate the secondary structure

of the denatured α-lactalbumin (α-LA) at each pH. The second derivative

spectra showed important spectral changes that occurred during denaturation. Results

from the analysis in this study showed the structure of α-LA in the native state, in the

absence of bound Ca²⁺ (the apo-protein), where an intermediate formed and the

β-sheet structure diminished, and finally the formation of intramolecular anti-parallel

β-sheet in the acidic (A-state) protein. Proteins need to be in their native state to operate

efficiently; therefore, studying the changes in their aggregation stages may help in

understanding how proteins misfold.(183, 203) A related study using Raman

spectroscopy with two-dimensional correlation spectroscopy (2DCOS) to study the spectral intensity

variations of α-lactalbumin between pH 2 and 7 was successfully carried out

by Ashton and Blanch.(254) The 2DCOS method was used to identify secondary

structure variations occurring in the different FTIR spectral bands of denatured α-LA.
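
As a rough illustration of what a 2DCOS calculation involves, the synchronous correlation spectrum can be computed from a series of spectra recorded along a perturbation such as pH. The Python sketch below assumes the mean spectrum is used as the reference and covers only the synchronous part; the asynchronous part, which needs the Hilbert-Noda transformation, is omitted. The intensities are hypothetical values for three band positions, not measurements from this thesis.

```python
def synchronous_spectrum(spectra):
    """Synchronous 2D correlation matrix of a perturbation series.

    spectra: list of m spectra (one per perturbation step), each a list of
    intensities at the same n spectral positions.
    """
    m, n = len(spectra), len(spectra[0])
    mean_spectrum = [sum(s[k] for s in spectra) / m for k in range(n)]
    # Dynamic spectra: deviation of each spectrum from the reference (mean).
    dynamic = [[s[k] - mean_spectrum[k] for k in range(n)] for s in spectra]
    return [[sum(d[a] * d[b] for d in dynamic) / (m - 1) for b in range(n)]
            for a in range(n)]


# Hypothetical intensities at three band positions over four pH steps:
# band 0 decreases, band 1 increases, band 2 stays constant.
series = [
    [1.0, 0.2, 0.5],
    [0.8, 0.4, 0.5],
    [0.6, 0.6, 0.5],
    [0.4, 0.8, 0.5],
]
sync = synchronous_spectrum(series)
```

A positive diagonal element marks a band whose intensity changes along the perturbation, a negative cross peak indicates two bands changing in opposite directions, and a zero row indicates a band that does not respond.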

In Chapter 6, the calibration model was applied to proteins of various concentrations to

determine their secondary structure. The main aim of that chapter was to test the model

for its limit of quantification. Some proteins clearly absorb well at low

concentrations; in general, multivariate regression analysis performs well with

proteins at concentrations as low as 0.20 mg/mL. This method can serve as a good

detection tool for forensic and biomedical applications that require detecting substances at low

concentrations.

Another popular type of multivariate analysis method used in the elucidation of

protein structure is the artificial neural network (ANN).(252) ANNs are also popular for determining

the secondary structure of proteins from spectra; however, the methodology is not as

fast as PLS and not as easy to learn. Moreover, overfitting can be more severe in neural

networks than in PLS because of the number of layers and neurons used for the analysis.(253)

In summary, FTIR has advantages over other widely used structural spectroscopic

methods like CD and NMR; FTIR is fast and easy to use and requires very little sample.

The four result chapters in this thesis show how multivariate statistical analytical tools

like PLS and PCA can be used to obtain structural contents from ATR-FTIR spectra.



These results show that FTIR spectroscopy can extend its applications in the

biotechnology industry as a rapid method for detection and quantification of substances,

especially for proteins in solution.

APPENDIX

8 References

1. Whitford D. Proteins: Structure and function. Hoboken, N.J.: John Wiley &

Sons; 2013.

2. Lehninger AL, Nelson DL, Cox MM. Lehninger principles of biochemistry.

New York, N.Y.: W. H. Freeman; 2008.

3. McMurry J. Fundamentals of organic chemistry: Cengage Learning; 2010.

4. Purves WK. Life, the science of biology: Sinauer Associates; 2004.

5. Clayden J, Greeves N, Warren S, Wothers P. Organic chemistry. Oxford, England:

Oxford University Press; 2000.

6. Degrève L, Fuzo CA, Caliri A. Extended secondary structures in proteins.

Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics.

2014;1844(2):384-388.

7. Eidhammer I, Jonassen I, Taylor WR. Protein bioinformatics: An algorithmic

approach to sequence and structure analysis: J. Wiley & Sons; 2004.

8. Petsko GA, Ringe D. Protein structure and function: New Science Press;

2004.

9. Brändén CI, Tooze J. Introduction to protein structure: Garland Pub.; 1999.

10. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, et al. Molecular biology of

the cell. 4th ed. New York: Garland Science; 2002.

11. Vieira-Pires RS, Morais-Cabral JH. 3₁₀-helices in channels and other

membrane proteins. The Journal of General Physiology. 2010;136(6):585-

592.

12. Barlow DJ, Thornton JM. Helix geometry in proteins. Journal of Molecular

Biology. 1988;201(3):601-619.

13. Kong J, Yu S. Fourier transform infrared spectroscopic analysis of protein

secondary structures. Acta Biochimica et Biophysica Sinica. 2007;39(8):549-

559.

14. Pain RH. Mechanisms of protein folding: Oxford University Press; 2000.

15. Heller BA, Gindt YM. A biochemical study of noncovalent forces in proteins

using phycocyanin from spirulina. Journal of Chemical Education.

2000;77(11):1458.

16. Fold classification based on structure-structure alignment of proteins.



17. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM.

CATH – a hierarchic classification of protein domain structures. Structure.

1997;5(8):1093-1109.

18. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: A structural

classification of proteins database for the investigation of sequences and

structures. Journal of Molecular Biology. 1995;247(4):536-540.

19. Yu X, Wang C, Li Y. Classification of protein quaternary structure by

functional domain composition. BMC Bioinformatics. 2006;7(1):187.

20. Squire J, Parry DAD. Fibrous proteins: Muscle and molecular motors:

Elsevier; 2005.

21. Ruso JM, Attwood D, García M, Prieto G, Sarmiento F, Taboada P, et al.

Interaction of amphiphilic propranolol hydrochloride with haemoglobin and

albumin in aqueous solution. Langmuir : The ACS Journal of Surfaces and

Colloids. 2000;16(26):10449-10455.

22. Richter JD. mRNA formation and function: Elsevier Science; 1997.

23. Permiakov EA. Alpha-lactalbumin: Nova Science Publishers;

2005.

24. Paci E, Smith LJ, Dobson CM, Karplus M. Exploration of partially unfolded

states of human α-lactalbumin by molecular dynamics simulation. Journal of

Molecular Biology. 2001;306(2):329-347.

25. Spiro RG. Protein glycosylation: nature, distribution, enzymatic formation,

and disease implications of glycopeptide bonds. Glycobiology.

2002;12(4):43R-56R.

26. Ovádi J, Orosz F. Protein folding and misfolding: Neurodegenerative

diseases: Springer; 2008.

27. Shenolikar S. Protein phosphorylation in health and disease: Elsevier

Science & Technology Books; 2012.

28. Martin I, Kim JW, Lee BD, Kang HC, Xu JC, Jia H, et al. Ribosomal protein

S15 phosphorylation mediates LRRK2 neurodegeneration in Parkinson's disease.

Cell. 2014;157(2):472-485.

29. Cohen P. The role of protein phosphorylation in human health and disease.

European Journal of Biochemistry. 2001;268(19):5001-5010.

30. Muller P, Ruckova E, Halada P, Coates PJ, Hrstka R, Lane DP, et al. C-

terminal phosphorylation of Hsp70 and Hsp90 regulates alternate binding to

co-chaperones CHIP and HOP to determine cellular protein folding/degradation

balances. Oncogene. 2013;32(25):3101-3110.

31. Dobson CM. Protein folding and misfolding. Nature. 2003;426(6968):884-

890.

32. Seddon AM, Curnow P, Booth PJ. Membrane proteins, lipids and detergents:

Not just a soap opera. Biochimica et Biophysica Acta (BBA) - Biomembranes.

2004;1666(1–2):105-117.

33. Barton GJ. Protein secondary structure prediction. Current Opinion in

Structural Biology. 1995;5(3):372-376.

34. Jackson M, Mantsch H. The use and misuse of FTIR spectroscopy in the

determination of protein structure. Critical Reviews in Biochemistry and

Molecular Biology. 1995;30(2):95-120.

35. Smith BC. Fundamentals of Fourier transform infrared spectroscopy, second

edition: Taylor & Francis; 2011.

36. Haris PI, Severcan F. FTIR spectroscopic characterization of protein structure

in aqueous and non-aqueous media. Journal of Molecular Catalysis B:

Enzymatic. 1999;7(1–4):207-221.

37. Magazu S, Calabro E. Studying the electromagnetic-induced changes of the

secondary structure of bovine serum albumin and the bioprotective

effectiveness of trehalose by Fourier transform infrared spectroscopy. The

Journal of Physical Chemistry B. 2011;115(21):6818-6826.

38. Smith E, Dent G. Modern Raman spectroscopy: A practical approach:

Wiley; 2013.

39. Cai S, Singh BR. Identification of β-turn and random coil amide III infrared

bands for secondary structure estimation of proteins. Biophysical Chemistry.

1999;80(1):7-20.

40. Dunuwila DD, Berglund KA. Identification of infrared spectral features

related to solution structure for utilization in solubility and supersaturation

measurements. Organic Process Research & Development. 1997;1(5):350-

354.

41. Dousseau F, Pezolet M. Determination of the secondary structure content of

proteins in aqueous solutions from their amide I and amide II infrared bands.

Comparison between classical and partial least-squares methods.

Biochemistry. 1990;29(37):8771-8779.

42. Stuart BH. A Fourier transform infrared spectroscopic study of the

secondary structure of myelin basic protein in reconstituted myelin.

Biochemistry and molecular biology international. 1996;38(4):839-845.

43. Davis R, Mauer IJ. Fourier transform infrared (FTIR) spectroscopy: A rapid

tool for detection and analysis of foodborne pathogenic bacteria. In: A. M-V,

editor. Current research, technology and education topics in applied

microbiology and microbial biotechnology. 2: Formatex Research Center;

2010.582-1594.

44. Nishikida K, Nishio E, Hannah RW. Selected applications of modern FT-IR

techniques: Taylor & Francis; 1996.



45. Pretsch E, Bühlmann P, Badertscher M. Structure determination of organic

compounds: Tables of spectral data: Springer; 2009.

46. Ramer G, Lendl B. Attenuated total reflection Fourier transform infrared

spectroscopy. Encyclopedia of Analytical Chemistry: John Wiley & Sons,

Ltd; 2006.

47. Schrader B. Infrared and Raman Spectroscopy: Methods and Applications:

Wiley; 2008.

48. Glassford SE, Byrne B, Kazarian SG. Recent applications of ATR-FTIR

spectroscopy and imaging to proteins. Biochimica et Biophysica Acta (BBA) -

Proteins and Proteomics. 2013;1834(12):2849-2858.

49. Goldberg ME, Chaffotte AF. Undistorted structural analysis of soluble

proteins by attenuated total reflectance infrared spectroscopy. Protein

Science. 2005;14(11):2781-2792.

50. Arrondo JLR, Goñi FM. Structure and dynamics of membrane proteins as

studied by infrared spectroscopy. Progress in Biophysics and Molecular

Biology. 1999;72(4):367-405.

51. Raman S, Lange OF, Rossi P, Tyka M, Wang X, Aramini J, et al. NMR

structure determination for larger proteins using backbone-only data. Science.

2010;327(5968):1014-1018.

52. Wu TD, Schmidler SC, Hastie T, Brutlag DL. Journal of Computational

Biology. 1998;5(3):585-595.

53. Efimova YM, Haemers S, Wierczinski B, Norde W, Well AAv. Stability of

globular proteins in H2O and D2O. Biopolymers. 2007;85(3):264-273.

54. Keiderling TA, Wang B, Urbanova M, Pancoska P, Dukor RK. Empirical

studies of protein secondary structure by vibrational circular dichroism and

related techniques. Alpha-lactalbumin and lysozyme as examples. Faraday

Discussions. 1994(99):263-285.

55. Downer NW, Bruchman TJ, Hazzard JH. Infrared spectroscopic study of

photoreceptor membrane and purple membrane. Protein secondary structure

and hydrogen deuterium exchange. J Biol Chem. 1986;261(8):3640-3647.

56. Joosten R, te Beek T, Krieger E, Hekkelman M, Hooft R, Schneider R, et al.

A series of PDB related databases for everyday needs 2011 [cited 2013 12

Dec]. Available from: http://dx.doi.org/10.1093/nar/gkq1105.

57. Strasfeld DB, Ling YL, Shim S-H, Zanni MT. Tracking fiber formation in

human islet amyloid polypeptide with automated 2D-IR spectroscopy. Journal

of the American Chemical Society. 2008;130(21):6698-6699.

58. Pirovano W, Heringa J. Protein secondary structure prediction. Methods in

Molecular Biology. 2010;609:327-348.

59. Wang F, Liu AP, Ren FZ, Zhang XY, Stephanie C, Zhang LD, et al. FTIR

analysis of protein secondary structure in cheddar cheese during ripening.

Guang pu. 2011;31(7):1786-1789.

60. Higuchi II, Eguchi S. The influence function of principal component analysis

by self-organizing rule. Neural Computation. 1998;10(6):1435-1444.

61. Candolfi A, De Maesschalck R, Jouan-Rimbaud D, Hailey PA, Massart DL.

The influence of data pre-processing in the pattern recognition of excipients

near-infrared spectra. Elsevier. 1999. 115-132.

62. Ge Y-S, Jin C, Song Z, Zhang J-Q, Jiang F-L, Liu Y. Multi-spectroscopic

analysis and molecular modeling on the interaction of curcumin and its

derivatives with human serum albumin: A comparative study. Spectrochimica

Acta Part A: Molecular and Biomolecular Spectroscopy. 2014;124(0):265-

276.

63. Wold S. Chemometrics, why, what and where to next? Journal of

Pharmaceutical and Biomedical Analysis. 1991;9(8):589-596.

64. Brereton RG. Applied chemometrics for scientists: Wiley; 2007.

65. Jorgensen A. Clustering excipient near infrared spectra using different

chemometric methods: Dept. of Pharmacy, University of Helsinki; 2000 [cited

2014 12 Jan.]. Available from:

http://www.pharmtech.helsinki.fi/seminaarit/vanhat/annajorgensen.pdf.

66. Yao F, Coquery J, Le Cao KA. Independent principal component analysis for

biologically meaningful dimension reduction of large biological data sets.

BMC Bioinformatics. 2012;13:24.

67. Uceda Otero EP, Eliel GSN, Fonseca EJS, Hickmann JM, Rodarte R, Barreto

E, et al. Study of cancer cell lines with Fourier transform infrared

(FTIR)/vibrational absorption (va) spectroscopy. Current Physical Chemistry.

2013;3(1).

68. Rusak DA, Brown LM, Martin SD. Classification of vegetable oils by

principal component analysis of FTIR spectra. Journal of Chemical

Education. 2003;80(5):541.

69. Ami D, Doglia SM, Mereghetti P. Multivariate analysis for Fourier

transform infrared spectra of complex biological systems and processes.

InTech. 2013.

70. Hemmateenejad B, Akhond M, Samari F. A comparative study between PCR

and PLS in simultaneous spectrophotometric determination of diphenylamine,

aniline, and phenol: Effect of wavelength selection. Spectrochimica Acta Part

A: Molecular and Biomolecular Spectroscopy. 2007;67(3–4):958-965.

71. Phatak A. A user-friendly guide to multivariate calibration and

classification, T. Næs, T. Isaksson, T. Fearn, T. Davies: Chichester: NIR

Publications. Chemometrics and Intelligent Laboratory Systems.

2004;71(1):79-81.

72. Hori R, Sugiyama J. A combined FTIR microscopy and principal component

analysis on softwood cell walls. Carbohydrate Polymers. 2003;52(4):449-

453.

73. David CC, Jacobs DJ. Principal component analysis: A method for

determining the essential dynamics of proteins. Methods in Molecular

Biology. 2014;1084:193-226.

74. Shojaei M, Mirmohseni A, Farbodi M. Application of a quartz crystal

nanobalance and principal component analysis for the detection and

determination of histidine. Analytical and Bioanalytical Chemistry.

2008;391(8):2875-2880.

75. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics

and Intelligent Laboratory Systems. 1987;2(1–3):37-52.

76. Burstyn I. Principal component analysis is a powerful instrument in

occupational hygiene inquiries. The Annals of Occupational Hygiene.

2004;48(8):655-661.

77. Carey RN, Wold S, Westgard JO. Principal component analysis: An

alternative to "referee" methods in method comparison studies. Analytical

Chemistry. 1975;47(11):1824-1829.

78. Campins-Falco P, Blasco-Gomez F, Bosch-Reig F, Gallo-Martinez L.

Principal component analysis for the selection of variables in the application

of the h-point and generalised h-point standard addition method. Talanta.

2000;53(2):317-330.

79. Varnek A, Baskin I. Machine learning methods for property prediction in

chemoinformatics: Quo vadis? Journal of Chemical Information and

Modeling. 2012;52(6):1413-1437.

80. Bai L, Gao B, Tian S, Cheng Y, Chen Y, Tian GY, et al. A comparative study

of principal component analysis and independent component analysis in eddy

current pulsed thermography data processing. Review of Scientific

Instruments. 2013;84(10):104901.

81. Craig A, Merlic CA. Introduction to IR spectra: UCLA Chemistry and

Biochemistry; 1997 [cited 2010 20th Nov]. Available from:

http://www.chem.ucla.edu/~webspectra/irintro.html.

82. Delwiche SR, Reeves JB, 3rd. A graphical method to evaluate spectral

preprocessing in multivariate regression calibrations: Example with Savitzky-

Golay filters and partial least squares regression. Applied Spectroscopy.

2010;64(1):73-82.

83. Berg R, Hoefsloot H, Westerhuis J, Smilde A, Werf M. Centering, scaling,

and transformations: Improving the biological information content of

metabolomics data. BMC Genomics. 2006;7(1):1-15.

84. Noda I. Scaling techniques to enhance two-dimensional correlation spectra.

Journal of Molecular Structure. 2008;883–884(0):216-227.

85. Karpievitch Y, Dabney A, Smith R. Normalization and missing value

imputation for label-free LC-MS analysis. BMC Bioinformatics. 2012;13(Suppl

16):S5.

86. Lacher DA. Interpretation of laboratory results using multidimensional

scaling and principal component analysis. Annals of Clinical and Laboratory

Science. 1987;17(6):412-417.

87. McPhie P. Concentration-independent estimation of protein secondary

structure by circular dichroism: A comparison of methods. Anal Biochem.

2008;375(2):379-381.

88. Craig A, Cloarec O, Holmes E, Nicholson JK, Lindon JC. Scaling and

normalization effects in NMR spectroscopic metabonomic data sets.

Analytical Chemistry. 2006;78(7):2262-2267.

89. Martens H, Naes T. Multivariate calibration: Wiley; 1989.

90. Perez-Guaita D, Kuligowski J, Quintas G, Garrigues S, Guardia Mde L.

Modified locally weighted--partial least squares regression improving clinical

predictions from infrared spectra of human serum samples. Talanta.

2013;107:368-375.

91. Rinnan Å, Berg Fvd, Engelsen SB. Review of the most common pre-

processing techniques for near-infrared spectra. TrAC Trends in Analytical

Chemistry. 2009;28(10):1201-1222.

92. Kohler U, Luniak M. Data inspection using biplots. Stata Journal.

2005;5(2):208-223.

93. Ben-Gal I. Outlier detection. In: Maimon O, Rockach L, editors. Data mining

and knowledge discovery handbook: A complete guide for practitioners and

researchers: Kluwer Academic Publishers; 2005.1-12.

94. Aggarwal CC, Yu PS. Outlier detection for high dimensional data. SIGMOD

Rec. 2001;30(2):37-46.

95. Hawkins DM. Identification of outliers: Chapman and Hall; 1980.

96. Mark H, Workman J. Chemometrics in spectroscopy: Elsevier Science; 2010.

97. Ramachandran BR, Halpern AM. An experiment in electronic spectroscopy:

Information enhancement using second derivative analysis. Journal of

Chemical Education. 1998;75(2):234.



98. Wang Q, Zhang R. Statistical estimation in varying coefficient models with

surrogate data and validation sampling. Journal of Multivariate Analysis.

2009;100(10):2389-2405.

99. Faber NM. Estimating the uncertainty in estimates of root mean square error

of prediction: Application to determining the size of an adequate test set in

multivariate calibration. Chemometrics and Intelligent Laboratory Systems.

1999;49(1):79-89.

100. Al-Ghouti MA, Al-Degs YS, Amer M. Determination of motor gasoline

adulteration using FTIR spectroscopy and multivariate calibration. Talanta.

2008;76(5):1105-1112.

101. Bjelanovic M, Sorheim O, Slinde E, Puolanne E, Isaksson T, Egelandsdal B.

Determination of the myoglobin states in ground beef using non-invasive

reflectance spectrometry and multivariate regression analysis. Meat Science.

2013;95(3):451-457.

102. Macdonald JR, Johnson WC, Jr. Environmental features are important in

determining protein secondary structure. Protein Science. 2001;10(6):1172-

1177.

103. Yanagihara H, Tonda T, Matsumoto C. Bias correction of cross-validation

criterion based on Kullback–Leibler information under a general condition.

Journal of Multivariate Analysis. 2006;97(9):1965-1975.

104. Oberg KA, Ruysschaert JM, Goormaghtigh E. The optimization of protein

secondary structure determination with infrared and circular dichroism

spectra. European journal of biochemistry / FEBS. 2004;271(14):2937-2948.

105. Wold S, Martens H, Wold H. The multivariate calibration problem in

chemistry solved by the PLS method. In: Kågström B, Ruhe A, editors. Matrix

pencils. Lecture notes in mathematics. 973: Springer Berlin Heidelberg;

1983.286-293.

106. D'Antonio J, Murphy BM, Manning MC, Al-azzam WA. Comparability of

protein therapeutics: Quantitative comparison of second-derivative amide I

infrared spectra. Journal of Pharmaceutical Sciences. 2012;101(6):2025-

2033.

107. Nunn S, Nishikida K. Advanced ATR correction algorithm: Thermo Fisher

Scientific; [cited 2013 20th Sept.].

108. Milosevic M. Internal reflection and ATR spectroscopy: Wiley; 2012.

109. Kabsch W, Sander C. Dictionary of protein secondary structure: Pattern

recognition of hydrogen-bonded and geometrical features. Biopolymers.

1983;22(12):2577-2637.

110. Wang L, Zimmermann O, Hansmann UHE. Prediction of parallel and anti-

parallel beta-sheets using conditional random fields. In: Mohanty S,

Hansmann UHE, editors. From computational biophysics to systems biology.

34. Julich: John von Neumann Institute for Computing; 2006.201-204.

111. Shaopu Q, Jinhee K, Dalia A, Greg G. Effect of normalization on statistical

and biological interpretation of gene expression profiles. Frontiers in

Genetics. 2013;3.

112. Chen Y-C, Thennadil SN. Insights into information contained in

multiplicative scatter correction parameters and the potential for estimating

particle size from these parameters. Analytica Chimica Acta. 2012;746(0):37-

46.

113. Allison GG. Application of Fourier transform mid-infrared spectroscopy

(FTIR) for research into biomass feed stocks. In: Nikolic G, editor. Fourier

transforms - new analytical approaches and FTIR strategies. Aberystwyth

University, United Kingdom: Institute of Biology, Environmental and Rural

Science; 2011.71-88.

114. Spencer ND. Tailoring surfaces: Modifying surface composition and

structure for applications in tribology, biology and catalysis: World

Scientific; 2011.

115. Mudunkotuwa IA, Minshid AA, Grassian VH. ATR-FTIR spectroscopy as a

tool to probe surface adsorption on nanoparticles at the liquid-solid interface

in environmentally and biologically relevant media. The Analyst.

2014;139(5):870-881.

116. van den Berg RA, Hoefsloot HC, Westerhuis JA, Smilde AK, van der Werf

MJ. Centering, scaling, and transformations: Improving the biological

information content of metabolomics data. BMC genomics. 2006;7:142.

117. Dong A, Huang P, Caughey WS. Protein secondary structures in water from

second-derivative amide I infrared spectra. Biochemistry. 1990;29(13):3303-

3308.

118. Allison GG. Application of Fourier transform mid-infrared spectroscopy

(FTIR) for research into biomass feed stocks. In: Nikolic G, editor. Fourier

transforms - new analytical approaches and FTIR strategies. Aberystwyth

University, United Kingdom: Institute of Biology, Environmental and Rural

Science; 2011.71-88.

119. Bonnier F, Byrne HJ. Understanding the molecular information contained in

principal component analysis of vibrational spectra of biological systems. The

Analyst. 2012;137(2):322-332.

120. Bonnier F, Rubin S, Debelle L, Venteo L, Pluot M, Baehrel B, et al. FTIR

protein secondary structure analysis of human ascending aortic tissues.

Journal of Biophotonics. 2008;1(3):204-214.

121. Guerra M, Lopez MA, Esteves I, Zubillaga AL, Croquer A. Fourier-

transformed infrared spectroscopy: A tool to identify gross chemical changes



from healthy to yellow band disease tissues. Diseases of Aquatic Organisms.

2014;107(3):249-258.

122. Kubota T, Hanamura J, Kano K, Uno B. Application of principal component

analysis to the study of quantitative structure-activity relationships by means

of multiple regression analysis. Chemical & Pharmaceutical Bulletin.

1985;33(4):1488-1495.

123. Brereton RG. Chemometrics: Data analysis for the laboratory and chemical

plant: Wiley; 2003.

124. Razifar P, Engler H, Blomquist G, Ringheim A, Estrada S, Langstrom B, et

al. Principal component analysis with pre-normalization improves the signal-

to-noise ratio and image quality in positron emission tomography studies of

amyloid deposits in Alzheimer's disease. Physics in Medicine and Biology.

2009;54(11):3595-3612.

125. Bonnier F, Mehmood A, Knief P, Meade AD, Hornebeck W, Lambkin H, et

al. Erratum: In vitro analysis of immersed human tissues by Raman

microspectroscopy. Journal of Raman Spectroscopy. 2011;42(8):1711-1711.

126. Benar P, Gonçalves AR, Mandelli D, Ferreira MMC, Schuchardt U. Principal

component analysis of the hydroxymethylation of sugarcane lignin: A time-

depending study by FTIR. Journal of Wood Chemistry and Technology.

1999;19(1-2):151-165.

127. Navea S, Tauler R, Goormaghtigh E, de Juan A. Chemometric tools for

classification and elucidation of protein secondary structure from infrared and

circular dichroism spectroscopic measurements. Proteins. 2006;63(3):527-

541.

128. Capy P. A principal component analysis program. The Journal of heredity.

1985;76(5):401-402.

129. Forner-Cordero A, Levin O, Li Y, Swinnen SP. Principal component analysis

of complex multijoint coordinative movements. Biological Cybernetics.

2005;93(1):63-78.

130. Ma S, Dai Y. Principal component analysis based methods in bioinformatics

studies. Briefings in Bioinformatics. 2011;12(6):714-722.

131. Marti J, Santa-Cruz MC, Serra R, Valero O, Molina V, Hervas JP, et al.

Principal component and cluster analysis of morphological variables reveals

multiple discrete sub-phenotypes in weaver mouse mutants. Cerebellum.

2013;12(3):406-417.

132. Ringner M. What is principal component analysis? Nature Biotechnology.

2008;26(3):303-304.

133. Scholz M. Validation of nonlinear PCA. Neural Process Lett. 2012;36(1):21-

30.

134. Besley NA. Computing protein infrared spectroscopy with quantum

chemistry. Phil Trans R Soc A. 2007;365(1861):2799-2812.

135. Bell AJ, Sejnowski TJ. The “independent components” of natural scenes are

edge filters. Vision Research. 1997;37(23):3327-3338.

136. Dong A, Jones LS, Kerwin BA, Krishnan S, Carpenter JF. Secondary

structures of proteins adsorbed onto aluminum hydroxide: Infrared

spectroscopic analysis of proteins from low solution concentrations.

Analytical Biochemistry. 2006;351(2):282-289.

137. Pelton JT, McLean LR. Spectroscopic methods for analysis of protein

secondary structure. Anal Biochem. 2000;277(2):167-176.

138. Adochitei A, Drochioiu G. Rapid characterization of peptide secondary

structure by FT-IR spectroscopy. Rev Roum Chim. 2011;56(8):783-791.

139. Chirgadze YN, Nevskaya NA. Infrared spectra and resonance interaction of

amide-I vibration of the antiparallel-chain pleated sheet. Biopolymers.

1976;15(4):607-625.

140. Barth A. Infrared spectroscopy of proteins. Biochimica et Biophysica Acta

(BBA) - Bioenergetics. 2007;1767(9):1073-1101.

141. Babrah J, McCarthy K, Lush RJ, Rye AD, Bessant C, Stone N. Fourier

transform infrared spectroscopic studies of T-cell lymphoma, B-cell lymphoid

and myeloid leukaemia cell lines. The Analyst. 2009;134(4):763-768.

142. Iban A, Simon P, Wiwat N, Elmar HH, Alexander AG, Florian H, et al.

Structural analysis and mapping of individual protein complexes by infrared

nanospectroscopy. Nature Communications. 2013;4.

143. Kurgan LA, Zhang T, Zhang H, Shen S, Ruan J. Secondary structure-based

assignment of the protein structural classes. Amino Acids. 2008;35(3):551-

564.

144. Adzhubei AA, Eisenmenger F, Tumanyan VG, Zinke M, Brodzinski S,

Esipova NG. Approaching a complete classification of protein secondary

structure. Journal of Biomolecular Structure & Dynamics. 1987;5(3):689-

704.

145. Shi J-Y, Yiu S-M, Zhang Y-N, Chin F-L. Effective moment feature vectors

for protein domain structures. PloS One. 2013;8(12):1-11.

146. Bonifazzi C, Di Domenico G, Lodi E, Maino G, Tartari A. Principal

component analysis of large layer density in Compton scattering

measurements. Applied radiation and isotopes : including data,

instrumentation and methods for use in agriculture, Industry and Medicine.

2000;53(4-5):571-579.

147. Ergon R. Informative PLS score-loading plots for process understanding and

monitoring. Journal of Process Control. 2004;14(8):889-897.



148. Seierstad T, Roe K, Sitter B, Halgunset J, Flatmark K, Ree AH, et al.

Principal component analysis for the comparison of metabolic profiles from

human rectal cancer biopsies and colorectal xenografts using high-resolution

magic angle spinning 1H magnetic resonance spectroscopy. Molecular

Cancer. 2008;7:33.

149. Lourenco ND, Paixao F, Pinheiro HM, Sousa A. Use of spectra in the visible

and near-mid-ultraviolet range with principal component analysis and partial

least squares processing for monitoring of suspended solids in municipal

wastewater treatment plants. Applied Spectroscopy. 2010;64(9):1061-1067.

150. Wu TD, Hastie T, Schmidler SC, Brutlag DL. Regression analysis of multiple

protein structures. Proceedings of the second annual international conference

on Computational molecular biology; New York, New York, USA. 279132:

ACM; 1998.276-284.

151. Goormaghtigh E, Ruysschaert JM, Raussens V. Evaluation of the information

content in infrared spectra for protein secondary structure determination.

Biophys J. 2006;90(8):2946-2957.

152. Zhang L, Zhang LM, Li Y, Liu BP, Wang XF, Wang JD. Application and

improvement of partial-least-squares in Fourier transform infrared

spectroscopy. Guang pu. 2005;25(10):1610-1613.

153. Fadzlillah NA, Rohman A, Ismail A, Mustafa S, Khatib A. Application of

FTIR-ATR spectroscopy coupled with multivariate analysis for rapid

estimation of butter adulteration. Journal of Oleo Science. 2013;62(8):555-

562.

154. Warren G. FTIR analysis of protein structure [cited 2010 Apr 20]. Available

from:

http://www.chem.uwec.edu/Chem455_S05/Pages/Manuals/FTIR_of_proteins.pdf.

155. Ami D, Mereghetti P, Doglia SM. Multivariate analysis for Fourier transform

infrared spectra of complex biological systems and processes 2013 [cited

2014 Jan 12]. Available from:

http://www.intechopen.com/books/multivariate-analysis-in-management-engineering-and-the-sciences/multivariate-analysis-for-fourier-transform-infrared-spectra-of-complex-biological-systems-and-proce.

156. Kong L, Zhang L, Lv J. Accurate prediction of protein structural classes by

incorporating predicted secondary structure information into the general form

of Chou's pseudo amino acid composition. Journal of Theoretical Biology.

2014;344:12-18.

157. Navea S, Tauler R, de Juan A. Application of the local regression method

interval partial least-squares to the elucidation of protein secondary structure.

Anal Biochem. 2005;336(2):231-242.





APPENDIX

200

158. Depczynski U, Frost VJ, Molt K. Genetic algorithms applied to the selection

of factors in principal component regression. Analytica chimica acta.

2000;420(2):217-227.

159. Haaland DM, Jones HDT, Thomas EV. Multivariate classification of the

infrared spectra of cell and tissue samples. Applied Spectroscopy.

1997;51(3):340-345.

160. Selection and representativity of the calibration sample subset [cited 2013 22

Nov]. Available from:

http://www.vub.ac.be/fabi/multi/pcr/pages/menu/mll.html.

161. Kennard RW, Stone LA. Computer aided design of experiments.

Technometrics. 1969;11(1):137-148.

162. Perez-Guaita D, Ventura-Gayete J, Pérez-Rambla C, Sancho-Andreu M,

Garrigues S, Guardia M. Protein determination in serum and whole blood by

attenuated total reflectance infrared spectroscopy. Analytical and

Bioanalytical Chemistry. 2012;404(3):649-656.

163. Duleba AJ, Olive DL. Regression analysis and multivariate analysis.

Seminars in reproductive endocrinology. 1996;14(2):139-153.

164. Jackson D, Riley RD. A refined method for multivariate meta-analysis and

meta-regression. Statistics in Medicine. 2014;33(4):541-554.

165. Luo S, Huang CY, McClelland JF, Graves DJ. A study of protein secondary

structure by Fourier transform infrared/photoacoustic spectroscopy and its

application for recombinant proteins. Anal Biochem. 1994;216(1):67-76.

166. Cogdill RP, Anderson CA, Delgado M, Chisholm R, Bolton R, Herkert T, et

al. Process analytical technology case study: Part ii. Development and

validation of quantitative near-infrared calibrations in support of a process

analytical technology application for real-time release. AAPS PharmSciTech.

2005;6(2):E273-283.

167. Li H-D, Liang Y-Z, Xu Q-S, Cao D-S. Model population analysis for variable

selection. Journal of Chemometrics. 2010;24(7-8):418-423.

168. Dalmas B, Bannister WH. Prediction of protein secondary structure from

circular dichroism spectra: An attempt to solve the problem of the best-fitting

reference protein subsets. Anal Biochem. 1995;225(1):39-48.

169. Feng J, Wang Z, Li L, Li Z, Ni W. A nonlinearized multivariate dominant

factor-based partial least squares (pls) model for coal analysis by using laser-

induced breakdown spectroscopy. Applied Spectroscopy. 2013;67(3):291-

300.

170. Gomes Ade A, Alcaraz MR, Goicoechea HC, Araujo MC. The successive

projections algorithm for interval selection in trilinear partial least-squares

with residual bilinearization. Analytica chimica acta. 2014;811:13-22.




171. Passos CP, Cardoso SM, Barros AS, Silva CM, Coimbra MA. Application of

Fourier transform infrared spectroscopy and orthogonal projections to latent

structures/partial least squares regression for estimation of procyanidins

average degree of polymerisation. Analytica chimica acta. 2010;661(2):143-

149.

172. Xu D, Fan W, Lv H, Liang Y, Shan Y, Li G, et al. Simultaneous

determination of traces amounts of cadmium, zinc, and cobalt based on uv-vis

spectrometry combined with wavelength selection and partial least squares

regression. Spectrochimica acta Part A, Molecular and Biomolecular

spectroscopy. 2014;123:430-435.

173. Gallardo-Velázquez T, Osorio-Revilla G, Cárdenas-Bailón F, Beltrán-Orozco

MC. Determination of ternary solutions concentration in liquid–liquid

extraction by the use of attenuated total reflectance-Fourier transform infrared

spectroscopy and multivariate data analysis. The Canadian Journal of

Chemical Engineering. 2008;86(1):77-83.

174. Oberg K, Ruysschaert J, Goormaghtigh E. The optimization of protein

secondary structure determination with infrared and circular dichroism

spectra. European journal of biochemistry / FEBS. 2004;271:2937 - 2948.

175. Lees J, Miles A, Janes R, Wallace B. Novel methods for secondary structure

determination using low wavelength (vuv) circular dichroism spectroscopic

data. BMC Bioinformatics. 2006;7(1):507.

176. Janik L, Skjemstad J, Raven M. Characterization and analysis of soils using

mid-infrared partial least-squares .1. Correlations with xrf-determined major-

element composition. Soil Research. 1995;33(4):621-636.

177. Kinalwa MN, Blanch EW, Doig AJ. Accurate determination of protein

secondary structure content from Raman and Raman optical activity spectra.

Analytical Chemistry. 2010;82(15):6347-6349.

178. Huyghues-Despointes BP, Pace CN, Englander SW, Scholtz JM. Measuring

the conformational stability of a protein by hydrogen exchange. In: Murphy

K, editor. Protein structure, stability, and folding. Methods in Molecular

biology. 168: Humana Press; 2001.69-92.

179. Norgaard L. Interval partial least squares 2005 [cited 2011 12 June].

Available from: http://www.models.kvl.dk/source/clipls/index.asp.

180. Zou X, Zhao J, Mao H, Shi J, Yin X, Li Y. Genetic algorithm interval partial

least squares regression combined successive projections algorithm for

variable selection in near-infrared quantitative analysis of pigment in

cucumber leaves. Applied Spectroscopy. 2010;64(7):786-794.

181. Vedantham G, Sparks HG, Sane SU, Tzannis S, Przybycien TM. A holistic

approach for protein secondary structure estimation from infrared spectra in

H2O solutions. Anal Biochem. 2000;285(1):33-49.


182. Norgaard L, Saudland A, Wagner J, Nielsen JP, Munck L, Engelsen SB.

Interval partial least-squares regression (ipls): A comparative chemometric

study with an example from near-infrared spectroscopy. Applied

Spectroscopy. 2000;54(3):413-419.

183. Kuwajima K. The molten globule state of α-lactalbumin. FASEB.

1996;10(1):102-109.

184. Kim S, Baum J. Electrostatic interactions in the acid denaturation of α-

lactalbumin determined by NMR. Protein Science. 1998;7(9):1930-1938.

185. Ashton L, Blanch EW. pH-induced conformational transitions in

α-lactalbumin investigated with two-dimensional Raman correlation variance

plots and moving windows. Journal of Molecular Structure. 2010;974(1–

3):132-138.

186. Pieters S, Vander Heyden Y, Roger J-M, D’Hondt M, Hansen L, Palagos B,

et al. Raman spectroscopy and multivariate analysis for the rapid

discrimination between native-like and non-native states in freeze-dried

protein formulations. European Journal of Pharmaceutics and

Biopharmaceutics. 2013;85(2):263-271.

187. Sola RJ, Griebenow K. Glycosylation of therapeutic proteins: An effective

strategy to optimize efficacy. BioDrugs : Clinical Immunotherapeutics,

Biopharmaceuticals and Gene Therapy. 2010;24(1):9-21.

188. Owen SC, Doak AK, Wassam P, Shoichet MS, Shoichet BK. Colloidal

aggregation affects the efficacy of anticancer drugs in cell culture. ACS

Chemical Biology. 2012;7(8):1429-1435.

189. Frokjaer S, Otzen DE. Protein drug stability: A formulation challenge. Nat

Rev Drug Discov. 2005;4(4):298-306.

190. Bandyopadhyay S, Chakraborty S, Bagchi B. Exploration of the secondary

structure specific differential solvation dynamics between the native and

molten globule states of the protein hp-36. The Journal of Physical Chemistry

B. 2006;110(41):20629-20634.

191. Hauser K. Infrared spectroscopy of protein folding, misfolding and

aggregation. In: Roberts GK, editor. Encyclopedia of biophysics: Springer

Berlin Heidelberg; 2013.1089-1095.

192. Aguzzi A, O'Connor T. Protein aggregation diseases: Pathogenicity and

therapeutic perspectives. Nat Rev Drug Discov. 2010;9(3):237-248.

193. Schulman BA, Redfield C, Peng Z-Y, Dobson CM, Kim PS. Different

subdomains are most protected from hydrogen exchange in the molten

globule and native states of human α-lactalbumin. Journal of Molecular

Biology. 1995;253(5):651-657.

194. Wu LC, Peng Z-Y, Kim PS. Bipartite structure of the α-lactalbumin molten

globule. Nature Structural & Molecular Biology. 1995;2(4):281-286.



195. Poklar N, Lah J, Salobir M, Macek P, Vesnaver G. pH and

temperature-induced molten globule-like denatured states of equinatoxin II: A study by

UV-melting, DSC, far- and near-UV CD spectroscopy, and ANS fluorescence.

Biochemistry. 1997;36(47):14345-14352.

196. Navea S, de Juan A, Tauler R. Three-way data analysis applied to

multispectroscopic monitoring of protein folding. Analytica Chimica Acta.

2001;446(1–2):185-195.

197. Prajapati RS, Indu S, Varadarajan R. Identification and thermodynamic

characterization of molten globule states of periplasmic binding proteins.

Biochemistry. 2007;46(36):10339-10352.

198. Ptitsyn O. Protein folding: Hypotheses and experiments. Journal of Protein

Chemistry. 1987;6(4):273-293.

199. Stefani M. Protein misfolding and aggregation: New examples in medicine

and biology of the dark side of the protein world. Biochimica et Biophysica

Acta (BBA) - Molecular Basis of Disease. 2004;1739(1):5-25.

200. Aguzzi A, Calella AM. Prions: Protein Aggregation and Infectious Diseases

2009. 1105-1152.

201. Troullier A, Reinstädler D, Dupont Y, Naumann D, Forge V. Transient non-native

secondary structures during the refolding of α-lactalbumin detected by

infrared spectroscopy. Nature Structural & Molecular Biology. 2000;7(1):78-

86.

202. Horng JC, Demarest SJ, Raleigh DP. pH-dependent stability of the human

alpha-lactalbumin molten globule state: Contrasting roles of the 6 - 120

disulfide and the beta-subdomain at low and neutral pH. Proteins.

2003;52(2):193-202.

203. Chamberlain AK, Marqusee S. Molten globule unfolding monitored by

hydrogen exchange in urea. Biochemistry. 1998;37(7):1736-1742.

204. Yang J, Dunker AK, Powers JR, Clark S, Swanson BG. β-Lactoglobulin

molten globule induced by high pressure. Journal of Agricultural and Food

Chemistry. 2001;49(7):3236-3243.

205. Schulman BA, Kim PS, Dobson CM, Redfield C. A residue-specific NMR

view of the non-cooperative unfolding of a molten globule. Nature Structural

Biology. 1997;4(8):630-634.

206. Fearn T. Principal component discriminant analysis. Statistical applications in

genetics and molecular biology. Ecopapers 2008;7(2):1-6.

207. Gorga JC, Dong A, Manning MC, Woody RW, Caughey WS, Strominger JL.

Comparison of the secondary structures of human class i and class ii major

histocompatibility complex antigens by Fourier Transform Infrared and

Circular Dichroism spectroscopy. Proc Natl Acad Sci U S A.

1989;86(7):2321-2325.


208. Kaiden K, Matsui T, Tanaka S. A study of the amide III band by FT-IR

spectrometry of the secondary structure of albumin, myoglobin. Applied

Spectroscopy. 1987;41(2):180-184.

209. Anderle G, Mendelsohn R. Thermal denaturation of globular proteins. Fourier

transform-infrared studies of the amide III spectral region. Biophysical

Journal. 1987;52(1):69-74.

210. Fu F-N, DeOliveira DB, Trumble WR, Sarkar HK, Singh BR. Secondary

structure estimation of proteins using the amide III region of Fourier

transform infrared spectroscopy: Application to analyze calcium-binding-

induced structural changes in calsequestrin. Applied Spectroscopy.

1994;48(11):1432-1441.

211. Sakurai K, Goto Y. Principal component analysis of the pH-dependent

conformational transitions of bovine beta-lactoglobulin monitored by

heteronuclear NMR. Proc Natl Acad Sci U S A. 2007;104(39):15346-15351.

212. Maisuradze GG, Liwo A, Scheraga HA. Principal component analysis for

protein folding dynamics. J Mol Biol. 2009;385(1):312-329.

213. Nieuwoudt HH, Prior BA, Pretorius IS, Manley M, Bauer FF. Principal

component analysis applied to Fourier transform infrared spectroscopy for the

design of calibration sets for glycerol prediction models in wine and for the

detection and classification of outlier samples. Journal of Agricultural and

Food Chemistry. 2004;52(12):3726-3735.

214. Furlan PY, Scott SA, Peaslee MH. FTIR-ATR study of pH effects on egg

albumin secondary structure. Spectroscopy Letters. 2007;40(3):475-482.

215. Arai M, Kuwajima K. Rapid formation of a molten globule intermediate in

refolding of α-lactalbumin. Folding and Design. 1996;1(4):275-287.

216. Alexandrescu AT, Evans PA, Pitkeathly M, Baum J, Dobson CM. Structure

and dynamics of the acid-denatured molten globule state of α-lactalbumin:

A two-dimensional NMR study. Biochemistry. 1993;32(7):1707-1718.

217. Rösner HI, Redfield C. The human α-lactalbumin molten globule:

Comparison of structural preferences at pH 2 and pH 7. Journal of Molecular

Biology. 2009;394(2):351-362.

218. Bramaud C, Aimar P, Daufin G. Thermal isoelectric precipitation of α-

lactalbumin from a whey protein concentrate: Influence of protein–calcium

complexation. Biotechnology and Bioengineering. 1995;47(2):121-130.

219. Shanbhag VP, Johansson G, Ortin A. Ca2+ and pH dependence of

hydrophobicity of alpha-lactalbumin: Affinity partitioning of proteins in

aqueous two-phase systems containing poly(ethylene glycol) esters of fatty

acids. Biochemistry international. 1991;24(3):439-450.



220. Lefèvre T, Subirade M. Formation of intermolecular β-sheet structures: A

phenomenon relevant to protein film structure at oil–water interfaces of

emulsions. Journal of Colloid and Interface Science. 2003;263(1):59-67.

221. Nozaka M, Kuwajima K, Nitta K, Sugai S. Detection and characterization of

the intermediate on the folding pathway of human α-lactalbumin.

Biochemistry. 1978;17(18):3753-3758.

222. Schönbrunner N, Wey J, Engels J, Georg H, Kiefhaber T. Native-like β-

structure in a trifluoroethanol-induced partially folded state of the all-β-sheet

protein tendamistat. Journal of Molecular Biology. 1996;260(3):432-445.

223. Mizuguchi M, Hashimoto D, Sakurai M, Nitta K. Cold denaturation of alpha-

lactalbumin. Proteins. 2000;38(4):407-413.

224. Litwińczuk A, Ryu SR, Nafie LA, Lee JW, Kim HI, Jung YM, et al. The

transition from the native to the acid-state characterized by multi-

spectroscopy approach: Study for the holo-form of bovine α-lactalbumin.

Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics.

2014;1844(3):593-606.

225. Armbruster DA, Pry T. Limit of blank, limit of detection and limit of

quantitation. The Clinical biochemist Reviews / Australian Association of

Clinical Biochemists. 2008;29 Suppl 1:S49-52.

226. Sigurdsson EM. Amyloid proteins: Methods and protocols: Humana Press;

2005.

227. Jiskoot W, Crommelin D. Methods for structural analysis of protein

pharmaceuticals: American Assoc. of Pharm. Scientists; 2005.

228. Armbruster DA, Tillman MD, Hubbs LM. Limit of detection (lqd)/limit of

quantitation (loq): Comparison of the empirical and the statistical methods

exemplified with gc-ms assays of abused drugs. Clinical Chemistry.

1994;40(7):1233-1238.

229. Hughes J, Ayoko G, Collett S, Golding G. Rapid quantification of

methamphetamine: Using attenuated total reflectance Fourier transform

infrared spectroscopy (ATR-FTIR) and chemometrics. Plos One.

2013;8(7):1-7.

230. Francis G. Albumin and mammalian cell culture: Implications for

biotechnology applications. Cytotechnology. 2010;62(1):1-16.

231. Huang BX, Kim H-Y, Dass C. Probing three-dimensional structure of bovine

serum albumin by chemical cross-linking and mass spectrometry. Journal of

the American Society for Mass Spectrometry. 2004;15(8):1237-1247.

232. Carter DC, Ho JX. Structure of serum albumin. Advances in Protein

Chemistry. 1994;45:153-203.


233. Fries D. The early use of fibrinogen, prothrombin complex concentrate, and

recombinant-activated factor viia in massive bleeding. Transfusion. 2013;53

Suppl 1:91S-95S.

234. Jolles P. A possible physiological function of lysozyme. Biomedicine /

[publiee pour l'AAICIG]. 1976;25(8):275-276.

235. Proctor VA, Cunningham FE. The chemistry of lysozyme and its use as a

food preservative and a pharmaceutical. Critical Reviews in Food Science and

Nutrition. 1988;26(4):359-395.

236. Smith DI, Swamy PM, Heffernan MP. Off-label uses of biologics in

dermatology: Interferon and intravenous immunoglobulin (part 1 of 2). J Am

Acad Dermatol. 2007;56(1):e1-54.

237. Kontopidis G, Holt C, Sawyer L. Invited review: Beta-lactoglobulin: Binding

properties, structure, and function. Journal of Dairy Science. 2004;87(4):785-

796.

238. Lecker SH, Goldberg AL, Mitch WE. Protein degradation by the ubiquitin-

proteasome pathway in normal and disease states. Journal of the American

Society of Nephrology : JASN. 2006;17(7):1807-1819.

239. Rahmelow K, Hübner W. Secondary structure determination of proteins in

aqueous solution by infrared spectroscopy: A comparison of multivariate data

analysis methods. Analytical Biochemistry. 1996;241(1):5-13.

240. Tetin SY, Prendergast FG, Venyaminov SY. Accuracy of protein secondary

structure determination from circular dichroism spectra based on

immunoglobulin examples. Analytical Biochemistry. 2003;321(2):183-187.

241. Chen Y-H, Yang JT, Chau KH. Determination of the helix and β form of

proteins in aqueous solution by circular dichroism. Biochemistry.

1974;13(16):3350-3359.

242. Matsuo K, Yonehara R, Gekko K. Improved estimation of the secondary

structures of proteins by vacuum-ultraviolet circular dichroism spectroscopy.

Journal of biochemistry. 2005;138(1):79-88.

243. Louis-Jeune C, Andrade-Navarro MA, Perez-Iratxeta C. Prediction of protein

secondary structure from circular dichroism using theoretically derived

spectra. Proteins. 2011.

244. Lees JG, Janes RW. Combining sequence-based prediction methods and

circular dichroism and infrared spectroscopic data to improve protein

secondary structure determinations. BMC Bioinformatics. 2008;9:24.

245. Byler M, Susi H. Examination of the secondary structure of proteins by

deconvolved FTIR spectra. Biopolymers. 1986;25(3):469-487.

246. Brahms S, Brahms J. Determination of protein secondary structure in solution

by vacuum ultraviolet circular dichroism. J Mol Biol. 1980;138(2):149-178.



247. Borissova A, Khan S, Mahmud T, Roberts KJ, Andrews J, Dallin P, et al. In

situ measurement of solution concentration during the batch cooling

crystallization of l-glutamic acid using ATR-FTIR spectroscopy coupled with

chemometrics. Crystal Growth & Design. 2008;9(2):692-706.

248. Hawkins SA, Park B, Poole GH, Gottwald TR, Windham WR, Albano J, et

al. Comparison of FTIR spectra between huanglongbing (citrus greening) and

other citrus maladies. Journal of Agricultural and Food Chemistry.

2010;58(10):6007-6010.

249. Alexopoulos EC. Introduction to multivariate regression analysis.

Hippokratia. 2010;14(Suppl 1):23-28.

250. Alsberg BK, Kell DB, Goodacre R. Variable selection in discriminant partial

least-squares analysis. Analytical Chemistry. 1998;70(19):4126-4133.

251. Vitrai J, Czobor P, Simon G, Varga L, Marosfi S. Beyond principal

component analysis: Canonical component analysis for data reduction in

classification of eps. International journal of bio-medical computing. IEEE

1984;15(2):93-111.

252. Vidal R, Ma Y, Sastry S. Generalized principal component analysis (gpca).

IEEE transactions on pattern analysis and machine intelligence.

2005;27(12):1945-1959.

253. Muratore M. Raman spectroscopy and partial least squares analysis in

discrimination of peripheral cells affected by Huntington's disease. Analytica

Chimica Acta. 2013;793:1-10.

254. Kaltashov IA, Eyles SJ, Desiderio DM. Mass spectrometry in structural

biology and biophysics: Architecture, Dynamics, and Interaction of

Biomolecules: Wiley; 2012.

255. Williams RW. Protein secondary structure analysis using Raman amide i and

amide iii spectra. Methods in Enzymology. 1986;130:311-331.

256. Tan AC, Gilbert D, Deville Y. Multi-class protein fold classification using a

new ensemble machine learning approach. Genome informatics International

Conference on Genome Informatics. 2003;14:206-217.

257. Kinalwa M, Blanch EW, Doig AJ. Determination of protein fold class from

Raman or Raman optical activity spectra using random forests. Protein

Science. 2011;20(10):1668-1674.

258. Zira AN, Theocharis SE, Mitropoulos D, Migdalis V, Mikros E. 1h NMR

metabonomic analysis in renal cell carcinoma: A possible diagnostic tool.

Journal of proteome research. 2010;9(8):4038-4044.

259. Finan DA, Zisser H, Jovanovic L, Bevier WC, Seborg DE. Automatic

detection of stress states in type 1 diabetes subjects in ambulatory conditions.

Industrial & Engineering Chemistry Research. 2010;49(17):7843-7848.


260. Tarachiwin L, Masako O, Fukusaki E. Quality evaluation and prediction of

citrullus lanatus by 1H NMR-based metabolomics and multivariate analysis.

Journal of agricultural and food chemistry. 2008;56(14):5827-5835.

261. Antoniewicz MR, Stephanopoulos G, Kelleher JK. Evaluation of regression

models in metabolic physiology: Predicting fluxes from isotopic data without

knowledge of the pathway. Metabolomics. 2006;2(1):41-52.

262. Cramer R, III. Partial least squares (pls): Its strengths and limitations.

Perspectives in Drug Discovery and Design. 1993;1(2):269-278.

263. Zeng X-Q, Li G-Z. Incremental partial least squares analysis of big streaming

data. Pattern Recognition. 2014;47(11):3726-3735.

264. Kennedy DF, Crisma M, Toniolo C, Chapman D. Studies of peptides forming

3₁₀- and α-helixes and β-bend ribbon structures in organic solution

and in model biomembranes by Fourier transform infrared spectroscopy.

Biochemistry. 1991;30(26):6541-6548.

265. Singh BR, DeOliveira DB, Fu F-N, Fuller MP, editors. Fourier transform

infrared analysis of amide III bands of proteins for the secondary structure

estimation. Proc. SPIE. 1993;1890:46-55.

266. Kabsch W, Sander C. Dictionary of protein secondary structure: Pattern

recognition of hydrogen-bonded and geometrical features. Biopolymers.

1983;22:2577 - 2637.



Appendix A

List of proteins and the fractions of each secondary structure motif, as determined by

DSSP from the atomic coordinates contained in their PDB files. This list also contains

the SCOP classifications of the proteins. For the detailed PCA analysis, see Chapter 3

of this thesis.


Table A1. FTIR Protein Band Assignments

Amide I, spectral region 1700 – 1600 cm⁻¹: C=O stretching (80%), N-H bending (20%)

α-helix 1658 – 1654 cm⁻¹

β-sheet 1642 – 1624 cm⁻¹

3₁₀-helix 1660 cm⁻¹, 1645 cm⁻¹

Disordered 1648 – 1640 cm⁻¹

Turns 1672 cm⁻¹, 1680 – 1688 cm⁻¹

Amide II, spectral region 1600 – 1480 cm⁻¹: N-H bending (60%), C-N stretching (40%)

α-helix 1545 cm⁻¹

β-sheet 1530 – 1526 cm⁻¹

3₁₀-helix 1533 – 1531 cm⁻¹

Turns 1538 – 1536 cm⁻¹

Disordered 1432 – 1441 cm⁻¹

Amide III, spectral region 1200 – 1350 cm⁻¹: C-N stretching (40%), N-H bending (30%), C-C stretching (20%)

α-helix 1328 – 1289 cm⁻¹

β-sheet 1255 – 1224 cm⁻¹

Turns 1295 – 1270 cm⁻¹

Disordered 1288 – 1256 cm⁻¹

Amide I and amide II characteristic infrared bands are from Kennedy et al. (264).

Amide III characteristic infrared bands are from Singh et al. (265).
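The amide I assignments in Table A1 can be held in a small lookup structure. The following Python sketch is illustrative only: the names AMIDE_I_BANDS and assign_amide_i are hypothetical, single tabulated positions are stored as zero-width ranges, and overlapping ranges simply return more than one candidate structure.

```python
# Amide I band positions (cm^-1) copied from Table A1.
# Hypothetical helper names; single positions are stored as (x, x).
AMIDE_I_BANDS = {
    "alpha-helix": [(1654, 1658)],
    "beta-sheet": [(1624, 1642)],
    "3-10 helix": [(1660, 1660), (1645, 1645)],
    "disordered": [(1640, 1648)],
    "turns": [(1672, 1672), (1680, 1688)],
}


def assign_amide_i(wavenumber):
    """Return every structure whose tabulated amide I range contains the wavenumber."""
    matches = []
    for structure, ranges in AMIDE_I_BANDS.items():
        if any(low <= wavenumber <= high for low, high in ranges):
            matches.append(structure)
    return matches


print(assign_amide_i(1656))  # ['alpha-helix']
print(assign_amide_i(1641))  # ['beta-sheet', 'disordered']
```

The second call shows why band position alone is ambiguous in the region where the β-sheet and disordered ranges overlap.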



Table A2. Secondary structure fractions of 90 proteins obtained from DSSP (266).

These proteins were used as the reference set for the multivariate calibration of FTIR

spectra of proteins in H2O buffer.

Note: the secondary structure proportions should add up to 1 (±0.01); parallel β,

antiparallel β, and mixed β sum to β-sheet and should not be added again in the calculation.

PDBNum ProteinName Label α β Parallel-β Antiparallel-β Mixed-β 3₁₀-helix Turns Other

3V03 albumin (bovine) alpha 0.70 0.00 0.00 0.00 0.00 0.04 0.09 0.16

1A06 albumin (human) alpha 0.69 0.00 0.00 0.00 0.00 0.02 0.09 0.19

1E7H albumin (sheep) alpha 0.71 0.00 0.00 0.00 0.00 0.02 0.09 0.17

6ADH alcohol dehydrogenase alpha/beta 0.16 0.19 0.07 0.11 0.01 0.02 0.18 0.45

1QHO alpha amylase_type2 (Bacillus species) beta 0.20 0.29 0.06 0.20 0.01 0.03 0.14 0.34

1RQW alpha casein (bovine milk) alpha/beta 0.36 0.15 0.00 0.13 0.00 0.08 0.10 0.31

1HFZ alpha_lactalbumin (human) alpha 0.31 0.07 0.00 0.07 0.00 0.14 0.18 0.31

3CWL alpha-1-antitrypsin alpha/beta 0.28 0.31 0.03 0.21 0.06 0.02 0.10 0.28

1LF6 amyloglucanase(Asp. Niger) alpha 0.32 0.21 0.00 0.20 0.00 0.03 0.12 0.33

1YJ0 annexinV/without calcium alpha 0.68 0.00 0.00 0.00 0.00 0.10 0.07 0.25

4GCR alpha-crystallin (bovine eye lens) beta 0.03 0.46 0.00 0.38 0.00 0.06 0.08 0.37

3LDJ aprotinin (bovine lung) alpha/beta 0.14 0.25 0.00 0.25 0.00 0.07 0.07 0.46

1SLF avidin beta 0.03 0.55 0.00 0.53 0.00 0.05 0.05 0.32

2BRD bacteriorhodopsin in purple membrane alpha 0.77 0.00 0.00 0.00 0.00 0.00 0.02 0.21

3IF7 calmodulin alpha 0.59 0.06 0.00 0.06 0.00 0.00 0.07 0.28

1V94 carbonic anhydrase (bovine) alpha/beta 0.08 0.30 0.05 0.19 0.05 0.08 0.11 0.42

3M1K carbonic anhydrase (human) alpha/beta 0.08 0.29 0.04 0.19 0.05 0.08 0.11 0.44

3NNE choline oxidase alpha/beta 0.21 0.21 0.06 0.12 0.01 0.05 0.14 0.40

1YPH chymotrypsin (bovine) beta 0.00 0.35 0.00 0.32 0.00 0.02 0.18 0.45

1ATH cleaved form of anti-thrombin beta 0.27 0.32 0.03 0.25 0.03 0.02 0.11 0.28

1BKV collagen(calf skin) designed p 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

1OVT conalbumin.dpt(chicken egg white) alpha/beta 0.28 0.18 0.06 0.11 0.00 0.05 0.16 0.34

3CNA concanavalin(jack bean) beta 0.00 0.41 0.00 0.41 0.00 0.00 0.09 0.50

1HRC cytochrome (Horse heart) alpha 0.41 0.00 0.00 0.00 0.00 0.00 0.17 0.41

1MPX d_amino acid dehydrogenase beta 0.18 0.27 0.06 0.18 0.01 0.05 0.12 0.38

2GL9 d-glucosamin alpha 0.44 0.15 0.02 0.10 0.02 0.00 0.12 0.28

2OII elastin beta 0.00 0.31 0.00 0.29 0.00 0.00 0.08 0.61

1W17 esterase (Bacillus subtilis) alpha/beta 0.40 0.15 0.13 0.02 0.00 0.03 0.10 0.32

1T1C esterase (rhizopus oryzae) alpha/beta 0.25 0.28 0.00 0.27 0.00 0.02 0.14 0.31

3F32 ferritin (Horse spleen) alpha 0.75 0.00 0.00 0.00 0.00 0.00 0.07 0.19

1DLP fetuin (calf serum) beta 0.00 0.32 0.00 0.31 0.00 0.00 0.09 0.58

3GHG fibrinogen(bovine plasma) alpha 0.70 0.02 0.00 0.01 0.01 0.02 0.06 0.21

2EIB galactose oxidase beta 0.01 0.38 0.01 0.34 0.01 0.02 0.13 0.46

1GAL glucose oxidase (Aspergillus niger) alpha/beta 0.27 0.20 0.05 0.12 0.01 0.07 0.12 0.33

2QSP Hemoglobin bovine alpha 0.67 0.00 0.00 0.00 0.00 0.09 0.11 0.14

1BG3 hexokinase alpha/beta 0.44 0.15 0.05 0.08 0.02 0.03 0.11 0.27

3AGV IgG (Human) beta 0.08 0.37 0.00 0.37 0.00 0.04 0.08 0.42

1IGT IgG (bovine) beta 0.06 0.42 0.00 0.42 0.00 0.02 0.08 0.42

2OLY insulin (human) alpha 0.69 0.12 0.00 0.12 0.00 0.12 0.00 0.08

1W2T invertase(from baker's yeast) beta 0.00 0.57 0.01 0.49 0.00 0.03 0.10 0.30

3BLX isocitric dehydrogenase alpha 0.36 0.22 0.07 0.11 0.02 0.05 0.11 0.26

3H3F l_lactic (rabbit muscle) alpha/beta 0.40 0.21 0.08 0.11 0.00 0.04 0.11 0.24

1EZ4 lactic dehydrogenase alpha/beta 0.45 0.20 0.08 0.11 0.00 0.07 0.06 0.22

1IOA lectin (P. vulgaris) beta 0.02 0.43 0.00 0.43 0.00 0.04 0.14 0.37

1HNA lglutathion (reduced) alpha 0.51 0.09 0.04 0.04 0.01 0.03 0.12 0.26

1CRL lipase (candida rugosa) alpha/beta 0.33 0.14 0.06 0.05 0.01 0.05 0.13 0.36

10IL lipase (wheat) alpha/beta 0.36 0.18 0.09 0.07 0.00 0.02 0.16 0.28

2LYZ lysozyme (chicken egg) alpha 0.32 0.06 0.00 0.06 0.00 0.10 0.24 0.28

2MLT melittin pH 6 alpha 0.88 0.00 0.00 0.00 0.00 0.00 0.04 0.08


1NZ2 myoglobin(horse muscle) alpha 0.74 0.00 0.00 0.00 0.00 0.00 0.12 0.14

3DLW native anti-chymotrypsin alpha/beta 0.29 0.36 0.02 0.31 0.02 0.02 0.07 0.27

1T1F native form of anti-thrombin alpha/beta 0.26 0.27 0.03 0.21 0.03 0.04 0.10 0.33

2OAY native form of c1 inhibitor alpha/beta 0.27 0.32 0.01 0.28 0.01 0.02 0.09 0.30

1OVA native ovalbumin alpha/beta 0.30 0.32 0.03 0.22 0.05 0.03 0.13 0.23

9PAP papain (papaya latex) alpha/beta 0.23 0.17 0.01 0.15 0.01 0.03 0.11 0.46

1T5A pyruvate kinase alpha 0.36 0.20 0.10 0.08 0.01 0.04 0.10 0.30

1YX9 pepsin(pepsin gastric mucosa) beta 0.10 0.42 0.04 0.32 0.03 0.04 0.13 0.31

1HCH peroxidase (horseradish) alpha 0.44 0.02 0.00 0.02 0.00 0.06 0.17 0.31

1LSH phosvitin(from egg white) alpha/beta 0.29 0.37 0.00 0.35 0.00 0.02 0.10 0.23

2AXT photosystem II reaction centre alpha 0.52 0.01 0.01 0.00 0.00 0.04 0.10 0.33

3ENJ pig citrate synthase alpha 0.59 0.03 0.00 0.03 0.00 0.03 0.10 0.24

2ZFG porin beta 0.04 0.59 0.00 0.56 0.00 0.00 0.14 0.24

3AJ8 proteinase (from bovine pancreas) alpha/beta 0.24 0.23 0.14 0.08 0.01 0.03 0.16 0.34

1QOV reaction centre R. Sphae. alpha 0.18 0.22 0.00 0.18 0.02 0.06 0.12 0.42

2I37 rhodopsin bleached alpha 0.55 0.04 0.00 0.04 0.00 0.04 0.12 0.25

1HZX rhodopsin unbleached alpha 0.59 0.04 0.00 0.04 0.00 0.03 0.10 0.25

1RBX ribonuclease A (bovine pancreas) alpha/beta 0.20 0.30 0.00 0.28 0.01 0.02 0.06 0.41

1RBB ribonuclease B (bovine pancreas) alpha/beta 0.18 0.33 0.00 0.31 0.01 0.03 0.15 0.31

3DJV ribonuclease S (bovine pancreas) alpha/beta 0.19 0.33 0.00 0.31 0.01 0.05 0.11 0.32

2ACH sact1 DIFF beta 0.31 0.34 0.02 0.29 0.02 0.02 0.09 0.23

1U06 spectrin beta 0.00 0.45 0.00 0.42 0.02 0.05 0.11 0.38

1AS4 split anti-chymotrypsin alpha/beta 0.32 0.37 0.02 0.32 0.02 0.02 0.09 0.20

1LQ8 split form of c1 inhibitor beta 0.28 0.35 0.02 0.30 0.02 0.04 0.08 0.26

1H3E Split form ovalbumin alpha/beta 0.44 0.14 0.00 0.00 0.00 0.00 0.00 0.42

1SSN staphylokinase-B wild type alpha/beta 0.09 0.28 0.07 0.14 0.06 0.02 0.08 0.53

1GNS subtilisin alpha/beta 0.32 0.18 0.13 0.04 0.00 0.00 0.16 0.34

3CEI superoxide dismutase alpha 0.51 0.10 0.00 0.09 0.00 0.03 0.13 0.23

1RQW thaumatin (t.danielli) beta 0.11 0.36 0.00 0.27 0.05 0.01 0.12 0.40

3N21 thermolysin (bacillus t rokko) alpha 0.37 0.16 0.03 0.11 0.03 0.04 0.10 0.32

1QGD transketolase (E.coli) alpha 0.42 0.14 0.11 0.02 0.00 0.05 0.13 0.26

3A7T trypsin beta 0.07 0.32 0.00 0.30 0.00 0.03 0.15 0.43

3A7T trypsin (bovine) beta 0.07 0.32 0.00 0.30 0.00 0.03 0.15 0.43

1UHB trypsin (hog pancreas) beta 0.00 0.34 0.00 0.33 0.00 0.02 0.12 0.51

2TGA trypsinogen(bovine pancreas) beta 0.07 0.32 0.00 0.30 0.00 0.03 0.14 0.44

2J9Z tryptophan Synthase alpha/beta 0.32 0.06 0.00 0.06 0.00 0.10 0.24 0.28

1BBI trypsin-chymotrypsin small protein 0.00 0.25 0.00 0.25 0.00 0.00 0.06 0.69

2ZCB ubiquitin (bovine) beta 0.16 0.31 0.03 0.23 0.05 0.08 0.16 0.28

4H9M urease (Canavalia ensiformis) alpha/beta 0.26 0.22 0.07 0.13 0.01 0.04 0.15 0.34

1BEB β-lactoglobulin beta 0.10 0.42 0.00 0.38 0.00 0.06 0.12 0.29
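The note to Table A2 describes a consistency rule that is easy to check programmatically. The sketch below is a minimal illustration, not part of the thesis workflow: the DsspRow type and both function names are hypothetical, the small epsilon guarding against floating-point rounding is an assumption, and applying the same ±0.01 tolerance to the β subtypes is also an assumption.

```python
from typing import NamedTuple


class DsspRow(NamedTuple):
    """One row of Table A2 (hypothetical container; columns in table order)."""
    pdb_id: str
    alpha: float
    beta: float
    beta_parallel: float
    beta_antiparallel: float
    beta_mixed: float
    helix_310: float
    turns: float
    other: float


EPS = 1e-9  # assumption: absorbs binary floating-point rounding error


def fractions_sum_to_one(row, tol=0.01):
    """Check alpha + beta + 3-10 helix + turns + other = 1 within tol.

    The beta subtypes are excluded because they are already counted in beta.
    """
    total = row.alpha + row.beta + row.helix_310 + row.turns + row.other
    return abs(total - 1.0) <= tol + EPS


def beta_subtypes_match(row, tol=0.01):
    """Check parallel + antiparallel + mixed beta = beta within tol (assumed tolerance)."""
    subtotal = row.beta_parallel + row.beta_antiparallel + row.beta_mixed
    return abs(subtotal - row.beta) <= tol + EPS


# Two rows copied from Table A2.
rows = [
    DsspRow("3V03", 0.70, 0.00, 0.00, 0.00, 0.00, 0.04, 0.09, 0.16),
    DsspRow("6ADH", 0.16, 0.19, 0.07, 0.11, 0.01, 0.02, 0.18, 0.45),
]
for row in rows:
    print(row.pdb_id, fractions_sum_to_one(row), beta_subtypes_match(row))
```

Running the same check over every row is a quick way to flag entries whose tabulated fractions fall outside the stated tolerance.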



Appendix B

This appendix contains the calibration and cross-validation results of the PLS analysis for each secondary structure motif. Each protein was prepared in H2O and measured using Fourier transform infrared (FTIR) spectroscopy. Some proteins were also prepared in D2O, and their results are also tabulated in this appendix. For a detailed description of the PLS regression method, see Chapter 4.
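The residual columns in the tables of this appendix follow the convention residual = observed − predicted. The following is a minimal Python sketch of that arithmetic, using the first three α-helix rows of Table B1. The function names and the root-mean-square summary are illustrative assumptions, not part of the analysis reported in the thesis.

```python
import math

# (observed, predicted) alpha-helix fractions copied from the first
# three rows of Table B1 (bovine, human and sheep albumin).
rows = [(0.70, 0.73), (0.69, 0.67), (0.71, 0.55)]


def residuals(pairs):
    """Residual = observed - predicted, rounded to two decimals as in the tables."""
    return [round(obs - pred, 2) for obs, pred in pairs]


def rmse(pairs):
    """Root-mean-square error over the (observed, predicted) pairs."""
    return math.sqrt(sum((obs - pred) ** 2 for obs, pred in pairs) / len(pairs))


print(residuals(rows))
print(round(rmse(rows), 3))
```

A single summary such as the root-mean-square error makes it easier to compare motifs or data sets than reading the residual columns row by row.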

APPENDIX


Table B1. FTIR/PLS predicted secondary structure for each individual sample in the calibration set of 60 proteins in H2O solution (α-helix and β-sheet).

α-Helix β-Sheet

PDBID Protein Names Observed Predicted Residual Observed Predicted Residual

3V03 albumin (bovine) 0.70 0.73 -0.03 0.00 -0.01 0.01

1A06 albumin (human) 0.69 0.67 0.02 0.00 0.00 0.00

1E7H albumin (sheep) 0.71 0.55 0.16 0.00 0.03 -0.03

6ADH alcohol dehydrogenase 0.16 0.15 0.01 0.19 0.19 0.00

1QHO alpha amylase_type2 (bacillus species) 0.20 0.21 -0.01 0.29 0.28 0.01

1DAY alpha casein (bovine milk) 0.36 0.37 -0.01 0.15 0.15 0.00

1HFZ alpha_lactalbumin 0.31 0.33 -0.01 0.07 0.10 -0.03

4GCR alpha-crystallin (bovine eye lens) 0.03 0.03 0.00 0.46 0.44 0.02

3LDJ aprotinin (bovine lung) 0.14 0.23 -0.09 0.25 0.22 0.03

1SLF avidin 0.03 0.02 0.01 0.55 0.48 0.07

1BEB β-lactoglobulin 0.10 0.08 0.02 0.42 0.45 -0.03

1V9E carbonic anhydrase(bovine) 0.08 0.07 0.01 0.30 0.34 -0.05

3M1K carbonic anhydrase (human) 0.08 0.10 -0.02 0.29 0.35 -0.06

3NNE choline oxidase 0.21 0.18 0.03 0.21 0.25 -0.04

1YPH chymotrypsin (bovine) 0.00 0.03 -0.03 0.35 0.38 -0.03

1BKV collagen(calf skin) 0.00 0.06 -0.06 0.00 0.06 -0.06

1OVT conalbumin.dpt(chicken egg white) 0.28 0.28 0.00 0.18 0.16 0.02

3CNA concanavalin(jack bean) 0.00 -0.04 0.04 0.41 0.44 -0.03

1MPX d_amino acid dehydrogenase 0.18 0.12 0.06 0.27 0.20 0.07

2OII elastin 0.00 0.00 0.00 0.31 0.33 -0.02

1W17 esterase (bacillus subtilis) 0.40 0.42 -0.02 0.15 0.14 0.01

1T1C esterase (rhizopus oryzae) 0.25 0.24 0.01 0.28 0.29 -0.01

3F32 ferritin (horse spleen) 0.75 0.70 0.05 0.00 0.02 -0.02

1DLP fetuin (calf serum) 0.00 0.01 -0.01 0.32 0.34 -0.02

3GHG fibrinogen(bovine plasma) 0.70 0.71 -0.01 0.02 0.00 0.02

2EIB galactose oxidase 0.01 0.01 0.00 0.38 0.38 0.00

1GAL glucose oxidase (aspergillus niger) 0.27 0.38 -0.11 0.20 0.13 0.07

2QSP hemoglobin (bovine) 0.67 0.67 0.00 0.00 0.03 -0.03

1BG3 hexokinase 0.44 0.45 -0.01 0.15 0.17 -0.02

1IGT IgG (bovine) 0.06 0.04 0.04 0.42 0.35 0.02

3AGV IgG (human) 0.08 0.05 0.03 0.37 0.39 -0.02

2OLY insulin (human) 0.69 0.67 0.02 0.12 0.08 0.04

1W2T invertase(from bakers yeast) 0.00 -0.01 0.01 0.57 0.56 0.01

3H3F l_lactic (rabbit muscle) 0.40 0.39 0.01 0.21 0.22 -0.01

1LDM lactic dehydrogenase 0.45 0.45 0.00 0.20 0.24 -0.04

1IOA lectin ( p.vulgaris) 0.02 0.00 0.02 0.43 0.37 0.06

1HNA l-glutathion (reduced) 0.51 0.47 0.04 0.09 0.10 -0.01

1CRL lipase (candida rugosa) 0.33 0.34 -0.01 0.14 0.13 0.01

2OAY native form of c1 inhibitor 0.27 0.34 -0.07 0.32 0.22 0.10

1OVA native ovalbumin 0.30 0.26 0.04 0.32 0.26 0.06

9PAP papain (papaya latex) 0.23 0.21 0.02 0.17 0.14 0.03

1YX9 pepsin(pepsin gastric mucosa) 0.10 0.13 -0.03 0.42 0.42 0.00

1LSH phosvitin(from egg white) 0.29 0.27 0.02 0.37 0.37 0.00

2ZFG porin 0.04 0.09 -0.05 0.59 0.57 0.02

3AJ8 proteinase (from bovine pancreas) 0.24 0.28 -0.04 0.23 0.24 -0.01

1HZX rhodopsin unbleached 0.59 0.63 -0.04 0.04 0.05 -0.01

1RBX ribonuclease A (bovine pancreas) 0.20 0.18 0.02 0.30 0.32 -0.02

3DJV ribonuclease S (from bovine pancreas) 0.19 0.19 0.00 0.33 0.36 -0.03

1RBB ribonuclease B (bovine pancreas) 0.18 0.15 0.03 0.33 0.30 0.03

1SSN staphylokinase-B wild type 0.09 0.09 0.00 0.28 0.30 -0.02

1GNS subtilisin 0.32 0.37 -0.05 0.18 0.17 0.01

3CEI superoxide dismutase 0.51 0.45 0.06 0.10 0.16 -0.06

1RQW thaumatin (t.danielli) 0.11 0.02 0.09 0.36 0.37 -0.01

1QGD transketolase (E.coli) 0.42 0.42 0.00 0.14 0.12 0.02

3A7T trypsin (bovine) 0.07 0.05 0.02 0.32 0.32 0.00

2TGA trypsinogen(bovine pancreas) 0.07 0.11 -0.04 0.32 0.29 0.03

2J9Z tryptophan Synthase 0.32 0.36 -0.04 0.06 0.07 -0.01

1BBI trypsin-chymotrypsin 0.00 -0.02 0.02 0.25 0.30 -0.05

2ZCB ubiquitin (bovine) 0.16 0.25 -0.09 0.31 0.30 0.01

4H9M urease (Canavalia ensiformis) 0.26 0.28 -0.02 0.22 0.22 0.00



Table B2. Continuation of FTIR/PLS calibration results for parallel, anti-parallel and mixed β-sheet.

Parallel β-Sheet Anti-Parallel β-Sheet Mixed β-Sheet

PDBID Protein Names Observed Predicted Residual Observed Predicted Residual Observed Predicted Residual

3V03 albumin (bovine) 0.00 0.00 0.00 0.00 -0.01 0.01 0.00 0.00 0.00

1A06 albumin (human) 0.00 0.00 0.00 0.00 -0.03 0.03 0.00 0.01 -0.01

1E7H albumin (sheep) 0.00 0.01 -0.01 0.00 0.03 -0.03 0.00 -0.01 0.01

6ADH alcohol dehydrogenase 0.07 0.07 0.00 0.11 0.10 0.01 0.01 0.01 0.00

1QHO alpha amylase_type2 (bacillus species) 0.06 0.06 0.00 0.20 0.20 0.00 0.01 0.01 0.00

1DAY alpha casein (bovine milk) 0.00 -0.01 0.01 0.13 0.12 0.01 0.00 0.00 0.00

1HFZ alpha_lactalbumin 0.00 0.00 0.00 0.07 0.11 -0.04 0.00 0.00 0.00

4GCR alpha-crystallin (bovine eye lens) 0.00 -0.01 0.01 0.38 0.39 -0.01 0.00 0.00 0.00

3LDJ aprotinin (bovine lung) 0.00 -0.01 0.01 0.25 0.25 0.00 0.00 0.00 0.00

1SLF avidin 0.00 0.01 -0.01 0.53 0.45 0.08 0.00 0.00 0.00

1BEB β-lactoglobulin 0.00 0.00 0.00 0.38 0.40 -0.02 0.00 0.00 0.00

1V9E carbonic anhydrase(bovine) 0.05 0.04 0.00 0.19 0.27 -0.08 0.05 0.04 0.01

3M1K carbonic anhydrase (human) 0.04 0.03 0.01 0.19 0.29 -0.10 0.05 0.05 0.00

3NNE choline oxidase 0.06 0.08 -0.02 0.12 0.17 -0.05 0.01 0.01 0.00

1YPH chymotrypsin (bovine) 0.00 0.01 -0.01 0.32 0.30 0.02 0.00 0.00 0.00

1BKV collagen(calf skin) 0.00 0.00 0.00 0.00 0.05 -0.05 0.00 0.00 0.00

1OVT conalbumin.dpt(chicken egg white) 0.06 0.06 0.00 0.11 0.10 0.01 0.00 0.00 0.00

3CNA concanavalin(jack bean) 0.00 0.00 0.00 0.41 0.43 -0.02 0.00 0.00 0.00

1MPX d_amino acid dehydrogenase 0.06 0.06 0.00 0.18 0.13 0.05 0.01 0.01 0.00

2OII elastin 0.00 0.00 0.00 0.29 0.30 -0.01 0.00 0.00 0.00

1W17 esterase (bacillus subtilis) 0.13 0.13 0.00 0.02 0.04 -0.02 0.00 0.00 0.00

1T1C esterase (rhizopus oryzae) 0.00 0.00 0.00 0.27 0.27 0.00 0.00 0.01 -0.01

3F32 ferritin (horse spleen) 0.00 0.02 -0.02 0.00 0.00 0.00 0.00 0.00 0.00

1DLP fetuin (calf serum) 0.00 0.01 -0.01 0.31 0.33 -0.02 0.00 0.00 0.00

3GHG fibrinogen(bovine plasma) 0.00 -0.01 0.01 0.01 0.00 0.01 0.01 0.01 0.00

2EIB galactose oxidase 0.01 0.01 0.00 0.34 0.31 0.03 0.01 0.01 0.00

1GAL glucose oxidase (aspergillus niger) 0.05 0.05 0.00 0.12 0.09 0.03 0.01 0.01 0.00

2QSP hemoglobin (bovine) 0.00 0.01 -0.01 0.00 -0.01 0.01 0.00 0.01 -0.01

1BG3 hexokinase 0.05 0.05 0.00 0.08 0.11 -0.03 0.02 0.02 0.00

1IGT IgG (bovine) 0.00 0.00 0.00 0.42 0.34 0.08 0.00 0.00 0.00

3AGV IgG (human) 0.00 0.00 0.00 0.37 0.38 -0.01 0.00 0.00 0.00

2OLY insulin (human) 0.00 0.00 0.00 0.12 0.08 0.04 0.00 0.00 0.00

1W2T invertase(from bakers yeast) 0.01 0.01 0.00 0.49 0.50 -0.01 0.00 0.00 0.00

3H3F l_lactic (rabbit muscle) 0.08 0.08 0.00 0.11 0.17 -0.06 0.00 0.00 0.00

1LDM lactic dehydrogenase 0.08 0.08 0.00 0.11 0.14 -0.03 0.00 0.00 0.00

1IOA lectin ( p.vulgaris) 0.00 0.00 0.00 0.43 0.34 0.09 0.00 0.01 -0.01

1HNA lglutathion (reduced) 0.04 0.05 -0.01 0.04 0.04 0.00 0.01 0.02 -0.01

1CRL lipaseA (candida rugosa) 0.06 0.06 0.00 0.05 0.05 0.00 0.01 0.01 0.00

2OAY native form of c1 inhibitor 0.01 0.00 0.01 0.28 0.23 0.05 0.01 0.01 0.00

1OVA native ovalbumin 0.03 0.05 -0.02 0.22 0.14 0.08 0.05 0.05 0.00

9PAP papain (papaya latex) 0.01 0.02 -0.01 0.15 0.12 0.03 0.01 0.01 0.00

1YX9 pepsin(pepsin gastric mucosa) 0.04 0.03 0.01 0.32 0.34 -0.02 0.03 0.03 0.00

1LSH phosvitin(from egg white) 0.00 0.00 0.00 0.35 0.34 0.01 0.00 0.00 0.00

2ZFG porin 0.00 0.00 0.00 0.56 0.55 0.01 0.00 0.00 0.00

3AJ8 proteinase (from bovine pancreas) 0.14 0.14 0.00 0.08 0.09 -0.01 0.01 0.01 0.00

1HZX rhodopsin unbleached 0.00 -0.01 0.01 0.04 0.05 -0.01 0.00 -0.01 0.01

1RBX ribonuclease A (from bovine pancreas) 0.00 0.00 0.00 0.28 0.29 -0.01 0.01 0.01 0.00

3DJV ribonuclease S (from bovine pancreas) 0.00 0.00 0.00 0.31 0.33 -0.02 0.01 0.00 0.01

1RBB ribonuclease B (bovine pancreas) 0.00 0.01 -0.01 0.31 0.27 0.04 0.01 0.02 -0.01

1SSN staphylokinase-B wild type 0.07 0.06 0.01 0.14 0.14 0.00 0.06 0.06 0.00

1GNS subtilisin 0.13 0.11 0.02 0.04 0.05 -0.01 0.00 0.00 0.00

3CEI superoxide dismutase 0.00 0.00 0.00 0.09 0.11 -0.02 0.00 0.00 0.00

1RQW thaumatin (t.danielli) 0.00 -0.01 0.01 0.27 0.26 0.01 0.05 0.04 0.01

1QGD transketolase (E.coli) 0.11 0.11 0.00 0.02 0.00 0.02 0.00 0.00 0.00

3A7T trypsin (bovine) 0.00 0.00 0.00 0.30 0.30 0.00 0.00 0.00 0.00

2TGA trypsinogen(bovine pancreas) 0.00 0.00 0.00 0.30 0.30 0.00 0.00 0.00 0.00

2J9Z tryptophan Synthase 0.00 0.00 0.00 0.06 0.05 0.01 0.00 0.00 0.00

1BBI trypsin-chymotrypsin 0.00 0.00 0.00 0.25 0.30 -0.05 0.00 -0.01 0.01

2ZCB ubiquitin (bovine) 0.03 0.02 0.01 0.23 0.23 0.00 0.05 0.04 0.01

4H9M urease (Canavalia ensiformis) 0.07 0.06 0.01 0.13 0.13 0.00 0.01 0.01 0.00


Table B3. Continuation of FTIR/PLS calibration results for 3₁₀-helix, β-turns and other.

3₁₀-helix β-turns Other

PDBID Protein Names Observed Predicted Residual Observed Predicted Residual Observed Predicted Residual

3V03 albumin (bovine) 0.04 0.04 0.00 0.09 0.09 0.00 0.16 0.18 -0.02

1A06 albumin (human) 0.02 0.02 0.00 0.09 0.09 0.00 0.19 0.21 -0.02

1E7H albumin (sheep) 0.02 0.03 -0.01 0.09 0.08 0.01 0.17 0.30 -0.13


1QHO alpha amylase_type2 (bacillus species) 0.03 0.04 -0.01 0.14 0.14 0.00 0.34 0.36 -0.02

1DAY alpha casein (bovine milk) 0.08 0.08 0.00 0.10 0.09 0.01 0.31 0.32 -0.01

1HFZ alpha_lactalbumin 0.14 0.13 0.01 0.18 0.18 0.00 0.31 0.30 0.01

4GCR alpha-crystallin (bovine eye lens) 0.06 0.06 0.00 0.08 0.09 -0.01 0.37 0.38 -0.01

3LDJ aprotinin (bovine lung) 0.07 0.07 0.00 0.07 0.06 0.01 0.46 0.42 0.04

1SLF avidin 0.05 0.06 -0.01 0.05 0.05 0.00 0.32 0.36 -0.04

1BEB β-lactoglobulin 0.06 0.05 0.01 0.12 0.13 -0.01 0.29 0.27 0.02

1V9E carbonic anhydrase (bovine) 0.08 0.05 0.03 0.11 0.11 0.00 0.42 0.44 -0.02


3NNE choline oxidase 0.05 0.04 0.01 0.14 0.15 -0.01 0.40 0.32 0.08


1BKV collagen(calf skin) 0.00 -0.01 0.01 0.00 0.01 -0.01 1.00 0.86 0.14

1OVT conalbumin.dpt(chicken egg white) 0.05 0.04 0.01 0.16 0.16 0.00 0.34 0.37 -0.03

3CNA concanavalin(jack bean) 0.00 0.01 -0.01 0.09 0.09 0.00 0.50 0.51 -0.01

1MPX d_amino acid dehydrogenase 0.05 0.06 -0.01 0.12 0.11 0.01 0.38 0.52 -0.14

2OII elastin 0.00 -0.01 0.01 0.08 0.07 0.01 0.61 0.59 0.02

1W17 esterase (bacillus subtilis) 0.03 0.03 0.00 0.10 0.10 0.00 0.32 0.30 0.02

1T1C esterase (rhizopus oryzae) 0.02 0.03 -0.01 0.14 0.14 0.00 0.31 0.31 0.00

3F32 ferritin (horse spleen) 0.00 0.02 -0.02 0.07 0.09 -0.02 0.19 0.17 0.02

1DLP fetuin (calf serum) 0.00 0.00 0.00 0.09 0.09 0.00 0.58 0.59 -0.01

3GHG fibrinogen(bovine plasma) 0.02 0.02 0.00 0.06 0.05 0.01 0.21 0.22 -0.01

2EIB galactose oxidase 0.02 0.02 0.00 0.13 0.13 0.00 0.46 0.46 0.00

1GAL glucose oxidase (aspergillus niger) 0.07 0.08 -0.01 0.12 0.12 0.00 0.33 0.30 0.03

2QSP hemoglobin (bovine) 0.09 0.09 0.00 0.11 0.11 0.00 0.14 0.11 0.03

1BG3 hexokinase 0.03 0.02 0.01 0.11 0.12 -0.01 0.27 0.25 0.02

1IGT IgG (bovine) 0.02 0.03 -0.01 0.08 0.07 0.01 0.42 0.45 -0.03

3AGV IgG (Human) 0.04 0.05 -0.01 0.08 0.08 0.00 0.42 0.44 -0.02

2OLY insulin (human) 0.12 0.10 0.02 0.00 0.03 -0.03 0.08 0.10 -0.02

1W2T invertase(from bakers yeast) 0.03 0.03 0.00 0.10 0.10 0.00 0.30 0.32 -0.02


1LDM lactic dehydrogenase 0.07 0.07 0.00 0.06 0.05 0.01 0.22 0.19 0.03

1IOA lectin ( p.vulgaris) 0.04 0.05 -0.01 0.14 0.14 0.00 0.37 0.46 -0.09


1CRL lipaseA (candida rugosa) 0.05 0.05 0.00 0.13 0.13 0.00 0.36 0.37 -0.01


1OVA native ovalbumin 0.03 0.04 -0.01 0.13 0.13 0.00 0.23 0.31 -0.08

9PAP papain (papaya latex) 0.03 0.04 -0.01 0.11 0.12 -0.01 0.46 0.50 -0.04

1YX9 pepsin(pepsin gastric mucosa) 0.04 0.04 0.00 0.13 0.13 0.00 0.31 0.29 0.02

1LSH phosvitin(from egg white) 0.02 0.02 0.00 0.10 0.09 0.01 0.23 0.24 -0.01

2ZFG porin 0.00 -0.01 0.01 0.14 0.13 0.01 0.24 0.22 0.02

3AJ8 proteinase (from bovine pancreas) 0.03 0.03 0.00 0.16 0.16 0.00 0.34 0.32 0.02

1HZX rhodopsin unbleached 0.03 0.03 0.00 0.10 0.09 0.01 0.25 0.17 0.08

1RBX ribonuclease A (bovine pancreas) 0.02 0.01 0.01 0.06 0.08 -0.02 0.41 0.40 0.01

3DJV ribonuclease S (from bovine pancreas) 0.05 0.05 0.00 0.11 0.11 0.00 0.32 0.27 0.05

1RBB ribonuclease B (bovine pancreas) 0.03 0.04 -0.01 0.15 0.15 0.00 0.31 0.37 -0.06

1SSN staphylokinase-B wild type 0.02 0.01 0.01 0.08 0.08 0.00 0.53 0.51 0.02

1GNS subtilisin 0.00 -0.01 0.01 0.16 0.15 0.01 0.34 0.32 0.02

3CEI superoxide dismutase 0.03 0.04 -0.01 0.13 0.14 -0.01 0.23 0.21 0.02

1RQW thaumatin (t.danielli) 0.01 0.02 -0.01 0.12 0.14 -0.02 0.40 0.46 -0.06

1QGD transketolase (E.coli) 0.05 0.05 0.00 0.13 0.14 -0.01 0.26 0.27 -0.01

3A7T trypsin (bovine) 0.03 0.03 0.00 0.15 0.15 0.00 0.43 0.47 -0.04

2TGA trypsinogen(bovine pancreas) 0.03 0.03 0.00 0.14 0.14 0.00 0.44 0.44 0.00

2J9Z Tryptophan Synthase 0.10 0.10 0.00 0.24 0.23 0.01 0.28 0.25 0.03

1BBI trypsin-chymotrypsin 0.00 -0.02 0.02 0.06 0.05 0.01 0.69 0.66 0.03


4H9M Urease (Canavalia ensiformis) 0.04 0.04 0.00 0.15 0.16 -0.01 0.34 0.32 0.02



Table B4. FTIR/PLS Cross-Validation results for α-helix and β-sheet

PLS Cross-Validation

α-Helix β-Sheet

Protein Names Observed Predicted Residual Observed Predicted Residual

albumin (bovine) 0.70 0.75 -0.05 0.00 -0.01 0.01

albumin (human) 0.69 0.61 0.08 0.00 0.01 -0.01

albumin (sheep) 0.71 0.49 0.22 0.00 0.04 -0.04

alcohol dehydrogenase 0.16 0.18 -0.02 0.19 0.18 0.01

alpha amylase_type2 (bacillus species) 0.20 0.13 0.07 0.29 0.29 0.00

alpha casein (bovine milk) 0.36 0.37 -0.01 0.15 0.14 0.01

alpha_lactalbumin 0.32 0.33 -0.01 0.07 0.11 -0.04

alpha-crystallin (bovine eye lens) 0.03 0.14 -0.11 0.46 0.45 0.01

aprotinin (bovine lung) 0.14 0.16 -0.02 0.25 0.22 0.03

avidin 0.03 0.04 -0.01 0.55 0.49 0.06

β-lactoglobulin 0.10 0.12 -0.02 0.42 0.45 -0.03

carbonic anhydrase (bovine) 0.08 0.16 -0.08 0.29 0.33 -0.04

carbonic anhydrase (human) 0.08 0.16 -0.08 0.29 0.35 -0.06

choline oxidase 0.21 0.10 0.11 0.21 0.26 -0.05

chymotrypsin (bovine) 0.00 0.03 -0.03 0.35 0.38 -0.03

collagen(calf skin) 0.00 0.11 -0.11 0.00 0.07 -0.07

conalbumin.dpt(chicken egg white) 0.28 0.37 -0.09 0.18 0.15 0.03

concanavalin(jack bean) 0.00 0.08 -0.08 0.41 0.46 -0.05

d_amino acid dehydrogenase 0.18 0.10 0.08 0.27 0.20 0.07

elastin 0.00 0.12 -0.12 0.31 0.32 -0.01

esterase (bacillus subtilis) 0.40 0.41 -0.01 0.15 0.12 0.03

esterase (rhizopus oryzae) 0.25 0.22 0.03 0.28 0.29 -0.01

ferritin (horse spleen) 0.75 0.65 0.10 0.00 0.01 -0.01

fetuin (calf serum) 0.00 0.11 -0.11 0.32 0.30 0.02

fibrinogen(bovine plasma) 0.70 0.66 0.04 0.02 0.00 0.02

galactose oxidase 0.01 0.08 -0.07 0.38 0.38 0.00

glucose oxidase (aspergillus niger) 0.27 0.38 -0.11 0.20 0.13 0.07

hemoglobin (bovine) 0.67 0.65 0.02 0.00 0.03 -0.03

hexokinase 0.44 0.50 -0.06 0.15 0.18 -0.03

IgG (bovine) 0.08 -0.01 0.09 0.37 0.34 0.03

IgG (Human) 0.08 -0.12 0.20 0.37 0.40 -0.03

insulin (human) 0.69 0.51 0.18 0.12 0.08 0.04

invertase(from bakers yeast) 0.00 -0.11 0.11 0.57 0.56 0.01

l_lactic (rabbit muscle) 0.40 0.39 0.01 0.21 0.22 -0.01

lactic dehydrogenase 0.45 0.37 0.08 0.20 0.25 -0.05

lectin ( p.vulgaris) 0.02 0.03 -0.01 0.43 0.36 0.07

lglutathion (reduced) 0.51 0.47 0.04 0.09 0.10 -0.01

lipaseA (candida rugosa) 0.33 0.37 -0.04 0.14 0.13 0.01

native form of c1 inhibitor 0.27 0.29 -0.02 0.32 0.23 0.09

native ovalbumin 0.30 0.35 -0.05 0.32 0.26 0.06

papain (papaya latex) 0.23 0.29 -0.06 0.17 0.14 0.03

pepsin(pepsin gastric mucosa) 0.10 0.08 0.02 0.42 0.41 0.01

phosvitin(from egg white) 0.29 0.35 -0.06 0.37 0.38 -0.01

porin 0.04 -0.03 0.07 0.59 0.59 0.00

proteinase (from bovine pancreas) 0.24 0.27 -0.03 0.23 0.23 0.00

rhodopsin unbleached 0.59 0.59 0.00 0.04 0.06 -0.02

ribonuclease A (bovine pancreas) 0.20 0.13 0.07 0.30 0.30 0.00

ribonuclease S (from bovine pancreas) 0.19 0.11 0.08 0.33 0.35 -0.02

ribonuclease B (bovine pancreas) 0.18 0.16 0.02 0.33 0.27 0.06

staphylokinase-B wild type 0.09 0.26 -0.17 0.28 0.29 -0.01

subtilisin 0.32 0.43 -0.11 0.18 0.17 0.01

superoxide dismutase 0.51 0.37 0.14 0.10 0.18 -0.08

thaumatin (t.danielli) 0.11 0.04 0.07 0.36 0.36 0.00

transketolase (E.coli) 0.42 0.47 -0.05 0.14 0.12 0.02

trypsin (bovine) 0.07 0.13 -0.06 0.32 0.31 0.01

trypsinogen(bovine pancreas) 0.07 0.08 -0.01 0.32 0.34 -0.02

Tryptophan Synthase 0.32 0.38 -0.06 0.06 0.08 -0.02

trypsin-chymotrypsin 0.00 0.06 -0.06 0.25 0.30 -0.05

ubiquitin (bovine) 0.16 0.20 -0.04 0.31 0.29 0.02

Urease (Canavalia ensiformis) 0.26 0.19 0.07 0.22 0.21 0.01
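Tables B4 to B6 report cross-validated predictions, in which each prediction comes from a model that did not see that sample during fitting. The sketch below illustrates the idea with leave-one-out cross-validation. It assumes a one-variable least-squares line as a stand-in for the PLS model and uses a small made-up data set; the cross-validation scheme and PLS model actually used are those described in Chapter 4, and all names here are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x (a stand-in for PLS)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def leave_one_out(xs, ys):
    """Predict each sample from a model fitted on all the other samples."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        predictions.append(a + b * xs[i])
    return predictions


# Hypothetical data: one spectral feature per sample and the
# corresponding observed structure fraction (not taken from the thesis).
feature = [0.10, 0.25, 0.40, 0.55, 0.70]
observed = [0.05, 0.22, 0.33, 0.52, 0.66]

predicted = leave_one_out(feature, observed)
for obs, pred in zip(observed, predicted):
    print(f"observed={obs:.2f} predicted={pred:.2f} residual={obs - pred:+.2f}")
```

Because the held-out sample never influences its own prediction, cross-validated residuals are usually larger than calibration residuals, which is consistent with the differences between Tables B1 to B3 and Tables B4 to B6.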


Table B5. FTIR/PLS Cross-Validation results for parallel β-sheet, antiparallel β-sheet and

mixed β-sheet


Parallel β-Sheet Anti-Parallel β-Sheet Mixed β-Sheet

PDBID Protein Names Observed Predicted Residual Observed Predicted Residual Observed Predicted Residual

3V03 albumin (bovine) 0.00 0.02 -0.02 0.00 0.00 0.00 0.00 0.00 0.00

1A06 albumin (human) 0.00 0.02 -0.02 0.00 -0.02 0.02 0.00 0.00 0.00

1E7H albumin (sheep) 0.00 0.02 -0.02 0.00 0.03 -0.03 0.00 -0.01 0.01


1QHO alpha amylase_type2 (bacillus species) 0.06 0.08 -0.02 0.20 0.20 0.00 0.01 0.01 0.00

1DAY alpha casein (bovine milk) 0.00 0.01 -0.01 0.13 0.11 0.02 0.00 0.00 0.00

1HFZ alpha_lactalbumin 0.00 0.03 -0.03 0.07 0.11 -0.04 0.00 0.00 0.00

4GCR alpha-crystallin (bovine eye lens) 0.00 0.00 0.00 0.38 0.40 -0.02 0.00 0.00 0.00

3LDJ aprotinin (bovine lung) 0.00 0.01 -0.01 0.25 0.23 0.02 0.00 -0.01 0.01

1SLF avidin 0.00 0.01 -0.01 0.53 0.45 0.08 0.00 0.01 -0.01

1BEB β-lactoglobulin 0.00 0.00 0.00 0.38 0.41 -0.03 0.00 -0.01 0.01

1V9E carbonic anhydrase (bovine) 0.04 0.01 0.03 0.19 0.27 -0.08 0.05 0.03 0.02


3NNE choline oxidase 0.06 0.04 0.02 0.12 0.18 -0.06 0.01 0.00 0.01

1YPH chymotrypsin (bovine) 0.00 0.02 -0.02 0.32 0.30 0.02 0.00 0.01 -0.01

1BKV collagen(calf skin) 0.00 0.03 -0.03 0.00 0.05 -0.05 0.00 0.00 0.00


3CNA concanavalin(jack bean) 0.00 0.04 -0.04 0.41 0.43 -0.02 0.00 0.00 0.00

1MPX d_amino acid dehydrogenase 0.06 0.03 0.03 0.18 0.13 0.05 0.01 0.01 0.00

2OII elastin 0.00 0.03 -0.03 0.29 0.30 -0.01 0.00 0.00 0.00

1W17 esterase (bacillus subtilis) 0.13 0.03 0.10 0.02 0.04 -0.02 0.00 0.00 0.00

1T1C esterase (rhizopus oryzae) 0.00 0.04 -0.04 0.27 0.26 0.01 0.00 0.00 0.00

3F32 ferritin (horse spleen) 0.00 0.01 -0.01 0.00 0.00 0.00 0.00 0.00 0.00

1DLP fetuin (calf serum) 0.00 -0.01 0.01 0.31 0.29 0.02 0.00 0.00 0.00

3GHG fibrinogen(bovine plasma) 0.00 0.03 -0.03 0.01 0.00 0.01 0.01 0.00 0.01

2EIB galactose oxidase 0.01 -0.03 0.04 0.34 0.35 -0.01 0.01 0.01 0.00


2QSP Hemoglobin bovine 0.00 0.03 -0.03 0.00 -0.01 0.01 0.00 0.01 -0.01

1BG3 hexokinase 0.05 0.04 0.01 0.08 0.11 -0.03 0.02 0.02 0.00

1IGT IgG (bovine) 0.00 0.01 -0.01 0.37 0.32 0.05 0.00 0.01 -0.01

3AGV IgG (Human) 0.00 0.00 0.00 0.37 0.39 -0.02 0.00 0.01 -0.01

2OLY insulin (human) 0.00 0.01 -0.01 0.12 0.08 0.04 0.00 0.00 0.00

1W2T invertase(from bakers yeast) 0.01 0.01 0.00 0.49 0.47 0.02 0.00 0.00 0.00


1LDM lactic dehydrogenase 0.08 0.05 0.03 0.11 0.15 -0.04 0.00 0.01 -0.01


1HNA lglutathion (reduced) 0.04 0.02 0.02 0.04 0.04 0.00 0.01 0.02 -0.01

1CRL lipaseA (candida rugosa) 0.06 0.04 0.02 0.05 0.04 0.01 0.01 0.01 0.00



9PAP papain (papaya latex) 0.01 0.05 -0.04 0.15 0.11 0.04 0.01 0.00 0.01

1YX9 pepsin(pepsin gastric mucosa) 0.04 0.02 0.02 0.32 0.33 -0.01 0.03 0.03 0.00

1LSH phosvitin(from egg white) 0.00 0.02 -0.02 0.35 0.34 0.01 0.00 0.01 -0.01

2ZFG porin 0.00 0.01 -0.01 0.56 0.57 -0.01 0.00 0.00 0.00

3AJ8 proteinase (from bovine pancreas) 0.14 0.11 0.03 0.08 0.09 -0.01 0.01 0.01 0.00

1HZX rhodopsin unbleached 0.00 0.03 -0.03 0.04 0.04 0.00 0.00 -0.01 0.01

1RBX ribonuclease A (bovine pancreas) 0.00 0.01 -0.01 0.28 0.28 0.00 0.01 0.02 -0.01

3DJV ribonuclease S (from bovine pancreas) 0.00 0.01 -0.01 0.31 0.32 -0.01 0.01 0.01 0.00


1SSN staphylokinase-B wild type 0.07 0.03 0.04 0.14 0.15 -0.01 0.06 0.06 0.00

1GNS subtilisin 0.13 0.04 0.09 0.04 0.06 -0.02 0.00 0.01 -0.01

3CEI superoxide dismutase 0.00 0.02 -0.02 0.09 0.13 -0.04 0.00 0.01 -0.01

1RQW thaumatin (t.danielli) 0.00 0.01 -0.01 0.27 0.27 0.00 0.05 0.05 0.00

1QGD transketolase (E.coli) 0.11 0.11 0.00 0.02 -0.01 0.03 0.00 0.00 0.00

3A7T trypsin (bovine) 0.00 0.03 -0.03 0.30 0.30 0.00 0.00 0.00 0.00

2TGA trypsinogen(bovine pancreas) 0.00 0.00 0.00 0.30 0.33 -0.03 0.00 0.00 0.00

2J9Z Tryptophan Synthase 0.00 0.01 -0.01 0.06 0.06 0.00 0.00 0.01 -0.01

1BBI trypsin-chymotrypsin 0.00 0.01 -0.01 0.25 0.29 -0.04 0.00 0.01 -0.01


4H9M Urease (Canavalia ensiformis) 0.07 0.08 -0.01 0.13 0.11 0.02 0.01 0.01 0.00



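Table B6. FTIR/PLS Cross-Validation results for 3₁₀-helix, β-turns and other

3₁₀-helix β-turns Other

PDBID Protein Names Observed Predicted Residual Observed Predicted Residual Observed Predicted Residual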

3V03 albumin (bovine) 0.04 0.03 0.01 0.09 0.11 -0.02 0.16 0.13 0.03

1A06 albumin (human) 0.02 0.03 -0.01 0.09 0.09 0.00 0.19 0.25 -0.06

1E7H albumin (sheep) 0.02 0.04 -0.02 0.09 0.08 0.01 0.17 0.30 -0.13


1QHO alpha amylase_type2 (bacillus species) 0.03 0.05 -0.02 0.14 0.14 0.00 0.34 0.32 0.02

1DAY alpha casein (bovine milk) 0.08 0.06 0.02 0.10 0.10 0.00 0.31 0.30 0.01


4GCR alpha-crystallin (bovine eye lens) 0.06 0.05 0.01 0.08 0.08 0.00 0.37 0.31 0.06


1SLF avidin 0.05 0.05 0.00 0.05 0.09 -0.04 0.32 0.34 -0.02

1BEB β-lactoglobulin 0.06 0.05 0.01 0.12 0.12 0.00 0.29 0.28 0.01

1V9E carbonic anhydrase (bovine) 0.08 0.06 0.02 0.11 0.11 0.00 0.44 0.37 0.07


3NNE choline oxidase 0.05 0.06 -0.01 0.14 0.15 -0.01 0.40 0.38 0.02




3CNA concanavalin(jack bean) 0.00 0.02 -0.02 0.09 0.10 -0.01 0.50 0.42 0.08

1MPX d_amino acid dehydrogenase 0.05 0.04 0.01 0.12 0.09 0.03 0.38 0.61 -0.23

2OII elastin 0.00 0.01 -0.01 0.08 0.05 0.03 0.61 0.55 0.06

1W17 esterase (bacillus subtilis) 0.03 0.05 -0.02 0.10 0.10 0.00 0.32 0.31 0.01

1T1C esterase (rhizopus oryzae) 0.02 0.01 0.01 0.14 0.14 0.00 0.31 0.30 0.01

3F32 ferritin (horse spleen) 0.00 0.04 -0.04 0.07 0.08 -0.01 0.19 0.22 -0.03

1DLP fetuin (calf serum) 0.00 0.00 0.00 0.09 0.05 0.04 0.58 0.62 -0.04

3GHG fibrinogen(bovine plasma) 0.02 0.02 0.00 0.06 0.09 -0.03 0.21 0.26 -0.05

2EIB galactose oxidase 0.02 0.03 -0.01 0.13 0.14 -0.01 0.46 0.47 -0.01


2QSP Hemoglobin bovine 0.09 0.08 0.01 0.11 0.12 -0.01 0.14 0.21 -0.07

1BG3 hexokinase 0.03 0.00 0.03 0.11 0.11 0.00 0.27 0.15 0.12

1IGT IgG (bovine) 0.04 0.07 -0.03 0.08 0.10 -0.02 0.42 0.44 -0.02

3AGV IgG (Human) 0.04 0.02 0.02 0.08 0.07 0.01 0.42 0.51 -0.09


1W2T invertase(from bakers yeast) 0.03 0.02 0.01 0.10 0.10 0.00 0.30 0.43 -0.13

3H3F l_lactic (rabbit muscle) 0.04 0.05 -0.01 0.11 0.11 0.00 0.24 0.26 -0.02

1LDM lactic dehydrogenase 0.07 0.05 0.02 0.06 0.05 0.01 0.22 0.32 -0.10



1CRL lipaseA (candida rugosa) 0.05 0.05 0.00 0.13 0.16 -0.03 0.36 0.45 -0.09

2OAY native form of c1 inhibitor 0.02 0.04 -0.02 0.09 0.10 -0.01 0.30 0.31 -0.01

1OVA native ovalbumin 0.03 0.05 -0.02 0.13 0.14 -0.01 0.23 0.28 -0.05

9PAP papain (papaya latex) 0.03 0.03 0.00 0.11 0.12 -0.01 0.46 0.42 0.04

1YX9 pepsin(pepsin gastric mucosa) 0.04 0.06 -0.02 0.13 0.14 -0.01 0.31 0.30 0.01

1LSH phosvitin(from egg white) 0.02 0.02 0.00 0.10 0.09 0.01 0.23 0.13 0.10

2ZFG porin 0.00 -0.01 0.01 0.14 0.12 0.02 0.24 0.33 -0.09

3AJ8 proteinase (from bovine pancreas) 0.03 0.01 0.02 0.16 0.20 -0.04 0.34 0.44 -0.10

1HZX rhodopsin unbleached 0.03 0.04 -0.01 0.10 0.11 -0.01 0.25 0.23 0.02

1RBX ribonuclease A (bovine pancreas) 0.02 0.03 -0.01 0.06 0.09 -0.03 0.41 0.43 -0.02

3DJV ribonuclease S (from bovine pancreas) 0.05 0.05 0.00 0.11 0.13 -0.02 0.32 0.38 -0.06


1SSN staphylokinase-B wild type 0.02 0.02 0.00 0.08 0.09 -0.01 0.53 0.37 0.16

1GNS subtilisin 0.00 0.01 -0.01 0.16 0.12 0.04 0.34 0.32 0.02

3CEI superoxide dismutase 0.03 0.04 -0.01 0.13 0.12 0.01 0.23 0.28 -0.05

1RQW thaumatin (t.danielli) 0.01 0.04 -0.03 0.12 0.12 0.00 0.40 0.41 -0.01

1QGD transketolase (E.coli) 0.05 0.06 -0.01 0.13 0.15 -0.02 0.26 0.15 0.11

3A7T trypsin (bovine) 0.03 0.02 0.01 0.15 0.16 -0.01 0.43 0.41 0.02

2TGA trypsinogen(bovine pancreas) 0.03 0.02 0.01 0.14 0.16 -0.02 0.44 0.46 -0.02

2J9Z Tryptophan Synthase 0.10 0.08 0.02 0.24 0.16 0.08 0.28 0.26 0.02

1BBI trypsin-chymotrypsin 0.00 0.04 -0.04 0.06 0.13 -0.07 0.69 0.41 0.28

2ZCB ubiquitin (bovine) 0.08 0.06 0.02 0.16 0.11 0.05 0.28 0.33 -0.05

4H9M Urease (Canavalia ensiformis) 0.04 0.03 0.01 0.15 0.14 0.01 0.34 0.37 -0.03


Table B7. Results for the 44 samples in the test set used for partial least squares (PLS) analysis. About 20 spectra from the calibration set were included in the test set for validation purposes (α-helix and β-sheet only).

α-Helix β-Sheet

PDBID Protein Names Observed Predicted Residual Observed Predicted Residual

1E7H albumin (sheep) 0.71 0.49 0.22 0.00 0.04 -0.04

1DAY alpha casein (bovine milk) 0.36 0.37 -0.01 0.15 0.14 0.01

1HFZ alpha_lactalbumin 0.31 0.33 -0.02 0.07 0.11 -0.04

3CWL alpha-1-antitrypsin 0.28 0.18 0.10 0.31 0.33 -0.02

1YJ0 annexinV/without calcium 0.68 0.53 0.15 0.00 0.08 -0.08

1AS4 anti-chymotrypsin 0.32 0.25 0.07 0.37 0.30 0.07

3LDJ aprotinin (bovine lung) 0.14 0.16 -0.02 0.25 0.22 0.03

1SLF avidin 0.03 0.04 -0.01 0.55 0.49 0.06

2BRD bacteriorhodopsin in purple membrane 0.77 0.35 0.42 0.00 0.20 -0.20

3IF7 calmodulin 0.59 0.51 0.08 0.06 0.08 -0.02

3M1K carbonic anhydrase (human) 0.08 0.16 -0.08 0.29 0.33 -0.04

1YPH chymotrypsin (bovine) 0.00 0.03 -0.03 0.35 0.38 -0.03

1ATH cleaved form of anti-thrombin 0.27 0.28 -0.01 0.32 0.30 0.02

1BKV collagen(calf skin) 0.00 0.11 -0.11 0.00 0.07 -0.07

1HRC cytochrome (horse heart) 0.41 0.49 -0.08 0.00 0.02 -0.02

1GAL glucose oxidase (aspergillus niger) 0.27 0.38 -0.11 0.20 0.13 0.07

1IGT IgG (bovine) 0.06 -0.01 0.07 0.42 0.34 0.08

2OLY insulin (human) 0.69 0.51 0.18 0.12 0.08 0.04

3H3F l_lactic (rabbit muscle) 0.40 0.39 0.01 0.21 0.22 -0.01

1IOA lectin ( p.vulgaris) 0.02 0.03 -0.01 0.43 0.36 0.07

1HNA l-glutathion (reduced) 0.51 0.47 0.04 0.09 0.10 -0.01

1OIL lipase (wheat) 0.36 0.34 0.02 0.18 0.19 -0.01

2LYZ lysozyme (chicken egg) 0.32 0.40 -0.08 0.06 0.12 -0.06

2MLT melittin pH 6 0.88 0.47 0.41 0.00 0.12 -0.12

1NZ2 myoglobin (horse muscle) 0.74 0.62 0.12 0.00 0.01 -0.01

3DLW native anti-chymotrypsin 0.29 0.29 0.00 0.36 0.25 0.11

1T1F native form of anti-thrombin 0.26 0.27 -0.01 0.27 0.32 -0.05

1OVA native ovalbumin 0.30 0.35 -0.05 0.32 0.26 0.06

1HCH peroxidase (horseradish) 0.44 0.35 0.09 0.02 0.17 -0.15

2AXT photosystem II reaction centre 0.52 0.50 0.02 0.01 0.12 -0.11

3ENJ pig citrate synthase 0.59 0.49 0.10 0.03 0.10 -0.07

1T5A pyruvate kinase 0.36 0.51 -0.15 0.20 0.11 0.09

1QOV Reaction centre R. Sphae. 0.18 0.38 -0.20 0.22 0.21 0.01

2I37 rhodopsin bleached 0.55 0.61 -0.06 0.04 0.03 0.01

1HZX rhodopsin unbleached 0.59 0.59 0.00 0.04 0.06 -0.02

2ACH sact1 DIFF 0.31 0.25 0.06 0.34 0.30 0.04

1LQ8 split form of c1 inhibitor 0.28 0.29 -0.01 0.35 0.25 0.10

1H3E split form ovalbumin 0.41 0.35 0.06 0.13 0.27 -0.14

3CEI superoxide dismutase 0.51 0.37 0.14 0.10 0.18 -0.08

1RQW thaumatin (t.danielli) 0.11 0.04 0.07 0.36 0.36 0.00

3A7T trypsin 0.07 0.22 -0.15 0.32 0.27 0.05

1UHB trypsin (hog pancreas) 0.00 0.01 -0.01 0.34 0.40 -0.06

2J9Z tryptophan Synthase 0.32 0.38 -0.06 0.06 0.08 -0.02

2ZCB ubiquitin (bovine) 0.16 0.20 -0.04 0.31 0.29 0.02





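Table B8. Continuation of Table B7: test-set results for validation purposes (parallel β-sheet, anti-parallel β-sheet, and mixed β-sheet).

Parallel β-Sheet Anti-Parallel β-Sheet Mixed β-Sheet

PDBID Protein Names Observed Predicted Residual Observed Predicted Residual Observed Predicted Residual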


1E7H albumin (sheep) 0.00 0.02 -0.02 0.00 0.02 -0.02 0.00 0.00 0.00

1DAY alpha casein (bovine milk) 0.00 0.01 -0.01 0.13 0.11 0.02 0.00 0.01 -0.01

1HFZ alpha_lactalbumin 0.00 0.03 -0.03 0.07 0.11 -0.04 0.00 0.00 0.00

3CWL alpha-1-antitrypsin 0.03 0.33 -0.30 0.21 0.21 0.27 0.06 0.02 0.04

1YJ0 annexinV/without calcium 0.00 0.03 -0.03 0.00 0.04 -0.04 0.00 0.01 -0.01

1AS4 anti-chymotrypsin 0.02 0.02 0.00 0.32 0.25 0.07 0.02 0.02 0.00

3LDJ aprotinin (bovine lung) 0.00 0.01 -0.01 0.25 0.22 0.03 0.00 -0.01 0.01

1SLF avidin 0.00 0.01 -0.01 0.53 0.45 0.08 0.00 0.01 -0.01

2BRD bacteriorhodopsin in purple membrane 0.00 0.02 -0.02 0.00 0.18 -0.18 0.00 0.01 -0.01

3IF7 calmodulin 0.00 0.03 -0.03 0.06 0.05 0.01 0.00 0.01 -0.01


1YPH chymotrypsin (bovine) 0.00 0.02 -0.02 0.32 0.30 0.02 0.00 0.01 -0.01

1ATH cleaved form of anti-thrombin 0.03 0.02 0.01 0.25 0.26 -0.01 0.03 0.01 0.02

1BKV collagen(calf skin) 0.00 0.03 -0.03 0.00 0.06 -0.06 0.00 -0.01 0.01

1HRC cytochrome (horse heart) 0.00 0.02 -0.02 0.00 0.01 -0.01 0.00 0.00 0.00


1IGT IgG (bovine) 0.00 0.01 -0.01 0.42 0.32 0.10 0.00 0.02 -0.02

2OLY insulin (human) 0.00 0.01 -0.01 0.12 0.07 0.05 0.00 0.01 -0.01

3H3F l_lactic (rabbit muscle) 0.08 0.02 0.06 0.11 0.15 -0.04 0.00 0.01 -0.01


1HNA lglutathion (reduced) 0.04 0.02 0.02 0.04 0.05 -0.01 0.01 0.02 -0.01

1OIL lipase (wheat) 0.09 0.02 0.07 0.07 0.14 -0.07 0.00 0.01 -0.01

2LYZ lysozyme (chicken egg) 0.00 0.02 -0.02 0.06 0.09 -0.03 0.00 0.01 -0.01

2MLT melittin pH 6 0.00 0.03 -0.03 0.00 0.07 -0.07 0.00 0.01 -0.01

1NZ2 myoglobin (horse muscle) 0.00 0.02 -0.02 0.00 -0.03 0.03 0.00 0.00 0.00

3DLW native anti-chymotrypsin 0.02 0.02 0.00 0.31 0.19 0.12 0.02 0.02 0.00

1T1F native form of anti-thrombin 0.03 0.03 0.00 0.21 0.24 -0.03 0.03 0.02 0.01


1HCH peroxidase (horseradish) 0.00 0.02 -0.02 0.02 0.13 -0.11 0.00 0.01 -0.01

2AXT photosystem II reaction centre 0.01 0.02 -0.01 0.00 0.11 -0.11 0.00 0.01 -0.01

3ENJ pig citrate synthase 0.00 0.03 -0.03 0.03 0.07 -0.04 0.00 0.01 -0.01

1T5A pyruvate kinase 0.10 0.02 0.08 0.08 0.09 -0.01 0.01 0.01 0.00

1QOV Reaction centre R. Sphae. 0.00 0.02 -0.02 0.18 0.17 0.01 0.02 0.01 0.01

2I37 rhodopsin bleached 0.00 0.03 -0.03 0.04 0.01 0.03 0.00 0.01 -0.01

1HZX rhodopsin unbleached 0.00 0.03 -0.03 0.04 0.04 0.00 0.00 0.00 0.00

2ACH sact1 DIFF 0.02 0.02 0.00 0.29 0.25 0.04 0.02 0.02 0.00

1LQ8 split form of c1 inhibitor 0.02 0.01 0.01 0.30 0.25 0.05 0.02 0.01 0.01

1H3E split form ovalbumin 0.07 0.04 0.03 0.05 0.12 -0.07 0.01 0.03 -0.02

3CEI superoxide dismutase 0.00 0.02 -0.02 0.09 0.14 -0.05 0.00 0.01 -0.01

1RQW thaumatin (t.danielli) 0.00 0.01 -0.01 0.27 0.27 0.00 0.05 0.02 0.03

3A7T trypsin 0.00 0.03 -0.03 0.30 0.20 0.10 0.00 0.02 -0.02

1UHB trypsin (hog pancreas) 0.00 0.02 -0.02 0.33 0.32 0.01 0.00 0.02 -0.02

2J9Z tryptophan Synthase 0.00 0.01 -0.01 0.06 0.07 -0.01 0.00 0.02 -0.02





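Table B9. Continuation of Table B7: test-set results for validation purposes (3₁₀-helix, β-turns and other).

3₁₀-helix β-turns Other

PDBID Protein Names Observed Predicted Residual Observed Predicted Residual Observed Predicted Residual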


1E7H albumin (sheep) 0.02 0.04 -0.02 0.09 0.08 0.01 0.17 0.30 -0.13

1DAY alpha casein (bovine milk) 0.08 0.06 0.02 0.10 0.10 0.00 0.31 0.30 0.01


3CWL alpha-1-antitrypsin 0.02 0.04 -0.02 0.10 0.12 -0.02 0.28 0.35 -0.07

1YJ0 annexinV/without calcium 0.10 0.04 0.06 0.07 0.12 -0.05 0.25 0.25 0.00

1AS4 anti-chymotrypsin 0.02 0.04 -0.02 0.09 0.12 -0.03 0.20 0.31 -0.11


1SLF avidin 0.05 0.05 0.00 0.05 0.09 -0.04 0.32 0.34 -0.02

2BRD bacteriorhodopsin in purple membrane 0.00 0.04 -0.04 0.02 0.12 -0.10 0.21 0.29 -0.08

3IF7 calmodulin 0.00 0.04 -0.04 0.07 0.11 -0.04 0.28 0.25 0.03

3M1K carbonic anhydrase (human) 0.08 0.05 0.03 0.11 0.11 0.00 0.44 0.37 0.07


1ATH cleaved form of anti-thrombin 0.02 0.03 -0.01 0.11 0.11 0.00 0.28 0.29 -0.01


1HRC cytochrome (horse heart) 0.00 0.05 -0.05 0.17 0.10 0.07 0.41 0.29 0.12


1IGT IgG (bovine) 0.02 0.04 -0.02 0.08 0.10 -0.02 0.42 0.44 -0.02


3H3F l_lactic (rabbit muscle) 0.04 0.06 -0.02 0.11 0.11 0.00 0.24 0.26 -0.02



1OIL lipase (wheat) 0.02 0.05 -0.03 0.16 0.11 0.05 0.28 0.31 -0.03

2LYZ lysozyme (chicken egg) 0.10 0.05 0.05 0.24 0.10 0.14 0.28 0.30 -0.02

2MLT melittin pH 6 0.00 0.04 -0.04 0.04 0.11 -0.07 0.08 0.26 -0.18

1NZ2 myoglobin (horse muscle) 0.00 0.06 -0.06 0.12 0.09 0.03 0.14 0.23 -0.09

3DLW native anti-chymotrypsin 0.02 0.04 -0.02 0.07 0.12 -0.05 0.27 0.31 -0.04

1T1F native form of anti-thrombin 0.04 0.03 0.01 0.10 0.12 -0.02 0.33 0.30 0.03

1OVA native ovalbumin 0.03 0.04 -0.01 0.13 0.14 -0.01 0.23 0.28 -0.05

1HCH peroxidase (horseradish) 0.06 0.05 0.01 0.17 0.10 0.07 0.31 0.31 0.00

2AXT photosystem II reaction centre 0.04 0.04 0.00 0.10 0.11 -0.01 0.33 0.24 0.09

3ENJ pig citrate synthase 0.03 0.04 -0.01 0.10 0.12 -0.02 0.24 0.25 -0.01

1T5A pyruvate kinase 0.04 0.04 0.00 0.10 0.10 0.00 0.30 0.24 0.06

1QOV Reaction centre R. Sphae. 0.06 0.04 0.02 0.12 0.12 0.00 0.42 0.27 0.15

2I37 rhodopsin bleached 0.04 0.03 0.01 0.12 0.11 0.01 0.25 0.23 0.02

1HZX rhodopsin unbleached 0.03 0.04 -0.01 0.10 0.11 -0.01 0.25 0.23 0.02

2ACH sact1 DIFF 0.02 0.04 -0.02 0.09 0.12 -0.03 0.23 0.31 -0.08

1LQ8 split form of c1 inhibitor 0.04 0.03 0.01 0.08 0.11 -0.03 0.26 0.31 -0.05

1H3E split form ovalbumin 0.04 0.04 0.00 0.14 0.14 0.00 0.28 0.29 -0.01

3CEI superoxide dismutase 0.03 0.04 -0.01 0.13 0.12 0.01 0.23 0.28 -0.05

1RQW thaumatin (T. danielli) 0.01 0.04 -0.03 0.12 0.12 0.00 0.40 0.41 -0.01

3A7T trypsin 0.03 0.04 -0.01 0.15 0.12 0.03 0.43 0.35 0.08

1UHB trypsin (hog pancreas) 0.02 0.04 -0.02 0.12 0.13 -0.01 0.51 0.42 0.09

2J9Z tryptophan Synthase 0.10 0.08 0.02 0.24 0.16 0.08 0.28 0.26 0.02

2ZCB ubiquitin (bovine) 0.08 0.06 0.02 0.16 0.11 0.05 0.28 0.33 -0.05



Table B10. Results for FTIR/PLS comparison of proteins in H2O and D2O.

(For α-helix and β-sheet results, see chapter 4, section 4.6.4.)

Table B11. Continuation of Table B7 for comparison of proteins in H2O and D2O results.

Columns within each structure class: Observed (X-ray); Predicted (proteins in H2O); Predicted (proteins in D2O).

Num | PDBID | Protein Names | α-Helix: Observed | α-Helix: H2O | α-Helix: D2O | β-Sheet: Observed | β-Sheet: H2O | β-Sheet: D2O
1 | 3CWL | alpha-1-antitrypsin | 0.28 | 0.30 | 0.28 | 0.31 | 0.33 | 0.33
2 | 3M1K | carbonic anhydrase | 0.08 | 0.10 | 0.14 | 0.29 | 0.35 | 0.29
3 | 1YPH | chymotrypsin (bovine) | 0.00 | 0.03 | 0.00 | 0.35 | 0.38 | 0.35
4 | 1ATH | cleaved form of anti-thrombin | 0.27 | 0.28 | 0.26 | 0.32 | 0.30 | 0.33
5 | 1HRC | cytochrome (Horse heart) | 0.41 | 0.49 | 0.47 | 0.00 | 0.02 | 0.01
6 | 2QSP | Hemoglobin bovine | 0.67 | 0.67 | 0.66 | 0.00 | 0.03 | 0.02
7 | 2LYZ | lysozyme | 0.32 | 0.28 | 0.31 | 0.06 | 0.70 | 0.06
8 | 1NZ2 | myoglobin (horse muscle) | 0.74 | 0.62 | 0.70 | 0.00 | 0.01 | 0.04
9 | 1T1F | native form of anti-thrombin | 0.26 | 0.27 | 0.26 | 0.27 | 0.28 | 0.25
10 | 2OAY | native form of c1 inhibitor | 0.27 | 0.34 | 0.32 | 0.32 | 0.28 | 0.30
11 | 1YX9 | pepsin (pepsin gastric mucosa) | 0.10 | 0.13 | 0.11 | 0.42 | 0.42 | 0.42
12 | 1RBX | ribnuA (from bovine pancreas) | 0.20 | 0.18 | 0.18 | 0.30 | 0.32 | -0.02
13 | 3DJV | ribnuS.dpt (from bovine pancreas) | 0.19 | 0.19 | 0.20 | 0.33 | 0.36 | 0.32
14 | 1LQ8 | split form of c1 inhibitor | 0.28 | 0.29 | 0.29 | 0.35 | 0.34 | 0.34
15 | 3CEI | superoxide dismutase | 0.51 | 0.45 | 0.47 | 0.10 | 0.11 | 0.12
16 | 2TGA | trypsinogen (bovine pancreas) | 0.07 | 0.11 | 0.06 | 0.32 | 0.29 | 0.32

Columns within each structure class: Observed (X-ray); Predicted (proteins in H2O); Predicted (proteins in D2O).

Num | PDBID | Protein Names | Parallel β-Sheet: Observed | Parallel β-Sheet: H2O | Parallel β-Sheet: D2O | Anti-Parallel β-Sheet: Observed | Anti-Parallel β-Sheet: H2O | Anti-Parallel β-Sheet: D2O | Mixed β-Sheet: Observed | Mixed β-Sheet: H2O | Mixed β-Sheet: D2O
1 | 3CWL | alpha-1-antitrypsin | 0.03 | 0.03 | 0.00 | 0.21 | 0.21 | 0.20 | 0.06 | 0.02 | 0.03
2 | 3M1K | carbonic anhydrase | 0.04 | 0.03 | 0.02 | 0.19 | 0.19 | 0.19 | 0.05 | 0.05 | 0.04
3 | 1YPH | chymotrypsin (bovine) | 0.00 | 0.02 | 0.00 | 0.32 | 0.30 | 0.31 | 0.00 | 0.01 | 0.00
4 | 1ATH | cleaved form of anti-thrombin | 0.03 | 0.02 | 0.01 | 0.31 | 0.28 | 0.29 | 0.03 | 0.01 | 0.02
5 | 1HRC | cytochrome (Horse heart) | 0.00 | 0.02 | 0.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00
6 | 2QSP | Hemoglobin bovine | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00
7 | 2LYZ | lysozyme | 0.00 | 0.02 | 0.01 | 0.06 | 0.09 | 0.07 | 0.00 | 0.01 | 0.01
8 | 1NZ2 | myoglobin (horse muscle) | 0.00 | 0.02 | 0.00 | 0.00 | -0.03 | 0.00 | 0.00 | 0.00 | 0.00
9 | 1T1F | native form of anti-thrombin | 0.03 | 0.03 | 0.00 | 0.21 | 0.24 | 0.20 | 0.03 | 0.02 | 0.03
10 | 2OAY | native form of c1 inhibitor | 0.01 | 0.00 | 0.01 | 0.28 | 0.28 | 0.27 | 0.01 | 0.01 | 0.02
11 | 1YX9 | pepsin (pepsin gastric mucosa) | 0.04 | 0.03 | 0.01 | 0.32 | 0.32 | 0.33 | 0.03 | 0.03 | 0.03
12 | 1RBX | ribnuA (from bovine pancreas) | 0.00 | 0.00 | 0.00 | 0.28 | 0.28 | 0.27 | 0.01 | 0.01 | 0.01
13 | 3DJV | ribnuS.dpt (from bovine pancreas) | 0.00 | 0.00 | 0.00 | 0.31 | 0.31 | 0.31 | 0.01 | 0.00 | 0.01
14 | 1LQ8 | split form of c1 inhibitor | 0.02 | 0.01 | 0.01 | 0.30 | 0.25 | 0.27 | 0.02 | 0.01 | 0.02
15 | 3CEI | superoxide dismutase | 0.00 | 0.00 | 0.00 | 0.09 | 0.09 | 0.08 | 0.00 | 0.00 | 0.00
16 | 2TGA | trypsinogen (bovine pancreas) | 0.00 | 0.00 | 0.00 | 0.30 | 0.30 | 0.30 | 0.00 | 0.00 | 0.01


Table B12. Continuation of Table B8 for comparison of proteins in H2O and D2O results.



Num | PDBID | Protein Names | Observed (X-ray) | Predicted (proteins in H2O) | Predicted (proteins in D2O) | Observed (X-ray) | Predicted (proteins in H2O) | Predicted (proteins in D2O) | Observed (X-ray) | Predicted (proteins in H2O) | Predicted (proteins in D2O)
1 | 3CWL | alpha-1-antitrypsin | 0.02 | 0.04 | 0.04 | 0.10 | 0.12 | 0.11 | 0.28 | 0.24 | 0.25
2 | 3M1K | carbonic anhydrase | 0.08 | 0.07 | 0.06 | 0.11 | 0.12 | 0.11 | 0.44 | 0.44 | 0.44
3 | 1YPH | chymotrypsin (bovine) | 0.02 | 0.04 | 0.03 | 0.18 | 0.14 | 0.17 | 0.45 | 0.40 | 0.40
4 | 1ATH | cleaved form of anti-thrombin | 0.02 | 0.03 | 0.03 | 0.11 | 0.11 | 0.11 | 0.28 | 0.29 | 0.29
5 | 1HRC | cytochrome (Horse heart) | 0.00 | 0.05 | 0.04 | 0.17 | 0.16 | 0.16 | 0.41 | 0.41 | 0.42
6 | 2QSP | Hemoglobin bovine | 0.09 | 0.09 | 0.07 | 0.11 | 0.11 | 0.13 | 0.14 | 0.14 | 0.14
7 | 2LYZ | lysozyme | 0.10 | 0.05 | 0.09 | 0.24 | 0.22 | 0.23 | 0.28 | 0.30 | 0.28
8 | 1NZ2 | myoglobin (horse muscle) | 0.00 | 0.06 | 0.00 | 0.12 | 0.09 | 0.11 | 0.14 | 0.15 | 0.13
9 | 1T1F | native form of anti-thrombin | 0.04 | 0.03 | 0.03 | 0.10 | 0.12 | 0.12 | 0.33 | 0.30 | 0.32
10 | 2OAY | native form of c1 inhibitor | 0.02 | 0.02 | 0.02 | 0.09 | 0.08 | 0.08 | 0.30 | 0.30 | 0.30
11 | 1YX9 | pepsin (pepsin gastric mucosa) | 0.04 | 0.04 | 0.03 | 0.13 | 0.13 | 0.13 | 0.31 | 0.31 | 0.32
12 | 1RBX | ribnuA (from bovine pancreas) | 0.02 | 0.01 | 0.02 | 0.06 | 0.08 | 0.07 | 0.41 | 0.41 | 0.39
13 | 3DJV | ribnuS.dpt (from bovine pancreas) | 0.05 | 0.05 | 0.06 | 0.11 | 0.11 | 0.11 | 0.32 | 0.32 | 0.32
14 | 1LQ8 | split form of c1 inhibitor | 0.04 | 0.03 | 0.04 | 0.08 | 0.11 | 0.09 | 0.26 | 0.31 | 0.28
15 | 3CEI | superoxide dismutase | 0.03 | 0.04 | 0.04 | 0.13 | 0.14 | 0.14 | 0.23 | 0.23 | 0.23
16 | 2TGA | trypsinogen (bovine pancreas) | 0.03 | 0.03 | 0.03 | 0.14 | 0.14 | 0.14 | 0.44 | 0.44 | 0.44



Appendix C

Appendix C contains PCA plots for proteins in D2O and Discriminant Function
Analysis (DFA) plots for the Amide I and Amide II regions that were not used in
the main body of this thesis; the protein data set is the same, and the results
from these analyses were quite favourable.

DFA is useful for maximising the differences among sample groups; however, the
score plots from the PCA in Chapter 3 revealed well-clustered protein samples
and served the purpose of our data exploration. See Chapter 3.


Figure C1. Principal component plot of the Amide I region of 16 proteins in D2O, showing
the percentage variance of the first 4 PCs. The blue line indicates that the cumulative
percentage variance is about 98%.
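
The per-component and cumulative percentages plotted in Figure C1 follow directly from the variances (eigenvalues) of the principal components. The sketch below is only illustrative: the function name `explained_variance` and the eigenvalues are assumptions made for the example, not values or code from this thesis, whose analyses were run in Matlab.

```python
from itertools import accumulate


def explained_variance(eigenvalues):
    """Return (per-PC %, cumulative %) for a list of PC variances."""
    total = sum(eigenvalues)
    # Share of the total variance carried by each principal component.
    percent = [100.0 * ev / total for ev in eigenvalues]
    # Running sum of the shares: the cumulative curve in a scree-type plot.
    cumulative = list(accumulate(percent))
    return percent, cumulative


# Hypothetical eigenvalues, for illustration only (not the thesis data).
eigenvalues = [6.0, 2.0, 1.0, 0.5, 0.5]
percent, cumulative = explained_variance(eigenvalues)
for pc, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"PC{pc}: {p:.1f}% (cumulative {c:.1f}%)")
```

Reading the cumulative list at the fourth entry gives the share of variance retained when only the first four PCs are kept.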

Figure C2. PCA score plot of the Amide I region for 16 proteins in D2O. The plot shows
lysozyme associated with the α-helix proteins on the left; it was wrongly marked as ‘other’
before the analysis. β-sheet proteins are clustered above right, while ‘other’ proteins
appear below right.

[Figure C1 plot: x-axis, principal component (1 to 4); y-axis, variance explained (%).]

[Figure C2 plot: x-axis, First Principal Component; y-axis, Second Principal Component; points labelled by protein name; legend: other, beta, alpha.]



Figure C3. PCA loading spectra plot of the Amide I region for 16 proteins in D2O buffer

Figure C4. This loading plot indicates the spectral regions responsible for the
discrimination of the proteins. The plot correlates well with Figure C3 above; it
highlights the wavenumbers associated with the clustering of the proteins.
1653 cm⁻¹ is the intensity peak for α-helix, and 1630 cm⁻¹ is associated with the
β-sheet proteins on the right of Figure C1.

[Figure C3 plot: x-axis, Wavenumber (cm⁻¹), 1600 to 1700; y-axis, Intensity; curves: Loading for PC1 and Loading for PC2; labelled peaks at 1612, 1624, 1630, 1638, 1650, 1653, 1672 and 1679 cm⁻¹.]

[Figure C4 plot, titled "Loading Plot": x-axis, Loading for PC1; y-axis, Loading for PC2; points labelled by wavenumber.]


Figure C5. This plot of the Amide I region of FTIR protein spectra was produced from

Discriminant Function Analysis. The first DF discriminates between alpha-helix on the

right and beta-sheet proteins on the left; the second DF separates the ‘other’ proteins,

mainly above zero but clustered in the middle, from the rest of the data.

Figure C6. This plot of the Amide II region of FTIR protein spectra was produced from

Discriminant Function Analysis. The first DF discriminates between α-helix on the left

and β-sheet proteins on the right; the second DF separates the ‘other’ proteins, mainly

above zero but clustered in the middle, from the rest of the data.

[Figure C5 plot: x-axis, First Discriminant Analysis; y-axis, Second Discriminant Analysis; legend: other, β-sheet, α-helix.]

[Figure C6 plot: x-axis, First Discriminant Analysis; y-axis, Second Discriminant Analysis; legend: other, β-sheet, α-helix.]



Table C1. Accuracy of Discriminant Function Analysis

Naive Bayes Results

Training set = 68x10; Test set = 16x10

Amide I scores from DFA, confusion matrix

Class | α-helix | β-sheet | other
α-helix | 5 | 0 | 0
β-sheet | 0 | 4 | 1
other | 2 | 1 | 3

accuracy = 75.00%

Amide II scores from DFA, confusion matrix

Class | α-helix | β-sheet | other
α-helix | 4 | 0 | 0
β-sheet | 1 | 2 | 2
other | 1 | 1 | 5

accuracy = 68.75%

The rows of each confusion matrix are the three classes of structure used for the
Discriminant Function Analysis; the columns are the classifier results for each
class, i.e., the number of class members that were classified correctly is shown
in that class's own column. As a result, all correct classifications lie on the
diagonal. From the Amide II results, it can be seen that the model could not
distinguish β-sheet proteins properly: it classified one β-sheet class member as
an α-helix protein and two as ‘other’, and this is reflected in the accuracy
figure of 68.75%. The accuracy value is determined by dividing the number of true
positives (along the diagonal) by the total number of class members in the test
set. For Amide II that is (11/16) × 100 = 68.75%.
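
The accuracy calculation described above can be checked with a short sketch. It is written in Python for illustration only; the function name `accuracy` is an assumption, the matrices are copied from Table C1, and this is not the code used to produce the thesis results.

```python
def accuracy(confusion):
    """Percentage of test samples lying on the diagonal of a square confusion matrix."""
    # Correct classifications: entry [i][i] of each row.
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    # Total number of class members in the test set.
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


# Rows: actual class; columns: classifier result (α-helix, β-sheet, other).
amide_i = [[5, 0, 0],
           [0, 4, 1],
           [2, 1, 3]]
amide_ii = [[4, 0, 0],
            [1, 2, 2],
            [1, 1, 5]]

print(accuracy(amide_i))   # 75.0
print(accuracy(amide_ii))  # 68.75
```

Both values match the accuracy figures reported in Table C1 (12/16 and 11/16 correct, respectively).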


Appendix D

Appendix D contains a high-level flow chart of the steps taken in this thesis,
mainly the spectral pre-processing steps carried out in Chapter 2, and a
generalized Multivariate Analysis (MVA) flow for the work done in Chapters 3 and 4.



Figure D1. This top-level flow chart represents the operations carried out in this thesis.
Levels 3.0 and 4.0 can be done in the same step as part of the multivariate regression
analysis. Each step of the chart is expanded in a separate flow diagram; see Figures D2
and D3.


Figure D2. Step 1.0 of the flow chart highlights the steps taken for the ATR-FTIR spectra

measurements using the Bruker Tensor spectrometer with Opus 6.2 software. The process

in red is an external entity to the spectra measurement; the spectra can be saved and plotted

in any chosen plotting software.



Figure D3. Step 2.0 captures the pre-processing steps that were carried out on the measured
ATR-FTIR spectra used for this research. Each step of this flow chart was done in Matlab
R2010 and Matlab R2013a. The process in red is not necessarily part of the pre-processing
steps and is expanded in Figure D4.


Figure D4. This is the generalized procedure for the multivariate analysis done in this
research. The PCA and PLS steps are not broken down in this flow chart; detailed methods
can be found in Chapter 3 (PCA) and Chapter 4 (PLS). These include the processes in navy
blue, which pertain to the internal workings of most multivariate analyses.
