Posted on 18-Jan-2016
Example of regression by RBF-ANN
Prediction of charge on peptides after electrospray ionization in mass spectrometry
What are the best attributes to predict charge?
Review of molecular biology
DNA sequence determines protein sequence
Amino acids with different side chains have different names
glycine gly G
alanine ala A
valine val V
leucine leu L
isoleucine ile I
methionine met M
proline pro P
phenylalanine phe F
tryptophan trp W
serine ser S
cysteine cys C
threonine thr T
glutamine gln Q
asparagine asn N
histidine his H
tyrosine tyr Y
glutamic acid glu E
aspartic acid asp D
lysine lys K
arginine arg R
What are amino acids?
[Figure: structure of an amino acid, labeled N-terminus, C-terminus, and side chain]
chemical properties of amino acids
code  mass       pI     pK1   pK2    charge  hydrophobic?  polar?
A     89.09404   6.01   2.35  9.87   0       T             F
R     174.20274  10.76  1.82  8.99   +       F             F
N     132.1190   5.41   2.14  8.72   0       F             T
D     133.10384  2.85   1.99  9.9    -       F             F
C     121.15404  5.05   1.92  10.7   0       F             T
E     146.14594  3.15   2.1   9.47   -       F             F
Q     146.14594  5.65   2.17  9.13   0       F             T
G     75.06714   6.06   2.35  9.78   0       T             F
H     155.15634  7.6    1.8   9.33   +       F             T
I     131.17464  6.05   2.32  9.76   0       T             F
L     131.17464  6.01   2.33  9.74   0       T             F
K     146.18934  9.6    2.16  9.06   +       F             F
M     149.20784  5.74   2.13  9.28   0       T             F
F     165.1918   5.49   2.2   9.31   0       T             F
P     115.13194  6.3    1.95  10.64  0       T             F
S     105.09344  5.68   2.19  9.21   0       F             T
T     119.12034  5.6    2.09  9.1    0       F             T
W     204.22844  5.89   2.46  9.41   0       T             T
Y     181.19124  5.64   2.2   9.21   0       F             T
V     117.14784  6.0    2.39  9.74   0       T             F
More properties of amino acids
Amino Acids Polymerize to Form Proteins (polypeptides)
Formation of the peptide bond
[Figure: peptide backbone -N-C-C-N-C-C-N- with H, O, and side-chain (R) substituents]
Proteases: enzymes that cut proteins at the peptide bond
[Figure: peptide backbone -N-C-C-N-C-C-N- with H, O, and side-chain (R) substituents, showing the peptide bond that a protease cuts]
Most proteases have cleavage specificity.
Trypsin cleaves mainly at arginine (R) and lysine (K)
Digestion of a protein with trypsin produces peptides of various lengths
Analysis of the digestion mixture yields information about the proteins in the sample
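The cleavage rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming trypsin cuts after every R and K with no exceptions; the slides say only that it cleaves "mainly" at these residues, so real digests will differ. The function name `tryptic_digest` and the example sequence are hypothetical.

```python
def tryptic_digest(protein):
    """Split a protein sequence after every K or R (simplified trypsin rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            # Cut on the C-terminal side of the K or R residue
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Whatever remains after the last cut site is the final peptide
        peptides.append(protein[start:])
    return peptides


# Hypothetical sequence, for illustration only
example_peptides = tryptic_digest("AAKGGRLLKW")
print(example_peptides)
```

Under this simplified rule every peptide except possibly the last ends in K or R, which matches the sequences in the dataset shown later (they end in R or K).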
Peptides are retained for differing times on the LC column
Electrospray ionization
Mass spectrometer
Digested protein mixture
Peptides may have multiple charges. Charges in dataset are averages from several runs
Liquid chromatography coupled to mass spectrometry
First 4 of ~23,000 data pairs:
Sequence Charge
AAAAAAPDDVAAQLVVADLDLVGGHVEDAFAR 2.8
AAAAADLANR 2
AAAAAQASASAAAK 1.714286
AAAAAVAQGGPIEDAER 2
Can peptide sequence be an input?
What inputs can we calculate from the input sequence?
Some suggestions for inputs from properties of amino acids
Length of peptide
Mass of peptide
First amino acid
Last amino acid
Fractions of amino acids of each type
Fractions of hydrophobic, polar, and charged residues
Net formal charge
Average isoelectric point
Average dissociation constant
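Several of the suggested inputs can be computed directly from a sequence and the amino acid property table. The sketch below is one possible way to do it, not the method used to produce the results in these slides. The names `PROPS`, `WATER`, and `peptide_features` are hypothetical. It assumes the tabulated masses are for free amino acids, so one water (18.01528) is subtracted per peptide bond; the average dissociation constants are left out for brevity.

```python
from collections import Counter

# (mass, pI, formal charge, hydrophobic, polar), copied from the property table
PROPS = {
    "A": (89.09404, 6.01, 0, True, False),
    "R": (174.20274, 10.76, 1, False, False),
    "N": (132.1190, 5.41, 0, False, True),
    "D": (133.10384, 2.85, -1, False, False),
    "C": (121.15404, 5.05, 0, False, True),
    "E": (146.14594, 3.15, -1, False, False),
    "Q": (146.14594, 5.65, 0, False, True),
    "G": (75.06714, 6.06, 0, True, False),
    "H": (155.15634, 7.6, 1, False, True),
    "I": (131.17464, 6.05, 0, True, False),
    "L": (131.17464, 6.01, 0, True, False),
    "K": (146.18934, 9.6, 1, False, False),
    "M": (149.20784, 5.74, 0, True, False),
    "F": (165.1918, 5.49, 0, True, False),
    "P": (115.13194, 6.3, 0, True, False),
    "S": (105.09344, 5.68, 0, False, True),
    "T": (119.12034, 5.6, 0, False, True),
    "W": (204.22844, 5.89, 0, True, True),
    "Y": (181.19124, 5.64, 0, False, True),
    "V": (117.14784, 6.0, 0, True, False),
}

WATER = 18.01528  # assumption: one water is lost per peptide bond formed


def peptide_features(seq):
    """Return a dict of simple candidate inputs for one peptide sequence."""
    n = len(seq)
    counts = Counter(seq)
    return {
        "length": n,
        "mass": sum(PROPS[a][0] for a in seq) - (n - 1) * WATER,
        "first": seq[0],
        "last": seq[-1],
        "net_charge": sum(PROPS[a][2] for a in seq),
        "mean_pI": sum(PROPS[a][1] for a in seq) / n,
        "frac_hydrophobic": sum(PROPS[a][3] for a in seq) / n,
        "frac_polar": sum(PROPS[a][4] for a in seq) / n,
        "frac_charged": sum(PROPS[a][2] != 0 for a in seq) / n,
        # Fraction of each of the 20 amino acid types (0.0 if absent)
        "fractions": {a: counts[a] / n for a in PROPS},
    }


features = peptide_features("AAAAADLANR")  # second peptide in the dataset
print(features["length"], features["net_charge"], features["frac_hydrophobic"])
```

The first and last amino acids are categorical, so they would need an encoding (for example one-hot) before being passed to a network.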
MLP with default options
600 examples reserved for test set
Poor results
Other regression options
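One such option is the RBF network named in the title. The slides do not give its configuration, so the following is only a generic sketch using NumPy: Gaussian basis functions centered on a random subset of the training points, with output weights found by linear least squares. The function names, the number of centers, the width, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def rbf_design(X, centers, width):
    """Gaussian activations of each row of X at each center, plus a bias column."""
    # Squared Euclidean distance between every sample and every center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / (2.0 * width ** 2))
    return np.hstack([phi, np.ones((len(X), 1))])


def fit_rbf(X, y, n_centers=10, width=1.0, seed=0):
    """Pick centers from the training points and solve for the output weights."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_centers, replace=False)
    centers = X[idx]
    # Linear least squares for the hidden-to-output weights
    weights, *_ = np.linalg.lstsq(rbf_design(X, centers, width), y, rcond=None)
    return centers, weights


def predict_rbf(X, centers, weights, width=1.0):
    """Evaluate the fitted network on new inputs."""
    return rbf_design(X, centers, width) @ weights


# Synthetic one-dimensional data, for illustration only (not the peptide dataset)
X_demo = np.linspace(0.0, 2.0 * np.pi, 100).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0])
centers, weights = fit_rbf(X_demo, y_demo, n_centers=10, width=1.0)
pred = predict_rbf(X_demo, centers, weights, width=1.0)
print("training MSE:", float(np.mean((pred - y_demo) ** 2)))
```

For the peptide data, `X` would hold numeric inputs computed from each sequence and `y` the average charge, with the reserved test examples kept out of `fit_rbf`. Inputs on very different scales (for example mass and fractions) would likely need standardizing first, because a single `width` is shared by all dimensions.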