Introduction
Previous Work
Method
Backup
Acknowledgments
Reference
Background
Investigating mRNAs of Intrinsically Disordered Proteins
Harini Gopalakrishnan
Advisor: Dr. Predrag Radivojac
Basic Facts - mRNA
1. mRNA: messenger ribonucleic acid
2. RNA is a nucleic acid polymer of nucleotide monomers carrying the bases adenine, guanine, cytosine and uracil
3. Three important types of RNA
• rRNA (ribosomal RNA)
• tRNA (transfer RNA)
• mRNA (messenger RNA)
Basic Facts - mRNA (contd.)
mRNA encodes genetic information transcribed from DNA and carries it to the ribosome for protein synthesis
Image source: http://en.wikipedia.org/wiki/Image:Mature_mRNA.png
Basic Facts - mRNA (contd.)
What is the significance of mRNA folding?
Secondary structures have been used to explain
• Translational control
• Regulatory function in the cell, especially for the non-coding mRNA
What are the different folding algorithms?
• Energy minimization
• Base pair maximization
• Covariation
E.g., Mfold, Vienna Package
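To make "base pair maximization" concrete, here is a minimal sketch of a Nussinov-style dynamic program that counts the largest number of nested base pairs a sequence can form. It is an illustration only and not the algorithm used by Mfold or the Vienna Package; the function name `max_base_pairs`, the `min_loop` parameter and the decision to allow G-U wobble pairs are assumptions made for this example.

```python
def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs (Nussinov-style).

    min_loop is an assumed minimum number of unpaired bases in a hairpin loop.
    """
    # Watson-Crick pairs plus G-U wobble pairs (an assumption of this sketch).
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some base k, leaving a loop > min_loop
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))
```

Energy minimization follows the same recursive idea but scores loops and stacks with thermodynamic parameters instead of counting pairs.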
Basic Facts - Disordered Proteins
What is a disordered protein?
• Lacks a well-defined three-dimensional structure
• Conserved between species in composition and sequence
• Contains regions of low sequence complexity
• Amino acid composition is biased away from bulky hydrophobic residues
What is the significance of disordered proteins?
They are involved in the regulation of transcription and translation, cellular signal transduction, protein phosphorylation, the storage of small molecules, and the regulation of the self-assembly of large multiprotein complexes such as the bacterial flagellum and the ribosome.
Basic Facts - Disordered Proteins (contd.)
What is their role in disease?
Famous (or infamous?) disordered proteins in disease
• alpha-synuclein
• p53
• HPV proteins linked to cervical cancer
Which predictors are used? (All take amino acid sequences as input.)
VL2, VSL2, PONDR, VLXT
Image courtesy: http://www.disprot.org
Snapshot from Previous Studies
• Third codon base and stability
• Speed of translation and protein secondary structures
  - alpha helices and beta sheets
• The three bases in the codon
  1st base - biosynthetic pathway
  2nd base - residue hydrophobicity
  3rd base - helix- or beta-strand-forming potential of the amino acid
In a Nutshell
• Check whether mRNA nucleotide composition differs between ordered and disordered proteins
• Check whether the stability of the mRNA fold helps distinguish the two categories of proteins
• This work is different because no previous study has linked protein disorder with mRNA composition and stability
• Establishing such a correlation would also open new avenues for studying how protein structure can be inferred directly from its precursor, the mRNA
Hypothesis
• Codon usage should differ between the mRNA sequences of ordered and disordered proteins
• Folding energy (stability) should differ between the mRNAs of ordered and disordered proteins
• The age of codons is correlated with protein disorder
Central dogma
Method
• Data Collection
• Implementation
• Analysis
• Future Work
Data Collection
This is one of the most important phases, because the significance of the whole analysis rests on the quality of the datasets selected for both categories of proteins.
After experimenting with various other databases, proteins were finally taken from unigene90, DisProt and PDB.
Disorder was predicted using VSL2B.
Datasets
• True dataset (experimentally verified)
• Predicted dataset (from disorder predictors)
Data Collection (contd.)
Once we have the proteins of interest, we use UniProt to mine the web for each protein and its corresponding mRNA, based on their UniGene IDs.
Problem!
• Introns
• Poly-A tails
Both need to be removed: we need a clean dataset in order to study codon usage and nucleotide composition.
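As an illustration of the clean-up step for poly-A tails, the sketch below strips a trailing run of adenines from a sequence. It is not the pipeline used in this work (that was written in Perl); the function name `trim_poly_a` and the threshold of 10 consecutive A's are assumptions chosen for the example. Intron removal is not covered here, since it needs a protein-to-nucleotide alignment.

```python
import re


def trim_poly_a(seq, min_run=10):
    """Remove a trailing run of at least `min_run` A's from a sequence.

    min_run is an assumed threshold; shorter runs of A at the 3' end
    are kept because they may be genuine sequence.
    """
    pattern = re.compile(r"A{%d,}$" % min_run)
    return pattern.sub("", seq.upper())


print(trim_poly_a("ATGGCCTGA" + "A" * 15))  # tail removed
print(trim_poly_a("ATGGCCTGAAA"))           # short run of A's is kept
```

Note that the pattern is greedy, so any genuine A's directly before the tail are removed along with it.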
Solution - Alignment
BLAST
• Proved efficient when aligning the ordered proteins
• Extremely inefficient when aligning protein vs. mRNA for the disordered set of proteins
• Disordered proteins have more low-complexity regions
WISE
• Software from EMBL for aligning protein vs. nucleotide data
• Uses Markov chain methods to make gene predictions and hence identifies introns
• Extremely efficient and provided high-quality datasets
Data Collection - Final Input Statistics
[Chart of final dataset sizes for the Predicted Order, Predicted Disorder, True Order and True Disorder sets; extracted values: 343, 151, 9681]
Method - Overview
Two main characteristics of the mRNA were analyzed
Nucleotide composition of mRNA
• Codon usage
• Nucleotide composition
RNA folding energy and base pair analysis using Mfold
• Number of base pairs formed
• total minimum free energy per RNA fold between
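The two composition measures listed above can be sketched as follows. This is a minimal illustration, not the original MATLAB/Perl code; the function names `nucleotide_composition` and `codon_usage` are hypothetical, and the input is assumed to be an in-frame coding sequence written with T rather than U.

```python
from collections import Counter


def nucleotide_composition(seq):
    """Fraction of A, C, G and T in a sequence (other symbols are ignored)."""
    seq = seq.upper()
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    return {base: (counts[base] / total if total else 0.0) for base in "ACGT"}


def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


example = "ATGGCCGCCTAA"  # made-up sequence, for illustration only
print(nucleotide_composition(example))
print(codon_usage(example))
```

Averaging these per-sequence fractions over the ordered and disordered sets gives values of the kind compared in the results tables.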
Methods
Mfold Snapshot
Mfold - Overview
What is Mfold?
A program for predicting RNA secondary structure (here, of mRNA) by M. Zuker and N. Markham
How does it work?
It is based on nearest-neighbor thermodynamic rules, in which free energies are assigned to loops rather than to base pairs. It predicts the optimal structure by minimizing the overall free energy of the folded structure, including the contribution from coaxial stacking of helices.
What does it output?
Output files are produced for every optimal and suboptimal fold within the allowed energy range. The energy dot plot (on the right) is one important component of this output.
Tools Employed
• Web parsing and data mining were done in Perl
• Analysis and graphs were done in MATLAB
• Reporting and graphs were done in Excel
• Disorder prediction from mRNA inputs was done in MATLAB using an SVM
Results
Nucleotide Composition

Predicted Dataset (DP = predicted disorder, OP = predicted order)
Nucleotide   DP      OP      P-value (DP, OP)
A            0.267   0.239   0.0067
C            0.275   0.256   0
G            0.291   0.250   0
T            0.166   0.255   0

True Dataset (DT = true disorder, OT = true order)
Nucleotide   DT      OT      P-value (DT, OT)
A            0.275   0.267   1.06E-02
C            0.270   0.247   5.04E-17
G            0.271   0.259   4.55E-05
T            0.183   0.226   5.21E-57
Nucleotide Composition
Analysis of Codon Age
Analysis based on the Composition of mRNA
[Figure: old and new codons for each amino acid]
For 14 of the 18 amino acids with more than one codon, the disorder-promoting codon is the older one
The remaining 2 amino acids (M and W) are neutral, as each has only one codon
Base Composition (Predicted Dataset)
        Third Base                  Second Base                 First Base
Base    OP    DP    Total  %        OP    DP    Total  %        OP    DP    Total  %
G       4     9     13     26.07    9     5     14     -4.88    10    6     16     -3.57
C       4     12    16     38.57    10    6     16     -3.57    8     8     16     10.48
T       14    2     16     -31.67   9     6     15     -0.71    7     5     12     0.83
A       13    1     14     -32.98   7     7     14     9.17     10    5     15     -7.74
Base Composition
Preferential selection of codons with “g” or “c” for the third base

Statistical Verification (Third Base)
Base   Order   Disorder   T-test     R Test
g      33949   11475      4.49E-48   0.3263
c      26576   8594       1.17E-15   0.4424
t      23488   6404       3.03E-11   0.0308
a      31324   7721       2.40E-64   0.3548
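Counts like those in the third-base table can be obtained by tallying bases separately at each codon position. The sketch below shows one way to do this, assuming an in-frame coding sequence written with T; the function names `base_counts_by_codon_position` and `gc3_fraction` are hypothetical and this is not the original analysis code.

```python
from collections import Counter


def base_counts_by_codon_position(cds):
    """Return three Counters: base counts at codon positions 1, 2 and 3."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    counts = [Counter(), Counter(), Counter()]
    for i in range(usable):
        counts[i % 3][cds[i]] += 1
    return counts


def gc3_fraction(cds):
    """Fraction of codons whose third base is G or C."""
    third = base_counts_by_codon_position(cds)[2]
    total = sum(third.values())
    return (third["G"] + third["C"]) / total if total else 0.0


example = "ATGGCCGCA"  # made-up sequence: codons ATG, GCC, GCA
print(base_counts_by_codon_position(example)[2])
print(gc3_fraction(example))
```

Summing the third-position counters over all sequences in the ordered and disordered sets would give totals that can then be compared with a statistical test.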
Energy of Folding and Base Pair
Energy of Folding
Predicted Dataset
Quantity                        OP        DP         P-value
Average minimum energy (kcal)   -2230     -2487.27   7.08E-03
Average energy (kcal)           -2170     -2428.29   6.93E-03
Average length                  677.57    679.35     0.87
Energy of Folding and Base Pair
Base Pair Analysis
Base Pair Analysis
Summary - Nucleotide Analysis
Quantity               OP        DP      P-value (OP vs. DP)
Average length         1005.77   732.9   --
Average bases          0.062     0.050   0.0063
Bonding ability of A   0.118     0.118   0.2367
Bonding ability of C   0.133     0.08    3.33e-06
Bonding ability of G   0.151     0.10    7.72e-08
Bonding ability of T   0.146     0.14    0.81
Energy of Folding and Base Pair
Sequence Entropy Plot
Predictions
Using support vector machines (SVMs) with features based on
• Codon composition
• Age of codons
• Base composition
Accuracies so far have been promising
Aim: to predict disorder from mRNA using all of the above information
Acknowledgments
Dr. Predrag Radivojac
Dr. Haixu Tang
Dr. Vladimir Uversky
Amrita Mohan
Linda Hostetter
Informatics faculty and staff
My various Course Professors
Friends and Fellow Students
References
1. http://helix.nih.gov/docs/online/mfold/node3.html
2. Biro JC. Nucleic acid chaperons: a theory of an RNA-assisted protein folding. Theoretical Biology and Medical Modelling 2005; 2: 35.
3. Thanaraj TA, Argos P. Protein secondary structural types are differentially coded on messenger RNA. Protein Sci. 1996; 5: 1973-1983.
4. Taylor FJR, Coates D. The code within codons. Biosystems 1989; 22: 177-187.
5. Brunak S, Engelbrecht J, Kesmir C. Correlation between protein secondary structure and the mRNA nucleotide sequence. In: Protein Structure by Distance Analysis. Amsterdam: IOS Press; 1994. pp 327-334.
6. Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005 Mar; 6(3): 197-208.
7. Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z. Intrinsic disorder and protein function.
8. Tompa P. Intrinsically unstructured proteins evolve by repeat expansion. Bioessays 2003 Sep; 25(9): 847-855.
9. Shabalina SA, Ogurtsov AY, Spiridonov NA. A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res. 2006; 34(8): 2428-2437.
10. Trifonov EN. Theory of Early Molecular Evolution. Landes Bioscience; 2006.
11. Trifonov EN. Consensus temporal order of amino acids and evolution of the triplet code. Gene 2000; 261: 139-151.
12. Radivojac P, Obradovic Z, Smith DK, Zhu G, Vucetic S, Brown CJ, Lawson JD, Dunker AK. Protein flexibility and intrinsic disorder. Protein Science 2004; 13: 71-80.
13. Markham NR, Zuker M. UNAFold: software for nucleic acid folding and hybridizing. Methods in Molecular Biology: Bioinformatics. Totowa, NJ: Humana Press, in press.
14. Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z. Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 2006; 7: 208.
15. Gene Ontology: tool for the unification of biology. Nature Genet. 2000; 25: 25-29.
16. Brooks D, Singh M, Fresco JR. Selection influences the proteomic usage of a majority of amino acid
17. Vucetic S, Obradovic Z, Vacic V, Radivojac P, Peng K, Iakoucheva LM, Cortese MS, Lawson JD, Brown CJ, Sikes JG, Newton CD, Dunker AK. DisProt: a database of protein disorder. Bioinformatics 2005; 21: 137-140.