PROMPT
Protein Mapping and Comparison Tool
By Thorsten Schmidt and Dmitrij Frishman
Free for academic use. Website: http://webclu.bio.wzw.tum.de/prompt/ (binary + source)
Motivation
Past:
Sparse data available
single pairwise comparison
Present + Future:
High-throughput technologies
weighing large protein datasets against each other
Differences between individuals
Differences between populations
Hundreds of questions:
• Do Germans drive faster than Americans?
• Is one gene group significantly enriched in certain functional categories?
• Do GroEL-dependent proteins prefer certain structural folds?
Input
FASTA x x
GenBank x x x
EMBL x x
Swiss-Prot x x x x
UniProt XML x x x x
Generic XML x x
Generic XML input allows importing any numeric or nominal data
Folder with multiple files
File with single (protein) entry
File with multiple (protein) entries
List of identifiers
Analyse annotations
Additionally
[Figure: PROMPT architecture]
Input layer: user input, retrieval, parsing and caching of protein set A and protein set B (SwissProt, EMBL, GenBank, PEDANT, SIMAP, FASTA, XML), yielding dataset A and dataset B
Processing layer: mapping, comparison, statistical testing, yielding results
Presentation layer: view within PROMPT, figure plotting, export, spreadsheet, import
Statistical tests
Help is available for each test and its parameters.
Although you can apply any test manually, in most cases appropriate tests are performed automatically.
Built-in help
Case study: SCOP fold comparison, GroEL-dependent substrates vs. lysate
Background: Around 200 proteins in E. coli depend on the GroEL chaperone for folding.
Question: What distinguishes the GroEL-dependent proteins?
Data: PEDANT genome from clu1.gsf.de, E. coli K12 (updated version)
Assignment threshold 1E-4 for SCOP folds
Symbolic Frequency Comparison (Symbolic), (Symbolic)
Fraction relative to the number of proteins with annotations in each set
P-value: * < 0.05, ** < 0.001, *** < 0.0001
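The slide does not say which test PROMPT applies to symbolic frequencies. As a minimal sketch, assuming a two-sided Fisher's exact test on a 2x2 table (proteins with and without a given annotation in each set), the comparison and the star legend above could look like this. The function names and the example counts are hypothetical and are not part of PROMPT.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: set A proteins with the annotation, b: set A proteins without it,
    c: set B proteins with the annotation, d: set B proteins without it.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x annotated proteins in set A.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(p_value, 1.0)


def stars(p_value):
    """Map a p-value to the star legend used on the slide."""
    if p_value < 0.0001:
        return "***"
    if p_value < 0.001:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


# Hypothetical counts: 30 of 200 proteins in set A and 60 of 2000 in set B
# carry a given fold annotation.
p = fisher_exact_two_sided(30, 170, 60, 1940)
print(p, stars(p))
```

For very large sets a chi-square approximation may be cheaper; the exact test is used here only because it needs nothing beyond the standard library.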
Case study: Comparison of pI distributions
Question: Do the proteins of E. coli and H. pylori differ with respect to their isoelectric points?
Data: Protein sequences of H. pylori and E. coli
The pI is calculated by PROMPT automatically (as are many other sequence-based properties)
Numeric Distribution Comparison (Numeric), (Numeric)
Statistical tests:
• Kolmogorov-Smirnov test
• Mann-Whitney test
• Chi-square test
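To illustrate what the first of these tests measures, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic D, the largest vertical gap between the two empirical distribution functions. It is not PROMPT's implementation, it does not compute the p-value, and the pI values in the usage lines are invented for illustration.

```python
def ks_statistic(sample_a, sample_b):
    """Return the two-sample Kolmogorov-Smirnov statistic D."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past every value <= x (this handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        # i / n and j / m are the empirical CDF values at x.
        d = max(d, abs(i / n - j / m))
    return d


# Hypothetical pI values for two small protein sets.
pi_set_a = [5.1, 5.8, 6.2, 6.9, 9.4]
pi_set_b = [6.0, 8.7, 9.1, 9.6, 10.2]
print(ks_statistic(pi_set_a, pi_set_b))
```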
Case study: Protein length and hydrophobicity
Question: Is there any relationship between protein length and hydrophobicity in membrane proteins?
Data: 2 multi-FASTA files with amino acid sequences
membrane.fasta contains all membrane* proteins of E. coli
fullgenome.fasta contains all proteins of E. coli
*) all proteins with more than 6 membrane-spanning regions predicted by TMHMM 2.0
The GRAVY (grand average of hydropathy) value and many other computable properties are calculated from the sequence by PROMPT automatically
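The GRAVY value is commonly defined as the sum of the Kyte-Doolittle hydropathy values of all residues divided by the sequence length. The sketch below follows that definition; whether PROMPT uses exactly this scale and handles non-standard residues the same way is an assumption, and the function name and example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Return the grand average of hydropathy for a protein sequence."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("empty sequence")
    unknown = set(residues) - set(KYTE_DOOLITTLE)
    if unknown:
        # Assumption: reject non-standard residue codes rather than skip them.
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


# Hypothetical short sequence; positive values indicate a hydrophobic protein.
print(gravy("MKTLLLTLVVVTIVCLDLGYT"))
```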
Numeric Correlation (Numeric x Numeric)
[Figure: scatter plots of average hydrophobicity (GRAVY value, y-axis) against protein length (x-axis). A. All E. coli proteins. B. Membrane proteins only.]
New research result: The longer membrane proteins are, the less hydrophobic they are [Pearson coefficient -0.69; p-value 2.8E-54]
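The Pearson coefficient reported for the membrane proteins is the covariance of the two numeric properties divided by the product of their standard deviations. A minimal sketch of that calculation follows; the function name and the length and hydrophobicity values are hypothetical, and the p-value is not computed here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two paired samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross products around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


# Hypothetical paired values: protein length and average hydrophobicity.
lengths = [210, 340, 480, 620, 910]
hydrophobicity = [1.05, 0.90, 0.74, 0.61, 0.38]
print(pearson_r(lengths, hydrophobicity))
```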
[Figure: data import and mapping workflow]
User input: protein set A and protein set B (SwissProt, EMBL, GenBank, PEDANT, SIMAP, FASTA, XML), each given either as IDs + sequences or as IDs only
PROMPT: for sets given as IDs only, sequences are retrieved automatically (web services, DB query), giving A: IDs + sequences and B: IDs + sequences; A and B are then compared by BLAST to find equivalent sequences
Results: mapped identifiers
Set A | Set B
ID1 | ID3
ID2 | No equivalent
ID3 | ID5
Data Import and Mapping
BLAST parameter dialog
View Mapping Results
Mapping filtering
Choose correct assignments in 2 ways:
• Manually, e.g. by expert knowledge
• Automatic filter with user-specific parameters, e.g.
Select SUBJECT_ID where IDENTITY>99 and MISMATCHES<5
Manual further processing, e.g. save GIs to a text file
Generic XML file: a symbolic property holds the mapping information
VFDB1 <-> GI_1234
VFDB3 <-> GI_3456
…
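The effect of the automatic filter can be sketched outside PROMPT as a plain selection over BLAST hit records. The field names SUBJECT_ID, IDENTITY and MISMATCHES are taken from the example query above; the record layout, the function name and the hit values are assumptions made for illustration only.

```python
def filter_hits(hits, min_identity=99.0, max_mismatches=5):
    """Return the SUBJECT_ID of every hit passing both thresholds.

    Mirrors: Select SUBJECT_ID where IDENTITY>99 and MISMATCHES<5
    """
    return [
        hit["SUBJECT_ID"]
        for hit in hits
        if hit["IDENTITY"] > min_identity and hit["MISMATCHES"] < max_mismatches
    ]


# Hypothetical BLAST hits, one dictionary per query/subject pair.
example_hits = [
    {"QUERY_ID": "VFDB1", "SUBJECT_ID": "GI_1234", "IDENTITY": 99.8, "MISMATCHES": 1},
    {"QUERY_ID": "VFDB2", "SUBJECT_ID": "GI_2345", "IDENTITY": 87.0, "MISMATCHES": 40},
    {"QUERY_ID": "VFDB3", "SUBJECT_ID": "GI_3456", "IDENTITY": 100.0, "MISMATCHES": 0},
]

accepted = filter_hits(example_hits)
print(accepted)
```

The accepted identifiers could then be written to a text file, one per line, for further manual processing.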
Case studies summary
Example | Type of data used | PROMPT method
FunCat distribution in human (*) | (Symbolic) | Symbolic feature frequencies
SCOP fold enrichment of GroEL-dependent substrates | (Symbolic), (Symbolic) | Symbolic feature comparison of two sets
Fold bias of virulence factor proteins (*) | (Symbolic) subset of (Symbolic) | Symbolic feature enrichment in subset vs. set
pI comparison of H. pylori and E. coli | (Numeric), (Numeric) | Numeric feature comparison
Protein length and hydrophobicity | (Numeric x Numeric) | Numeric feature correlation
Essentiality and protein abundance (*) | (Symbolic x Numeric) | Numeric distribution within categories
Note: x means corresponding data pairs, e.g. here describing two values of the same protein
(*) not shown in this talk
As the generic XML input allows the processing of any kind of nominal or numeric data, PROMPT can be applied to nearly any problem domain
Scripting
Scripting ways:
• Interactive console
• Stream (e.g. from a pipeline)
• File
Scripting commands:
• BeanShell = simplified Java
• Or full Java code
Advantages:
• Run Java code directly
• No compilation necessary
• All PROMPT classes are available from the scripts
• "Classpath hell" was yesterday
Just call: ./prompt.sh Filename.java
Conclusions
PROMPT can map, compare and analyse protein sets
• Easy to use interactively
• Large-scale batch processing
• Automatic or manual testing for significance
• Helps to avoid reinventing the wheel
• Graphical visualisations highlighting results
• Generic application even beyond bioinformatics
Dig our data gold mine efficiently