Text mining
Explosion
exponential increase
some things are constant
“graph calculus” = ~45 seconds per paper
Information retrieval
find the relevant papers
user-specified query
“yeast AND cell cycle”
stemming
dynamic query expansion
ranking
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
no tool will find that
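The retrieval steps above can be sketched in a few lines. This is a minimal illustration, not the method of any particular search engine: the stemmer is a toy suffix stripper, the ranking is plain term frequency, and the documents in `DOCS` are made up for the example (only document "c" is the sentence from the slide). Dynamic query expansion is left out.

```python
import re

def stem(word):
    # Toy stemmer (assumption): strips a trailing "ing" or "s" only.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokens(text):
    # Lowercase, split on non-alphanumerics, then stem each token.
    return [stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]

def count_phrase(phrase, toks):
    # Number of positions where the token list contains the phrase.
    n = len(phrase)
    return sum(toks[i:i + n] == phrase for i in range(len(toks) - n + 1))

def score(query_terms, text):
    # Boolean AND: every term must occur; the score is the summed frequency.
    toks = tokens(text)
    counts = [count_phrase(tokens(term), toks) for term in query_terms]
    return sum(counts) if all(counts) else 0

def rank(query_terms, docs):
    # Keep only matching documents, highest score first.
    scored = {doc_id: score(query_terms, text) for doc_id, text in docs.items()}
    return sorted((d for d in scored if scored[d] > 0), key=lambda d: -scored[d])

# Hypothetical mini-corpus; "c" is the example sentence from the slide.
DOCS = {
    "a": "Yeast cell cycle genes: the cell cycles of yeasts are well studied.",
    "b": "Cell cycle control in human cells.",
    "c": ("Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly "
          "phosphorylated Swe1 and this modification served as a priming step "
          "to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation "
          "and degradation"),
    "d": "Yeast is a model organism for studying the cell cycle.",
}

ranking = rank(["yeast", "cell cycle"], DOCS)
```

Document "c" is about the yeast cell cycle but contains neither query term, so it scores zero and never reaches the ranking.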
Entity recognition
identify the substance(s)
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
comprehensive lexicon
orthographic variation
“black list”
manual correction
still too much to read
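A dictionary-based tagger along the lines listed above (lexicon, orthographic variation, black list) might look like the sketch below. The lexicon entries and canonical names are assumptions chosen to fit the example sentence, and the entry "step" is a made-up ambiguous synonym that exists only to show the black list at work. It is not the Reflect implementation.

```python
import re

# Hypothetical mini-lexicon: surface name -> canonical name.
LEXICON = {
    "Clb2": "CLB2",
    "Cdc28": "CDC28",
    "Cdk1": "CDK1",
    "Swe1": "SWE1",
    "Cdc5": "CDC5",
    "step": "STEP_HYPOTHETICAL",  # made-up ambiguous synonym
}

def normalize(name):
    # Orthographic variation: ignore case, hyphens and spaces,
    # so "Cdc-28", "CDC28" and "cdc 28" all compare equal.
    return re.sub(r"[-\s]", "", name).lower()

INDEX = {normalize(name): canonical for name, canonical in LEXICON.items()}

# Black list: lexicon names that are usually ordinary English words.
BLACKLIST = {normalize("step")}

# Candidate token: letters, optionally followed by a hyphen or space and digits.
CANDIDATE = re.compile(r"[A-Za-z]+(?:[- ]?\d+)?")

def tag(text):
    # Return (matched text, canonical name) pairs in order of appearance.
    hits = []
    for match in CANDIDATE.finditer(text):
        key = normalize(match.group())
        if key in INDEX and key not in BLACKLIST:
            hits.append((match.group(), INDEX[key]))
    return hits

SENTENCE = (
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated "
    "Swe1 and this modification served as a priming step to promote subsequent "
    "Cdc5-dependent Swe1 hyperphosphorylation and degradation"
)

found = [canonical for _, canonical in tag(SENTENCE)]
```

Manual correction would then amount to editing `LEXICON` and `BLACKLIST` after inspecting false positives and misses.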
Information extraction
formalize the facts
co-occurrence
global statistical analysis
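Co-occurrence can be illustrated by counting how often two tagged entities appear in the same sentence and comparing that with what chance would predict. The observed-over-expected ratio below is one simple choice of statistic, used here as an assumption; the tagged sentences are a toy input, not real corpus counts.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(tagged_sentences):
    # Count, per sentence, each entity once and each unordered pair once.
    single = Counter()
    pair = Counter()
    for entities in tagged_sentences:
        unique = sorted(set(entities))
        single.update(unique)
        pair.update(combinations(unique, 2))
    return single, pair

def observed_over_expected(a, b, single, pair, n_sentences):
    # Expected count assumes the two entities occur independently.
    a, b = sorted((a, b))
    expected = single[a] * single[b] / n_sentences
    return pair[(a, b)] / expected if expected else 0.0

# Toy input: the entities found in each of four hypothetical sentences.
TAGGED = [
    ["CDC28", "SWE1", "CLB2"],
    ["CDC28", "SWE1"],
    ["CDC5", "SWE1"],
    ["CLB2"],
]

single, pair = cooccurrence(TAGGED)
ratio = observed_over_expected("SWE1", "CDC28", single, pair, len(TAGGED))
```

A ratio above 1 means the pair is seen together more often than independence would suggest. With real data the counts come from the whole corpus, which is what makes the analysis global.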
NLP (Natural Language Processing)
parsing individual sentences
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
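To show how recognized names and relation verbs combine into a formal fact, here is a deliberately small pattern-based sketch. It stands in for the sentence parsing described above and is not the actual NLP pipeline: the entity list, the verb table and the regular expression are all assumptions made for this one example.

```python
import re

# Assumed output of the entity recognition step for this sentence.
ENTITIES = ["Clb2", "Cdc28", "Cdk1", "Swe1", "Cdc5"]

# Hypothetical verb table: surface form -> normalized relation type.
VERBS = {"phosphorylated": "phosphorylates"}

def extract(sentence):
    # Pattern: ENTITY [parenthetical] [one adverb] VERB ENTITY
    ent = "|".join(map(re.escape, ENTITIES))
    verb = "|".join(map(re.escape, VERBS))
    pattern = re.compile(
        rf"\b({ent})\b(?:\s*\([^)]*\))?\s+(?:\w+\s+)?({verb})\s+({ent})\b"
    )
    return [
        (m.group(1), VERBS[m.group(2)], m.group(3))
        for m in pattern.finditer(sentence)
    ]

SENTENCE = (
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated "
    "Swe1 and this modification served as a priming step to promote subsequent "
    "Cdc5-dependent Swe1 hyperphosphorylation and degradation"
)

facts = extract(SENTENCE)
```

The pattern recovers the Cdc28 and Swe1 relation expressed by the verb, but not the nominalized one ("Cdc5-dependent Swe1 hyperphosphorylation"), which is the kind of construction that calls for real parsing rather than surface patterns.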
store in a database
then the fun begins :-)
Acknowledgments
NLP pipeline: Jasmin Saric, Rossitza Ouzounova, Isabel Rojas, Peer Bork
Reflect: Heiko Horn, Sune Frankild, Evangelos Pafilis, Sven Haag, Michael Kuhn, Peer Bork, Reinhardt Schneider, Sean O’Donoghue