Text-mining practical
Uploaded by lars-juhl-jensen
unix primer
the command line
some useful commands
cat
less
head -10
tail -10
grep 'needle'
cut -f 2
sort
sort -nr
uniq -c
redirecting output
write to file
command > filename
using pipes
command1 | command2
putting it all together
cut -f 4 infile | sort | uniq -c | sort -nr | head -100 > outfile
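To see what each stage of that pipeline contributes, it can help to run it on a tiny file first. The sketch below is only an illustration: the file contents, the GENE_A-style names and the temporary directory are made up, and the real input will have its own column layout.

```shell
# Hypothetical toy input: four tab-separated columns, with a name in column 4.
tmpdir=$(mktemp -d)
{
    printf 'doc1\t10\t14\tGENE_A\n'
    printf 'doc1\t30\t34\tGENE_B\n'
    printf 'doc2\t5\t9\tGENE_A\n'
    printf 'doc3\t7\t11\tGENE_A\n'
    printf 'doc3\t20\t25\tGENE_B\n'
    printf 'doc4\t1\t4\tGENE_C\n'
} > "$tmpdir/infile"

# cut     : keep only column 4
# sort    : bring identical names next to each other
# uniq -c : collapse adjacent duplicates and prefix each with its count
# sort -nr: order numerically by count, largest first
# head    : keep at most the first 100 lines
# >       : write the result to a file instead of the screen
cut -f 4 "$tmpdir/infile" | sort | uniq -c | sort -nr | head -100 > "$tmpdir/outfile"

cat "$tmpdir/outfile"
```

The first sort matters because uniq only merges lines that are adjacent; without it the same name could be counted in several separate runs.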
the task
disease gene finding
named entity recognition
human genes
gene prioritization
what I have done
information retrieval
two diseases
prostate cancer
schizophrenia
two sets of documents
62,755 abstracts
65,588 abstracts
one directory with each set
one file with each abstract
dictionary
tab-delimited file
human genes
22,523 entities
synonyms
from many databases
orthographic variation
prefixes and postfixes
automatically generated
2,726,495 names
tagdir program
flexible matching
upper- and lower-case
spaces and hyphens
tab-delimited output
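The idea behind flexible matching can be tried out with grep alone. This is a rough sketch of the concept, not a description of how tagdir works internally; the example sentences and the pattern are assumptions chosen for illustration.

```shell
# Hypothetical text lines that spell the same name in different ways.
text=$(printf '%s\n' 'IL-2 was elevated' 'il 2 expression' 'Il2 knockout' 'no gene here')

# -i ignores upper- and lower-case differences;
# -E enables extended regular expressions, and [ -]? allows an optional
# space or hyphen between the letters and the number.
matches=$(printf '%s\n' "$text" | grep -i -E 'il[ -]?2')

printf '%s\n' "$matches"
```

A single pattern like this covers three spellings of one name, which hints at why the dictionary expands into so many name variants.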
what you will do
named entity recognition
find unfortunate names
create “black list”
information extraction
co-mentioning
within abstracts
rank genes for each disease
find shared gene
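One possible way to do the ranking and the comparison with the commands from the primer is sketched below. It assumes, purely for illustration, that the tagger output for each disease has the document ID in column 1 and a gene identifier in column 2; the file names and GENE_A-style identifiers are made up, and awk and comm are used in addition to the commands listed earlier.

```shell
tmpdir=$(mktemp -d)
# Hypothetical tagger output per disease: document ID <TAB> gene identifier.
printf 'd1\tGENE_A\nd1\tGENE_A\nd2\tGENE_A\nd2\tGENE_B\nd3\tGENE_C\n' > "$tmpdir/disease1.tsv"
printf 'e1\tGENE_D\ne2\tGENE_A\ne2\tGENE_D\ne3\tGENE_D\n' > "$tmpdir/disease2.tsv"

for d in disease1 disease2; do
    # sort -u removes repeated (document, gene) pairs, so each gene counts
    # once per abstract; the rest ranks genes by number of abstracts.
    sort -u "$tmpdir/$d.tsv" | cut -f 2 | sort | uniq -c | sort -nr > "$tmpdir/$d.ranked"
    # Keep only the gene identifiers, sorted, for the comparison below.
    awk '{print $2}' "$tmpdir/$d.ranked" | sort > "$tmpdir/$d.genes"
done

# comm -12 prints only the lines that occur in both sorted files.
shared=$(comm -12 "$tmpdir/disease1.genes" "$tmpdir/disease2.genes")
echo "$shared"
```

On real data the shared list will usually be long, so comparing only the top of each ranked list (for example with head before the comparison) is a reasonable variation.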
a helping hand
“black list”
100+ matches
10+ matches
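A black list can be built and applied along the following lines. The sketch reads the 100+ and 10+ bullets as match-count cut-offs, which is an assumption, and it invents a three-column layout (document ID, gene identifier, matched name), the file names and a small threshold so that it runs on toy data.

```shell
tmpdir=$(mktemp -d)
# Hypothetical tagger output: document ID <TAB> gene identifier <TAB> matched name.
{
    printf 'd1\tGENE_A\tAAA1\n'
    printf 'd1\tGENE_B\tand\n'
    printf 'd2\tGENE_B\tand\n'
    printf 'd3\tGENE_B\tand\n'
    printf 'd3\tGENE_C\tCCC2\n'
} > "$tmpdir/matches.tsv"

threshold=3   # stand-in for a 100+ or 10+ cut-off on real data

# List names matched at least $threshold times; these are candidates to
# inspect by hand. (Printing $2 only works for names without spaces.)
cut -f 3 "$tmpdir/matches.tsv" | sort | uniq -c | sort -nr |
    awk -v t="$threshold" '$1 >= t {print $2}' > "$tmpdir/candidates.txt"

# After inspection, the unfortunate names go into the black list.
# Here every candidate is accepted, which would not be true on real data.
cp "$tmpdir/candidates.txt" "$tmpdir/blacklist.txt"

# Read the black list first, then keep only matches whose name (column 3)
# is not on it.
awk -F '\t' 'NR == FNR {bad[$1]; next} !($3 in bad)' \
    "$tmpdir/blacklist.txt" "$tmpdir/matches.tsv" > "$tmpdir/filtered.tsv"

cat "$tmpdir/filtered.tsv"
```

Frequent names are not automatically wrong, so the manual inspection step between the candidate list and the black list is the part that cannot be scripted away.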
wrap up
prostate cancer
FOLH1
schizophrenia
Glutamate carboxypeptidase II
same protein
synonyms matter
“black list” is crucial
text mining is useful
not black magic
EMBO Practical Course Computational Biology: Genomes to Systems
Puerto Varas, 3-9 April 2014
Thank you!