Date posted: 19-Dec-2015
What's in a word ?
Term-based approaches across
bioinformatics, scientometrics and knowledge management
Patrick Glenisson
Bio-informatics group, Dept. Electrical Engineering, K.U.Leuven, Belgium
Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium
3
Introduction: K.U. Leuven
Faculty of Applied Sciences
Department of Electrical Engineering
Bio-informatics research
clinical bioinformatics
gene regulation bioinformatics
Research on algorithms and software development for:
Text mining
Gibbs sampling
Graphical models
Classification & clustering
4
Introduction: K.U. Leuven
Faculty of Applied Sciences
Department of Electrical Engineering
Bio-informatics research
Text mining research
Combine statistical approaches with domain-specific requirements
Knowledge discovery through literature analysis in various domains:
Bio-informatics
Sciento- & Technometrics
Knowledge management
5
Overview
• Bio-informatics:
  – gene profiling
  – multi-view learning
• Scientific trend mapping
  – clustering and bibliometric indicators
• Innovation & Spillovers
  – tracing of persons in science & technology spaces
25’
5-10’
6
Overview
[Diagram: text mining goals (Information Retrieval, Information Extraction) set against text mining methodology (Document analysis & Extraction of tokens, Shallow Statistics, Shallow Parsing, Full NLP parsing); overall approach: Generic, Problem-specific, Domain-specific]
10
‘Post-genome’ biology focus shift:
- from single gene to gene groups
- complex interactions within the cellular environment
Microarrays measure the simultaneous activity of many genes:
[Figure: gene expression measurement matrix with genes G1, G2, G3, ... and conditions C1, C2, C3, ..., accompanied by sample annotations and gene annotations]
12
[Diagram: Expression data (gene × conditions); gene expression databases; annotations and relations encoded as free text; PRIOR INFORMATION]
Integrated analysis
13
Hence, 2 views:
• Text analysis for interpretation (supportive role)
• Text analytics for ‘inference’ (active role)
14
A ‘historical’ quote:
‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’ (M. Gerstein, 2001)
GeneRIF
12133521 VEGF is associated with the development and prognosis of colorectal cancer.
12168088 PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.
11866538 Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex
GO
• cell proliferation
• heparin binding
• growth factor activity
15
• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.
• Structured vocabularies are on the rise
  • GO
  • MeSH
  • eVOC
• Standards are systematically being adopted to store biological concepts or annotations:
  • HUGO for gene names
  • GOA
  • …
Increased awareness
16
(GOF) Vector space model
• Document processing
  – Remove punctuation & grammatical structure (‘Bag of words’)
  – Define a vocabulary
    • Identify multi-word terms (e.g., tumor suppressor) (phrases)
    • Eliminate low-content words (e.g., and, gene, ...) (stopwords)
    • Map words with the same meaning (synonyms)
    • Strip plurals, conjugations, ... (stemming)
  – Define weighting scheme and/or transformations (tf-idf, svd, ...)
• index
[Figure: a gene represented as a vector in the space spanned by the vocabulary terms T 1, T 2, T 3]
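The document-processing steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the stopword list, the crude plural stripping (a stand-in for real stemming), the toy texts and the names `tokenize`, `tfidf_index` and `cosine` are all hypothetical, and phrase detection and synonym mapping are left out.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list and toy texts; not taken from the slides.
STOPWORDS = {"and", "the", "of", "in", "is", "a", "gene", "with", "by"}


def tokenize(text):
    """Lowercase, drop punctuation, remove stopwords, strip a plural 's'."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    kept = [w for w in words if w not in STOPWORDS]
    # Crude stand-in for stemming: strip a trailing 's' from longer words,
    # but leave endings such as 'ss' and 'is' alone.
    return [
        w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "is")) else w
        for w in kept
    ]


def tfidf_index(docs):
    """Map each document key to a dict of term -> tf-idf weight."""
    counts = {key: Counter(tokenize(text)) for key, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts.values() for term in c)
    index = {}
    for key, c in counts.items():
        total = sum(c.values())
        index[key] = {
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        }
    return index


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = {
    "geneA": "Growth factor activity and angiogenesis in tumors.",
    "geneB": "Regulates angiogenesis by growth factor expression.",
    "geneC": "Ribosomal protein of the large subunit.",
}
gene_index = tfidf_index(docs)
```

In this toy index, `geneA` and `geneB` share weighted terms and so have a positive cosine similarity, while `geneA` and `geneC` share none.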
17
Validity of gene index
Genes that are functionally related should be close in text space:
Modeled with respect to a background distribution obtained through random and permuted gene groups
Text-based coherence score
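One way to read the coherence idea is as a permutation-style test: score a gene group by its mean pairwise similarity in text space, then compare that score with the scores of random groups of the same size. The sketch below assumes cosine similarity and an empirical p-value; the exact score used in the talk is not given in the transcript, and `coherence`, `empirical_p_value` and the toy index are hypothetical.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coherence(group, index):
    """Mean pairwise cosine similarity of the genes in a group."""
    pairs = list(combinations(group, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)


def empirical_p_value(group, index, n_random=1000, seed=0):
    """Fraction of random same-size groups at least as coherent as `group`."""
    rng = random.Random(seed)
    observed = coherence(group, index)
    genes = sorted(index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_random + 1)


# Toy index (assumption): three genes share a term, nine others do not.
toy_index = {f"g{i}": {"angiogenesis": 1.0, f"t{i}": 0.5} for i in range(3)}
toy_index.update({f"g{i}": {f"t{i}": 1.0} for i in range(3, 12)})
related = ["g0", "g1", "g2"]

score = coherence(related, toy_index)
p_value = empirical_p_value(related, toy_index)
```

A small `p_value` indicates that the group is closer in text space than random groups drawn from the same background.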
20
Data-centered statistical scores
Coherence vs separation of clusters
Stability of a cluster solution when leaving out data
Define `optimal’ ?
Optimal number of clusters ?
[Figure: clusters C1, C2, C3]
Text-based scoring
21
Data-centered statistical scores
Knowledge-based scores
Enrichment of GO annotations in clusters
Literature-based scoring
Define `optimal’ ?
Optimal number of clusters ?
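Enrichment of a GO annotation in a cluster is commonly assessed with a one-sided hypergeometric test. The sketch below assumes that test, which may differ from the scoring used in the talk; `hypergeom_tail` and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical numbers: 1000 genes in total, 50 carry the GO term,
# and a cluster of 20 genes contains 8 of them.
p_enriched = hypergeom_tail(k=8, n=20, K=50, N=1000)
```

Computing this p-value for each cluster at each candidate number of clusters gives one knowledge-based view on which cluster solution is ‘optimal’.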
23
TXTGate
• a platform that offers multiple ‘views’ on vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications.
• incorporates term-based indices ..
• .. and uses them as a starting point
  – to explore the text through the eyes of different domain vocabularies
  – to link out to other resources by query building, or
  – to sub-cluster genes based on text.
26
• Flexible tool for analyzing gene groups (~100 genes), owing to various term- and gene-centric vocabularies
• … that allow some level of interoperability with external annotation databases
• Sub-clustering gene groups useful to detect biological sub-patterns
• Reasonably robust to corrupted groups
• Gene index normalizes for unbalanced references
Features of the approach
27
• Text analysis for interpretation (supportive role)
• Text analytics for ‘inference’ (active role)
28
Meta-clustering text & data
• As multiple information sources are available when analyzing gene expression data, we pose the question:
“How can we analyze data in an integrated fashion to extract more information than from the expression data alone ? ”
..
30
• In each information space
– Appropriate preprocessing
– Choice of distance measures
Integration of text & data
31
• Combine data:
• confidence attributed to either of the two data types
• in case of distance, we can see it as a scaling constant between the norms of the data- and text representations.
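The combining formula itself is not preserved in the transcript. One plausible reading, given the talk of a confidence weight and a scaling constant, is a convex combination of the expression distance and the text distance. The sketch below assumes exactly that form; `combine_distances`, `weight` and the toy matrices are hypothetical.

```python
def combine_distances(d_expr, d_text, weight):
    """Convex combination of two distance matrices given as nested lists.

    `weight` in [0, 1] is the confidence placed in the expression data.
    This form is an assumption, not the formula from the slide.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * e + (1.0 - weight) * t for e, t in zip(row_e, row_t)]
        for row_e, row_t in zip(d_expr, d_text)
    ]


# Toy distance matrices for three genes (hypothetical values).
d_expr = [[0.0, 2.0, 4.0], [2.0, 0.0, 6.0], [4.0, 6.0, 0.0]]
d_text = [[0.0, 0.5, 1.0], [0.5, 0.0, 0.25], [1.0, 0.25, 0.0]]
d_combined = combine_distances(d_expr, d_text, weight=0.5)
```

Because the two matrices live on different scales here, the expression distances dominate the result unless the weight compensates for the difference in norms.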
32
• However, the distributions of the distances introduce a bias (scaling problem)
• Therefore, use a technique from statistical meta-analysis (so-called omnibus procedure)
[Figure: expression distance histogram and text distance histogram]
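A well-known omnibus procedure from meta-analysis is Fisher's inverse chi-square method, which combines p-values rather than raw values. The sketch below assumes that variant (the transcript does not say which one was used): each distance is first replaced by its empirical cumulative probability within its own distribution, which removes the scale, and the two probabilities are then combined. All function names and toy values are hypothetical.

```python
import math
from bisect import bisect_right


def empirical_p(values):
    """Map each distance to the fraction of distances <= it, in (0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


def fisher_combine(p_expr, p_text):
    """Fisher's statistic for two p-values and its chi-square (4 df) tail."""
    stat = -2.0 * (math.log(p_expr) + math.log(p_text))
    # The survival function of a chi-square with 4 degrees of freedom
    # has the closed form exp(-x/2) * (1 + x/2).
    tail = math.exp(-stat / 2.0) * (1.0 + stat / 2.0)
    return stat, tail


def omnibus_distances(d_expr, d_text):
    """Combined, scale-free distance for each gene pair (flat lists)."""
    p_e, p_t = empirical_p(d_expr), empirical_p(d_text)
    return [fisher_combine(a, b)[1] for a, b in zip(p_e, p_t)]


# Toy pairwise distances on very different scales (hypothetical values).
d_expr = [0.1, 5.0, 9.0, 2.0]
d_text = [0.01, 0.5, 0.9, 0.3]
combined = omnibus_distances(d_expr, d_text)
```

Gene pairs that are close in both spaces get a small combined value, and the result depends only on ranks, so rescaling either distance leaves it unchanged.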
33
[Figure: M-score for expression data only versus M-score for integrated clustering, at various cutoffs k of the cluster tree]
Optimal k ?
37
Mapping of Science
• Journal ‘Scientometrics’
• Full-text articles
• Document cluster analysis
• Co-word mapping
• Temporal dimension: clusters over time
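Co-word mapping starts from counts of how often two terms occur in the same document. A minimal sketch, assuming each document has already been reduced to a list of terms; `coword_counts` and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coword_counts(documents):
    """Count, for each unordered term pair, the documents containing both."""
    pairs = Counter()
    for terms in documents:
        # sorted(set(...)) gives each pair a canonical order and ignores repeats.
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


docs = [
    ["citation", "indicator", "journal"],
    ["citation", "journal", "patent"],
    ["indicator", "patent"],
]
cowords = coword_counts(docs)
```

Running the same count per publication year would give one simple handle on the temporal dimension.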
38
Mapping of Science
• Coupling with bibliometric indicators
  – Based on reference (hyperlink) information
  – Mean reference age
  – Number of serials
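Both indicators can be computed directly from a document's reference list. The sketch assumes that mean reference age is the average difference between the publication year and the cited years, and that the number of serials is the count of distinct serial titles cited; the function names and example values are hypothetical.

```python
def mean_reference_age(pub_year, cited_years):
    """Average age, in years, of the references cited by one document."""
    if not cited_years:
        raise ValueError("document has no dated references")
    return sum(pub_year - year for year in cited_years) / len(cited_years)


def number_of_serials(cited_serials):
    """Number of distinct serial titles among the references."""
    return len({title.strip().lower() for title in cited_serials})


# Hypothetical reference list of a paper published in 2004.
age = mean_reference_age(2004, [2003, 2001, 1998, 1990])
serials = number_of_serials(["Scientometrics", "scientometrics ", "Research Policy"])
```

Averaging these values over the documents in a cluster gives one bibliometric profile per cluster.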
40
User profiling & Author-Inventor linkage
• Name resolution
  – Same persons (variants, mistakes)
  – Different persons (similar initials, or even full name)
Van Veldhoven | Veldhoven, Van
Wim Van Veldhoven | Walter Van Veldhoven
Wim Van Veldhoven | Wim Van Veldhoven
Vanveldhoven | Van Veldhoven
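The ‘same person, different spelling’ cases can be handled with a normalization key, as in the sketch below. It is an illustration only: `name_key` is a hypothetical helper that reorders ‘Last, First’ forms, lowercases, and drops everything except letters.

```python
import re
from collections import defaultdict


def name_key(name):
    """Collapse spelling variants: reorder 'X, Y', lowercase, keep letters only."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    # [\W\d_] removes spaces, punctuation, digits and underscores
    # while keeping accented letters.
    return re.sub(r"[\W\d_]", "", name.lower())


variants = ["Van Veldhoven", "Veldhoven, Van", "Vanveldhoven"]
groups = defaultdict(list)
for variant in variants:
    groups[name_key(variant)].append(variant)
```

A key of this kind cannot separate different persons who share a full name, which is why string matching alone is not enough and additional evidence is needed.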
41
Content-based name matching
• Detect spillovers and entrepreneurial activities at (e.g.) university-level
• Matching of ‘inventors’ & ‘authors’ is time-consuming; a semi-automated approach:
Patent DB Publication DB
Relevance ranking
42
Acknowledgements
Steunpunt O&O Statistieken: Debackere K, Glänzel W
ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D
ESAT / BioI: Moreau Y, De Moor B