Identifying, Indexing, and Ranking Chemical Formulae and Chemical
Names in Digital Text
slides by Suleyman Cetintas & Luo Si
1
Outline
Introduction
Chemical Entity Mentions
SMILES, InChI, IUPAC Nomenclature, Trivial Names
Chemical Entity Extraction from Text
Chemical Name Segmentation
Chemical Entity Indexing
Chemical Entity Search
Text Retrieval Conference – Chemical Track
References
2
Introduction
End users demand fast responses to searches for
chemical entities (e.g., chemical formulae and chemical
names)
A chemical search engine
must identify all occurrences of chemical entities
must index them in order to enable fast access
can be done offline, but is still challenging due to the large volume of data
Tagging chemical formulae & chemical names
hard problem due to inherent ambiguity in natural language
text
3
Introduction
Partial formulae or partial chemical names
chemists and users of chemical search engines want to input a
partial chemical name or formula
expect the search engine to return documents having chemical
entities that contain the partial formula or chemical name
indexing on sub-formulae is required by the search engine for
efficiency
indexing all possible sub-formulae of every formula, or all sub-names
of every chemical name,
requires a large index and is prohibitively expensive in time and
memory
index-pruning is required [Sun et al. 2011]
4
Introduction
Different forms of the same formula
users can search CH3COOH or C2H4O2 (same formula)
both appear in a significant number of documents
for larger chemical formulae, the diversity is even greater
search engine ChemIndustry.com returns the ‘synonyms’ of a chemical
formula and the docs containing those
need to identify the chemical formula, and disambiguate from
other non-chemical-entity-related abbreviations
E.g., OH can be “hydroxyl group” or the state “Ohio”
hard problem as it requires context analysis and natural language
processing (NLP)
5
Introduction
Partial chemical name searches
segmenting a chemical name into meaningful subterms
e.g. “ethyl” or “methyl” instead of “ethy” or “lmethy”
let users perform partial name searches
Tools: Name=Struct [Brecher, 1999], CHEMorph [Kremer et al., 2006],
OPSIN [Corbett and Murray-Rust, 2006]
segment a chemical name into its morphemes, map the morphemes into
their chemical structures, and use these structures to construct the
structure of the named chemical
2 main directions
using dictionaries and lexicons
identifying and using frequent substring patterns in text
6
Introduction
Architecture of a chemical entity search engine with
document search in ChemXSeer [Sun et al., 2010]
7
Chemical Entity Mentions
Finding mentions of chemical compounds in text is
important for several reasons:
annotation of the entities enables a search engine to return
documents containing elements of this entity class (semantic
search), e.g. together with a disease
mapping found entities to corresponding structures leads to
the possibility to search relations between different chemicals
then, a chemist can search for similar structures, substructures,
and combine the information from the text with other tools
8
Chemical Entity Mentions: SMILES
Chemical names fall into different classes that are used to deal with
complex structures
SMILES
mentions of the sum formula or names according to the Simplified
Molecular Input Line Entry Specification (SMILES) [Weininger, 1988]
more human readable than InChI (shown in next slide)
has a wide base of software support with extensive theoretical
(e.g., graph theory) backing
a number of equally valid SMILES can be written for a molecule
e.g., CCO, OCC and C(O)C all specify the structure of ’ethanol’
allow direct structure search
limited readability of such specifications for humans
therefore trivial names are used more frequently in scientific texts
9
Chemical Entity Mentions: InChI
Chemical names fall into different classes that are used to deal with
complex structures
InChI
successor of SMILES, the IUPAC International Chemical Identifier (InChI)
current version is 1.03 and was released in June 2010
InChI algorithm converts input structural information into a unique
InChI identifier in a three-step process:
normalization (to remove redundant info.), canonicalization (to generate a
unique number label for each atom), serialization (to give a string of chars)
unique representation
standard InChI for ‘ethanol’ is ‘InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3’
less human readable than SMILES, can get quite lengthy
allow direct structure search
limited readability of such specifications for humans
therefore trivial names are used more frequently in scientific texts
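To make the layered form of the ethanol identifier on this slide concrete, here is a minimal Python sketch that splits an InChI string into its layers. The function name `split_inchi_layers` is hypothetical, and the sketch only separates the version, the formula, and the layers that start with a one-letter prefix; it does not validate the identifier or run the InChI algorithm itself.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    # Remaining layers carry a one-letter prefix,
    # e.g. 'c' (connectivity) and 'h' (hydrogen atoms).
    for part in parts[2:]:
        if part:
            layers[part[0]] = part[1:]
    return layers


ethanol = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
```

The formula layer (`C2H6O`) is what a formula search would see, while the `c` and `h` layers carry the structure information that trivial names lack.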
10
Chemical Entity Mentions: IUPAC Nomenclature
Chemical names fall into different classes that are used to deal with
complex structures
IUPAC Nomenclature
set of rules to generate systematic names for chemical compounds to
ensure that a chemical name leaves no ambiguity as to what it refers to
i.e., each chemical name should refer to a single substance
the most widely used chemical nomenclature worldwide
does not hold the complete structure information (unlike InChI)
human readable (unlike InChI), and more human readable than SMILES
developed and kept up to date under the auspices of the International
Union of Pure and Applied Chemistry (IUPAC)
along with trivial names (shown in next slide), more common in
scientific text than SMILES or InChI because it is more human readable
11
Chemical Entity Mentions: Trivial Names
Chemical names fall into different classes that are used to deal with
complex structures
Trivial Names
in biology and chemistry, a common name or vernacular name is a
non-systematic name or non-scientific name
most human readable of all four main representations
the name is not recognized according to the rules of any formal
(e.g. IUPAC) system of nomenclature
unlike SMILES & InChI
along with IUPAC names, more common in scientific text than SMILES or
InChI because they are more human readable
IUPAC names and trivial names do not allow for direct structure search
12
Chemical Entity Extraction from Text
(Generic) Entity Extraction from Text
Hidden Markov Models (HMMs) [Baum et al., 1970]
commonly used to label or segment sequences
independence assumption: given the hidden state, observations are
independent
hence, cannot capture the interactions between adjacent tokens
Maximum Entropy (ME) [Borthwick, 1999]
exponential prob. model based on binary data from sequences
estimate parameters with MLE
Maximum Entropy Markov Models (MEMMs) [McCallum et al.,
2000]
exp. prob. models that take the observation features as input, and
output a prob. distribution over possible next states
suffer from the label-bias problem
13
Chemical Entity Extraction from Text
(Generic) Entity Extraction from Text
Conditional Random Field (CRF) [Lafferty et al. 2001]
unlike HMM & MEMM (that use directed graphical models), uses an
undirected graphical model
relaxes conditional independence assumption of HMMs
avoids the label-bias problem of MEMMs
used for labeling sequences
named entity recognition [McCallum & Li, 2003]
detecting biological entities
e.g., proteins [Settles, 2005]
genes [McDonald & Pereira, 2005]
14
Chemical Entity Extraction from Text
Chemical Entity Extraction
challenging problem due to ambiguity, different representations, etc.
examples of chemical formulae, names, and ambiguous terms
15
Chemical Entity Extraction from Text
Chemical Entity Extraction
several early approaches (machine learning & rule based)
automatic recognition of chemical names in natural text
first by [Hodge et al., 1989]
Bayesian classification using n-grams [Wilbur et al., 1999]
rule based algorithms by [Narayanaswamy et al., 2003]
unsupervised approaches by [Vasserman, 2004]
OSCAR3 [Corbett & Murray-Rust, 2006]
unsupervised method using n-grams, but further uses Kneser-Ney
smoothing [Kneser & Ney, 1995]
also used n-gram based models & MEMM [Corbett & Copestake, 2008]
F1 of 0.807 on chemical journals, 0.832 on PubMed abstracts
MEMM has shorter training cycles, but suffers from the label-bias
problem
16
Chemical Entity Extraction from Text
Chemical Entity Extraction
CRF based approaches by [Sun et al., 2007, 2008, 2010; Klinger
et al., 2008]
avoids label-bias problem
requires more training time than MEMMs
testing time is comparable with MEMMs
better precision & recall values reported [Sun et al., 2010]
Classifiers such as SVMs can be used to tag chemical formulae
asymmetric binary classification problem on imbalanced data
many more false samples than true samples
precision and recall of true samples are more important than overall acc.
decision boundary dominated by false samples
cost sensitive classification & decision threshold tuning studied for
imbalanced data [Shanahan & Roma, 2003]
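The effect of decision threshold tuning on imbalanced data can be illustrated with a small Python sketch. The scores and labels below are made-up values standing in for the outputs of some formula classifier (for example, SVM decision values); the helper `precision_recall` is a hypothetical name, and the sketch does not reproduce the method of any cited paper.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of the 'true' class at a given decision threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up classifier scores: 2 true formulae among 8 candidate tokens.
scores = [0.9, 0.4, 0.35, 0.3, 0.2, 0.1, -0.2, -0.5]
labels = [True, True, False, False, False, False, False, False]

for threshold in (0.5, 0.38, 0.0):
    print(threshold, precision_recall(scores, labels, threshold))
```

Overall accuracy stays high at every threshold because false samples dominate, which is why precision and recall of the true samples are the quantities to tune against.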
17
Chemical Entity Extraction from Text
Chemical Entity Extraction
Feature Sets
utilizing parts-of-speech tagging tools (e.g., OpenNLP), a lexicon of
chemical terms (e.g., WordNet)
example of the feature set used in [Sun et al., 2010]
18
Chemical Name Segmentation
Chemical Name Segmentation
enables partial chemical name search
e.g., acetaldoxime is segmented into acet & aldoxime, aldoxime is
further segmented into ald & oxime.
if the end user searches for aldoxime or oxime, the documents referring
to acetaldoxime will be returned by the system
early approaches
breaking down the chemical name into its morphemes [Garfield,
1962]
does not work well since it attempts to match the longest string from the
right to left with dictionary entries
context free grammars [Cooke-Fox et al., 1989]
as people use chemical names that do not conform to formalized rules,
shown to be not effective by [Brecher, 1999]
19
Chemical Name Segmentation
Chemical Name Segmentation
OPSIN, a subsystem of OSCAR3 system [Corbett & Murray-
Rust, 2006]
an Open Parser for Systematic IUPAC Nomenclature (OPSIN)
open source license along with OSCAR3
used finite state grammar, ‘less expressive but more tractable’ than
context free grammars
their tokenization is based on “a list of multi-character tokens” and “a
set of regular expressions”; both created manually
20
Chemical Entity Indexing
Chemical Entity (Formula and Name) Indexing
set of partial formulae of the set of all chemical formulae is
quite large
many of them carry redundant information
index selected and discriminative partial formulae only
segmenting chemical names into “meaningful” substrings and
indexing them
e.g., for ‘methylethyl’, indexing ‘methyl’ & ‘ethyl’ is enough, while
‘hyleth’ is not necessary
21
Chemical Entity Indexing: Formulae
Chemical Formula Indexing
same molecule may have different formula representations
‘acetic acid’ can be represented as ‘CH3COOH’ and ‘C2H4O2’
same formula can represent different molecules
C2H4O2 can be ‘acetic acid’ (CH3COOH) or ‘methyl formate’
(CH3OCHO)
indexing all sub-formulae is prohibitively expensive
sub-formulae of CH3OH are C, H3, O, H, CH3, H3O, OH, CH3O,
H3OH, CH3OH
query logs would reveal which sub-formulae are cost-effective to
index
when query logs not available, assumption can be made that
infrequent sub-formulae will not be queried frequently [Yan et al.,
2004; Sun et al., 2010]
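The CH3OH example on this slide can be reproduced with a short Python sketch that enumerates every contiguous run of element tokens. The function names are hypothetical, and the tokenizer is an assumption: it only understands element symbols followed by an optional count, not parentheses, charges, or isotopes.

```python
import re


def tokenize_formula(formula: str) -> list:
    """Split a formula into element tokens, e.g. 'CH3OH' -> ['C', 'H3', 'O', 'H']."""
    return re.findall(r"[A-Z][a-z]?\d*", formula)


def sub_formulae(formula: str) -> list:
    """Enumerate distinct contiguous sub-formulae, shortest first."""
    tokens = tokenize_formula(formula)
    seen, result = set(), []
    for length in range(1, len(tokens) + 1):
        for start in range(len(tokens) - length + 1):
            sub = "".join(tokens[start:start + length])
            if sub not in seen:
                seen.add(sub)
                result.append(sub)
    return result


print(sub_formulae("CH3OH"))
```

A formula with n tokens has up to n(n+1)/2 such sub-formulae, so the index grows quadratically with formula length unless infrequent or redundant entries are pruned.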
22
Chemical Entity Indexing: Name
Chemical Name Indexing
before a chemical name is indexed, it should be segmented into
its sub-terms (or morphemes)
e.g., ‘10-Hydroxy-trans-3-oxadecalin’ will first be segmented into ‘10’,
‘hydroxy’, ‘trans’, ‘3’, ‘oxadecalin’
then those terms will further be segmented into their
subterms
e.g., ‘oxadecalin’ will be segmented into ‘oxa’ and ‘decalin’
frequent sub-names can be mined, and can then be used for
segmenting chemical names into sub-terms
maximal frequent subsequence mining can be found in [Yang, 2004]
frequent subsequence mining and hierarchical segmentation details
can be found in [Sun et al., 2010]
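As a rough illustration of mining frequent sub-names, the Python sketch below counts in how many names each substring occurs and keeps those above a support threshold. This is only the support-counting step under assumed parameters (`min_support`, `min_len`, and the toy name list are all hypothetical); it is not the maximal or independent frequent subsequence mining described in the cited papers.

```python
from collections import Counter


def frequent_substrings(names, min_support=2, min_len=3):
    """Map each substring to the number of names containing it, keeping frequent ones."""
    support = Counter()
    for name in names:
        substrings = {
            name[i:j]
            for i in range(len(name))
            for j in range(i + min_len, len(name) + 1)
        }
        support.update(substrings)  # each name counts once per substring
    return {s: c for s, c in support.items() if c >= min_support}


names = ["methyl", "ethyl", "methylethyl", "propyl"]
frequent = frequent_substrings(names)
print(frequent["methyl"], frequent["ethyl"], "hyleth" in frequent)
```

On this toy list, ‘methyl’ and ‘ethyl’ survive while ‘hyleth’ is dropped because it occurs in only one name. Fragments such as ‘thyl’ also survive, which is why a real system needs a further step to keep only meaningful sub-terms.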
23
Chemical Entity Search: Formula
Chemical Formula Search
can be grouped into 4 categories
Exact formula search:
a user specifying a query formula gets back documents having formulae
that match the query exactly
e.g., C1-2H4-6 will return CH4 or C2H6, but not H4C
Frequency search:
most current chemistry databases support frequency searches as the
only query models for formula searches
for a user query C2H4-6,
full frequency search returns formulae with two C and four to six H, but no other atoms
partial frequency search returns formulae with two C, four to six H, and any number of other atoms
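The two frequency search modes can be sketched in a few lines of Python. The query syntax handled here (an element symbol followed by a count or a `low-high` range) is inferred from the C2H4-6 example on this slide; the function names are hypothetical, and parentheses or group multipliers in formulae are not interpreted.

```python
import re
from collections import Counter


def parse_counts(formula):
    """Count atoms per element, e.g. 'C2H6O' -> {'C': 2, 'H': 6, 'O': 1}."""
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts


def parse_query(query):
    """Parse a frequency query, e.g. 'C2H4-6' -> {'C': (2, 2), 'H': (4, 6)}."""
    ranges = {}
    for element, low, high in re.findall(r"([A-Z][a-z]?)(\d*)(?:-(\d+))?", query):
        lo = int(low) if low else 1
        ranges[element] = (lo, int(high) if high else lo)
    return ranges


def frequency_match(formula, query, full=True):
    """Full mode forbids elements outside the query; partial mode allows them."""
    counts, ranges = parse_counts(formula), parse_query(query)
    in_range = all(lo <= counts.get(el, 0) <= hi for el, (lo, hi) in ranges.items())
    if full:
        return in_range and set(counts) == set(ranges)
    return in_range


print(frequency_match("C2H6", "C2H4-6"))               # full frequency
print(frequency_match("C2H6O", "C2H4-6"))              # full frequency
print(frequency_match("C2H6O", "C2H4-6", full=False))  # partial frequency
```

Because only atom counts are compared, atom order plays no role here, which is the main difference from exact formula search.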
24
Chemical Entity Search: Formula
Chemical Formula Search
Subsequence search
the system returns the documents with formulae that contain the
query formula as a subsequence
e.g., for query COOH,
COOH is exact match (high score), HOOC is reverse match (medium
score), CHO2 is parsed match (low score)
Similarity formula search
e.g., for query H2CO3
HC(O)OOH has higher ranking score than HNO3
computing similarity between the query formula and all formulae in
text is expensive
extract a feature vector of partial formulas out of the query formula
(where each dimension is an indexed partial formula), and calculate
the score accordingly [Haussler, 1999; Sun et al., 2010]
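The three match types named for subsequence search on this slide (exact, reverse, parsed) can be told apart with a simple Python sketch. The definitions used here are assumptions made for illustration: "exact" means the query tokens occur as a contiguous run, "reverse" means the reversed run occurs, and "parsed" means the formula contains at least the atom counts of the query. The actual scoring in the cited work may differ.

```python
import re
from collections import Counter


def tokens(formula):
    """Element tokens with optional counts, e.g. 'CHO2' -> ['C', 'H', 'O2']."""
    return re.findall(r"[A-Z][a-z]?\d*", formula)


def atom_counts(formula):
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts


def contains_run(haystack, needle):
    """True if the token list `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def match_type(formula, query):
    f, q = tokens(formula), tokens(query)
    if contains_run(f, q):
        return "exact"      # highest score
    if contains_run(f, q[::-1]):
        return "reverse"    # medium score
    fc, qc = atom_counts(formula), atom_counts(query)
    if all(fc[el] >= n for el, n in qc.items()):
        return "parsed"     # lowest score
    return None             # no match


for formula in ("CH3COOH", "HOOC", "CHO2", "HNO3"):
    print(formula, match_type(formula, "COOH"))
```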
25
Chemical Entity Search: Name
Chemical Name Search
Exact name search
the system returns the documents with the chemical names that
contain the exact query keyword
Substring name search
returns a ranked list of documents containing chemical names that
contain the user-provided keyword as a substring
if query string is indexed, results are retrieved directly
otherwise, query string is segmented hierarchically into substrings and
they are looked up
segmenting continues until a substring is retrieved in the index
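The lookup-with-fallback behaviour of substring name search can be sketched as follows. The index is assumed to be a plain dictionary from indexed sub-terms to sets of document ids (the ids below are made up), and the greedy longest-match segmentation is a simplification standing in for hierarchical segmentation. Intersecting the document sets only yields candidates; a real system would still verify that the full query string occurs in each one.

```python
def segment(query, index):
    """Greedy longest-match split of `query` into indexed sub-terms, or None."""
    pieces, pos = [], 0
    while pos < len(query):
        for end in range(len(query), pos, -1):
            if query[pos:end] in index:
                pieces.append(query[pos:end])
                pos = end
                break
        else:
            return None  # no indexed sub-term starts at this position
    return pieces


def substring_name_search(query, index):
    if query in index:                 # indexed: retrieve directly
        return set(index[query])
    pieces = segment(query, index)     # otherwise fall back to sub-terms
    if not pieces:
        return set()
    docs = set(index[pieces[0]])
    for piece in pieces[1:]:
        docs &= index[piece]
    return docs


# Hypothetical index: sub-term -> ids of documents containing it.
index = {
    "methyl": {1, 2, 3},
    "ethyl": {2, 3, 4},
    "acet": {5},
    "aldoxime": {5, 6},
    "oxime": {5, 6, 7},
}
print(substring_name_search("oxime", index))
print(substring_name_search("methylethyl", index))
```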
26
Chemical Entity Search
Conjunctive search
Conjunctive chemical formulae search
conjunctive searches of the basic chemical entity searches are
supported
for formulae that have two to four C, four to ten H, and a
subsequence of CH2
user can do a conjunctive search of a full frequency search C2-4H4-10, and
a subsequence formula search of CH2.
Conjunctive chemical name search
the user can define multiple substrings in a query so that a matching
chemical name must contain all of them
chemical names where both substrings appear in order are given
higher priority than those in which only one appears
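One possible reading of the priority rule for conjunctive name search is sketched below in Python. The three priority levels and the function name are assumptions made for illustration: names with both substrings in the given order rank first, names with both in any other arrangement rank second, and names with only one rank last.

```python
def conjunctive_name_search(names, first, second):
    """Rank names by how well they satisfy a two-substring conjunctive query."""
    def priority(name):
        i = name.find(first)
        j = name.find(second, i + len(first)) if i >= 0 else -1
        if j >= 0:
            return 0  # both substrings, in query order
        if first in name and second in name:
            return 1  # both substrings, but not in query order
        if first in name or second in name:
            return 2  # only one substring
        return None   # neither substring: not returned

    ranked = [(priority(name), name) for name in names]
    hits = [item for item in ranked if item[0] is not None]
    return [name for _, name in sorted(hits, key=lambda item: item[0])]


names = ["chloroform", "benzyl chloroformate", "toluene", "chlorobenzene"]
print(conjunctive_name_search(names, "chloro", "benz"))
```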
27
Chemical Entity Search
Query rewriting
when a user inputs a query that contains chemical formula,
chemical names as well as other keywords, the process of a
search engine is as follows:
chemical entity searches are executed to find desired names and
formulae
returned entities as well as other keywords (non-chemical-formula
and non-chemical-name) are used to retrieve related documents
TF.IDF can be used as the ranking function in the second stage
the ranking scores of each returned chemical entity in the first stage
can be used as weights of the TF.IDF of each chemical entity when
computing the ranking score in the second stage [Sun et al., 2010]
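The two-stage ranking described on this slide can be sketched in Python as follows. Documents are assumed to be token lists, the first-stage entity scores are made-up numbers, plain keywords get a weight of 1, and the TF.IDF variant (raw term frequency times log of N over document frequency) is one common choice rather than the exact function used in the cited work.

```python
import math


def tfidf(term, doc, docs):
    """Raw term frequency times log inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0


def rank(docs, entity_scores, keywords):
    """Rank document indices; entity TF.IDF is weighted by first-stage score."""
    weights = dict.fromkeys(keywords, 1.0)
    weights.update(entity_scores)
    scored = [
        (sum(w * tfidf(t, doc, docs) for t, w in weights.items()), i)
        for i, doc in enumerate(docs)
    ]
    return [i for _, i in sorted(scored, key=lambda item: (-item[0], item[1]))]


docs = [
    ["CH3COOH", "synthesis"],
    ["C2H4O2", "synthesis", "yield"],
    ["ethanol", "yield"],
]
# Hypothetical first-stage output for a formula query on acetic acid.
entity_scores = {"CH3COOH": 1.0, "C2H4O2": 0.5}
print(rank(docs, entity_scores, keywords=["synthesis"]))
```

The document mentioning the better-matching entity ranks first even though both of the top two documents contain the keyword, which is the intended effect of carrying first-stage scores into the second stage.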
28
Text Retrieval Conference: Chemical Track
TREC 2009, 2010 Chemical Tracks
large scale domain specific (i.e., chemistry) IR evaluation tasks
following Legal Track and Genomics Tracks @ TREC
Data
1.3 million patent files from the chemical domain (classified under IPC
codes C and A61K)
All data in structured XML format
fields such as title, abstract, claims (for patents) can be identified easily and
should be utilized for the tasks
DTDs are available for both patents and the scientific articles
images are available when the publisher provides them
chemical structure information is available in the form of CDX and
MOL files
29
Text Retrieval Conference: Chemical Track
TREC 2009, 2010 Chemical Tracks
Data
data from 3 major patent offices:
USPTO (US patent office),
EPO (European Patent Office), and
WIPO (World Intellectual Property Organization)
181,076 scientific articles from
The Royal Society of Chemistry
all Open Access journals from PubMed Central (as it was in Jan 2010)
Oxford Publishing,
Hindawi Publishing
International Union of Crystallography
Molecular Diversity Preservation International
30
Text Retrieval Conference: Chemical Track
TREC 2009, 2010 Chemical Tracks
Technology Survey
similar to the search engine / information need scenario described in
previous slides
30 expert created queries
In 2009: the task is to find relevant documents for a natural language
expression of an information need
In 2010: two tasks
first, the same task as in 2009, i.e., a natural language query
second, structure search, the query is a chemical structure rather than a
chemical name
utilization of the document structure is crucial
utilization of the chemical entity mentions is crucial, yet quite hard in
such large scale data
31
Text Retrieval Conference: Chemical Track
TREC 2009, 2010 Chemical Tracks
Prior-Art Search
given a patent file, the task is to find all relevant patent files for the
query patent
1000 query patents
the automatic evaluation process compares the set of identified relevant
patents with the set of known references from the query patent
generation of the search query from the given query patent is crucial
query patent is too long
structured nature of the query patent should be utilized
different fields give different information
chemical entity identification is very important, but hard to deal with
in such large scale data
re-ranking with respect to International Patent Classification (IPC) codes is
beneficial
32
References
Main References:
Sun, B., Mitra, P., Giles, L., Mueller, K. Identifying, indexing,
ranking chemical formulae and chemical names in digital
text. ACM TOIS, 2010.
Sun, B., Mitra, P., Giles, L. Mining, indexing, and searching for
textual chemical molecule information on the web. WWW,
2008.
Sun, B., Tan, Q., Mitra, P., Giles, L. Extraction and search of
chemical formulae in text document on the web. WWW,
2007.
Klinger, R., Kolarik, C., Fluck, J., Hofmann-Apitius, M., Friedrich, C.
Detection of IUPAC and IUPAC-like chemical names.
Bioinformatics, 2008.
33
References
Main References:
Corbett, P., Murray-Rust, P. High-throughput identification of
chemistry in life-science texts. CompLife, 2006.
For original images & references to the mentioned tools, please
either conduct an online search with their names or refer to
the original articles above
34
Questions ?
Please let us know in case of any
questions/issues!
Further info: {scetinta, lsi}@cs.purdue.edu
35