Post on 14-Dec-2015
TREC-CHEM: The TREC Chemical IR Track
Mihai Lupu (1), John Tait (1), Jimmy Huang (2), Jianhan Zhu (3)
(1) Information Retrieval Facility; (2) York University; (3) University College London
Network of excellence co-funded by the 7th Framework Program of the European Commission, grant agreement number 258191
Motivation
• Increased awareness on the part of industry and regulatory authorities
  – Particularly in human-related chemistry (pharma and cosmetics)
  – Particularly in IP-related contexts
• Increased availability of data and meta-data
• Different demands from professional users compared with other evaluation campaigns
Partners
• Collaboration
  – National Institute of Standards and Technology (US)
  – University College London (UK)
  – York University (Canada)
• Support from
  – Royal Society of Chemistry
  – Open access publishers
  – Experts in the field
• With the participation of
  – Research groups
Aims
• Assess the available Chemical Retrieval tools
• Generate interest among research groups for this domain
• Stimulate participation from industry
• Generate new Chemical Retrieval tools, at the intersection of chemoinformatics and text-mining
Data
• 2 collections
• 2009
  – 1.2 million patent documents
  – 50k scientific articles
  – text only
• 2010
  – 1.3 million patent documents
  – 172k scientific articles
  – text, images, structure information available
2010 Data
• Patent data
  – Addition of WIPO patents
  – Addition of attachments (images, structure data)
• Scientific articles
  – 3-fold increase, with attachments
  – Large mass from PubMed
  – Some directly from open access publishers: IUCr Journals, Oxford Publishers, Hindawi Publishers, MPCI
2010 Data
• Patent data across IPC classes (chart labels)
  – Organic Chemistry
  – Medical or Veterinary science; Hygiene
  – Organic macromolecular compounds
  – Biochemistry
  – Physical or chemical processes or apparatus in general
  – Dyes; Paints; Polishes…
  – Petroleum; Gas..
Tasks
• Technology Survey (TS)
  – Search for all potentially relevant documents, in both patents and scientific articles
  – 30 manually defined and evaluated topics
• Prior Art (PA)
  – Search for patents that may invalidate a given patent
  – 1000 automatically created and evaluated topics (1000 patent files)
PA topics
• Tagline: recreate the citation list created by the patent examiner
• topic = patent application document
• evaluation based on
  – applicant’s citations
  – examiner’s report
  – opposition citations (if any)
• only patent corpus used
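One way to read the evaluation rule above is as a union of the citation sources for each topic patent. The following is a minimal sketch under that assumption; the patent identifiers, the function names and the use of recall as the measure are all hypothetical illustrations, since the slides do not state the file formats or the measures that were reported.

```python
def build_pa_qrels(applicant, examiner, opposition=()):
    """Return the set of relevant patents for one topic.

    Assumption: a patent counts as relevant if it appears in any of the
    three citation sources (applicant, examiner, opposition).
    """
    return set(applicant) | set(examiner) | set(opposition)


def recall(retrieved, relevant):
    """Fraction of the relevant patents that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


# Hypothetical identifiers, for illustration only.
qrels = build_pa_qrels(
    applicant=["EP-0001", "US-0002"],
    examiner=["US-0002", "WO-0003"],
)
run = ["US-0002", "EP-9999", "WO-0003"]
print(sorted(qrels))
print(recall(run, qrels))
```

Because the sources overlap, a set union keeps each cited patent once, however many sources list it.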
TS topics
• topic = natural language information request
• evaluation done manually by
  – junior evaluators (students, others)
  – senior evaluators (topic creators)
• both patent and scientific articles requested
TS topics - example
<topic>
<number>TS-23</number>
<title>Titanium tetrafluoride for improving dental health</title>
<narrative>
Titanium tetrafluoride can be used to prevent dental caries or tooth decay along with other fluoride containing compounds. We are specifically looking for the use of Titanium tetrafluoride for improving dental health or preventing decay.
</narrative>
<details>
<chemicals>titanium tetrafluoride</chemicals>
<condition>tooth decay</condition>
</details>
<relevance>
A document will be considered RELEVANT if it refers to the use of titanium tetrafluoride for improving dental health, including caries or tooth decay
A document will be considered HIGHLY RELEVANT when it is RELEVANT and it refers to the use of titanium tetrafluoride within a product such as toothpaste or mouthwash.
</relevance>
</topic>
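A topic in this layout can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on an abbreviated copy of TS-23; the `parse_topic` helper and the dictionary it returns are illustrative assumptions, not part of the track's tooling.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the TS-23 topic (narrative and relevance omitted).
TOPIC_XML = """<topic>
<number>TS-23</number>
<title>Titanium tetrafluoride for improving dental health</title>
<details>
<chemicals>titanium tetrafluoride</chemicals>
<condition>tooth decay</condition>
</details>
</topic>"""


def parse_topic(xml_text):
    """Extract the fields a retrieval system would typically query on."""
    root = ET.fromstring(xml_text)
    details = root.find("details")

    def texts(tag):
        # <details> may be empty (as in TS-47), so guard against missing nodes.
        if details is None:
            return []
        return [e.text.strip() for e in details.findall(tag) if e.text]

    return {
        "number": root.findtext("number", default="").strip(),
        "title": root.findtext("title", default="").strip(),
        "chemicals": texts("chemicals"),
        "conditions": texts("condition"),
    }


topic = parse_topic(TOPIC_XML)
print(topic)
```

Returning lists for `chemicals` and `conditions` keeps the helper usable if a topic names several of either, and yields empty lists when `<details>` has no children.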
TS topics - example
<topic>
<number>TS-47</number>
<title>Structure Search</title>
<narrative>
We are looking for patents and papers on use of the chemical described in TS-47.mol and TS-47.png for treating dementia.
</narrative>
<details></details>
<relevance>
A document will be considered RELEVANT if it refers to the use of chemical X for treating dementia
There are no HIGHLY RELEVANT documents.
</relevance>
</topic>
Participants
• 13 participants registered to download the data
• PA
  – 4 groups submitted 10 runs
  – BiTeM Geneva, York University, Fraunhofer SCAI, Iowa University
• TS
  – 2 groups submitted 12 runs
  – BiTeM Geneva, York University
Methods
• Basic Probabilistic Model, Language Model and Vector Space Model
  – Different sections, weights on each section
  – BM25
• Additional filtering/weighting based on IPC codes
• Linguistic processing
  – Emphasis on noun phrases (NP)
• Concept based search
  – Query expansion
  – Using Oscar3, MeSH
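For readers unfamiliar with BM25, the ranking function named above can be sketched in a few lines. This is a generic textbook variant (with the non-negative IDF form popularised by Lucene), not the configuration any participant used; the parameter defaults `k1=1.2` and `b=0.75`, the toy documents and the function name are assumptions, and section weighting and IPC filtering are left out.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: long documents are penalised via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents, for illustration only.
docs = [
    "titanium tetrafluoride prevents dental caries".split(),
    "fluoride toothpaste for dental health".split(),
    "catalyst for petroleum cracking".split(),
]
scores = bm25_scores(["titanium", "tetrafluoride", "dental"], docs)
print(scores)
```

Per-section weights could be layered on top by scoring each section separately and combining the results with a weighted sum, which is one plausible reading of "weights on each section".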
Methods
• The addition of non-text data did not impact the methods
  – only 2 TS topics were purely structure based
• TODO
  – define interesting structure based topics
  – find ways to solve them
Evaluation – PA topics
[Figure: diagram with nodes "Topic Patent", "D", "Family Member" (sibling), "F1", "F2" and "F3", connected by edges labelled "cites".]
Evaluation
• TS topics
  – Due to low participation, the pooling method might have resulted in biased results
  – However, we still wanted to provide feedback to the 2 participating groups
  – Evaluated 6 topics: TS-21, TS-23, TS-30, TS-35, TS-36 and TS-43
Evaluation
• TS topics – qrels

Topic   #pooled   #sampled   #relevant   #highly relevant   #non relevant
TS-21   4500      616        16          2                  597
TS-23   4762      648        2           4                  641
TS-30   3852      525        5           3                  517
TS-35   6036      797        5           3                  789
TS-36   5048      679        62          13                 594
TS-43   6005      761        74          15                 672
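The figures in the table can be turned into two simple ratios per topic: how much of the pool was sampled, and what share of the sample was judged relevant or highly relevant. The sketch below assumes that "#sampled" is the number of pooled documents drawn for judging; the ratios are plain arithmetic on the table and not statistics reported by the track.

```python
# (topic, pooled, sampled, relevant, highly_relevant, non_relevant),
# copied from the qrels table.
QRELS = [
    ("TS-21", 4500, 616, 16, 2, 597),
    ("TS-23", 4762, 648, 2, 4, 641),
    ("TS-30", 3852, 525, 5, 3, 517),
    ("TS-35", 6036, 797, 5, 3, 789),
    ("TS-36", 5048, 679, 62, 13, 594),
    ("TS-43", 6005, 761, 74, 15, 672),
]


def summarize(row):
    """Compute the sampling rate and the relevant share for one topic."""
    topic, pooled, sampled, relevant, highly_relevant, _non_relevant = row
    return {
        "topic": topic,
        "sampling_rate": sampled / pooled,
        "relevant_share": (relevant + highly_relevant) / sampled,
    }


summaries = [summarize(row) for row in QRELS]
for s in summaries:
    print(f"{s['topic']}: sampled {s['sampling_rate']:.1%} of pool, "
          f"{s['relevant_share']:.1%} of sample relevant")
```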
Conclusions & Outlook
• This year, more than the last, was a dry-run for the next campaign
• Fixed test collection
• 24 TS topics still to use next year
• Main objective for 2011
  – More collaboration between structure-based search and text-mining