UIC at TREC 2006: Genomics Track Wei Zhou, Clement T. Yu University of Illinois at Chicago Nov. 16, 2006

3 stages

Stage 1: Conversion - Greek letters → English words

Stage 2: Paragraph retrieval - retrieve the 2,000 most relevant paragraphs

Stage 3: Passage extraction and ranking - extract and retrieve the 1,000 most relevant passages

Stage 1: conversion

Convert the Greek letters into English words, for example,

TGF β1 → TGF beta1

(β, in the HTML documents, may be represented by “&#223” or “beta.gif”)
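
The conversion step can be illustrated with a minimal sketch. The function name `convert_greek`, the letter table, and the regular expressions are assumptions made for illustration; the slides only state that β may appear as a character, as "&#223", or as an image named "beta.gif".

```python
import re

# Illustrative subset of Greek letters (assumption: the real table is larger).
GREEK_TO_ENGLISH = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}

# Entity forms of beta: "&#223" is the form named on the slide;
# "&#946;" and "&beta;" are the standard HTML references for β.
BETA_ENTITY = re.compile(r"&#223(?!\d);?|&#946;|&beta;")

# Image stand-ins such as <img src="beta.gif"> (hypothetical markup shape).
GREEK_IMAGE = re.compile(
    r'<img[^>]*src="[^"]*?(alpha|beta|gamma|delta|kappa)\.gif"[^>]*>',
    re.IGNORECASE,
)


def convert_greek(html: str) -> str:
    """Replace Greek letters, beta entities, and letter images with English words."""
    text = GREEK_IMAGE.sub(lambda m: m.group(1).lower(), html)
    text = BETA_ENTITY.sub("beta", text)
    for letter, word in GREEK_TO_ENGLISH.items():
        text = text.replace(letter, word)
    return text


print(convert_greek("TGF β1"))  # TGF beta1
```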

Stage 2: paragraph retrieval

The goal of this stage is to retrieve the 2,000 most relevant paragraphs.

Several techniques are utilized:
1. conditional Porter stemming
2. gene symbol lexical variants handling
3. concept retrieval IR model
4. query expansion
5. abbreviation correction

Stage 2: paragraph retrieval - conditional Porter stemming

Potential errors of the Porter stemmer:

Type 1: gene symbol → non-gene word
e.g., “Pes” → “Pe”, “IDE” → “ID”

Type 2: non-gene word → gene symbol
e.g., “IDEE” → “IDE”

Solution: a table (from the Entrez Gene database) containing all the gene symbols is maintained.
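
One way the gene symbol table could guard the stemmer is sketched below. The slides do not say exactly how the table is consulted, so the two checks are an assumption. `toy_stem` is a deliberately tiny stand-in for the Porter stemmer and `GENE_SYMBOLS` is a two-entry stand-in for the Entrez Gene table.

```python
from typing import Callable, Set

# Stand-in for the Entrez Gene symbol table (lowercased for lookup).
GENE_SYMBOLS: Set[str] = {"pes", "ide"}


def toy_stem(word: str) -> str:
    """Tiny stand-in for Porter stemming: drop one trailing 's' or 'e'."""
    if len(word) > 2 and word[-1].lower() in ("s", "e"):
        return word[:-1]
    return word


def conditional_stem(word: str, stem: Callable[[str], str], gene_symbols: Set[str]) -> str:
    """Stem a word unless doing so would create a Type 1 or Type 2 error."""
    if word.lower() in gene_symbols:
        # Type 1 guard: never stem a known gene symbol.
        return word
    stemmed = stem(word)
    if stemmed.lower() in gene_symbols:
        # Type 2 guard: do not let a non-gene word collapse into a gene symbol.
        return word
    return stemmed


print(conditional_stem("IDE", toy_stem, GENE_SYMBOLS))    # IDE
print(conditional_stem("IDEE", toy_stem, GENE_SYMBOLS))   # IDEE
print(conditional_stem("genes", toy_stem, GENE_SYMBOLS))  # gene
```

Passing the stemmer in as a parameter keeps the guard independent of any particular stemming library.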

Stage 2: paragraph retrieval - handling lexical variants of gene symbols

2 strategies:

Strategy 1: automatically generate lexical variants (Buttcher, 2004; Huang, 2005).
e.g., PLA2 → PLA 2, PLAII, and PLA II

Strategy 2: retrieve additional lexical variants from a term database of MEDLINE (Zhou, 2006).
e.g., PLA2 → PL-A2

Note: PLA2 stands for Phospholipase A2.
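
Strategy 1 can be sketched for the simple case of a symbol made of letters followed by one digit. The rules below (optional space, Arabic to Roman numeral) are an assumption chosen to reproduce the PLA2 example; the cited systems may use different or additional rules, and `lexical_variants` is a hypothetical name.

```python
import re
from typing import List

# Roman numerals for the single digits handled by this sketch.
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}


def lexical_variants(symbol: str) -> List[str]:
    """Generate spacing and numeral variants for symbols like 'PLA2'."""
    match = re.fullmatch(r"([A-Za-z]+)[\s-]?(\d)", symbol)
    if match is None:
        # Symbol shape not covered by this sketch: return it unchanged.
        return [symbol]
    letters, digit = match.groups()
    numerals = [digit]
    if digit in ROMAN:
        numerals.append(ROMAN[digit])
    # Combine each numeral form with and without a separating space.
    return [letters + sep + num for num in numerals for sep in ("", " ")]


print(lexical_variants("PLA2"))  # ['PLA2', 'PLA 2', 'PLAII', 'PLA II']
```

Variants such as PL-A2 are not derivable from rules like these, which is where a lookup in a term database (Strategy 2) comes in.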

Stage 2: paragraph retrieval - concept retrieval (IR model)

Definition: A concept is a biomedical meaning or sense.
1) a gene and its synonym set refer to the same concept;
2) a MeSH term and its synonym set refer to the same concept.

Stage 2: paragraph retrieval - concept retrieval (IR model)

Assumption: Okapi does not work well if the query contains multiple concepts. For example:

q: “role of gene PRNP in mad cow disease.” (concept 1: PRNP; concept 2: mad cow disease)
d1: has many occurrences of concept 2
d2: has a small number of occurrences of both concepts

Okapi: sim(q,d1) > sim(q,d2), but intuitively d2 is more relevant than d1.

Stage 2: paragraph retrieval - concept retrieval (IR model)

According to our model (Liu, 2004; UIC Robust track, 2005), we have:

sim(q,d2) > sim(q,d1)

because:

sim_concept(q,d2) > sim_concept(q,d1)

although

sim_word(q,d2) < sim_word(q,d1)

sim_concept(q,d) includes both concept 1 & concept 2
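
The ordering above can be sketched as a two-level sort. The slides do not spell out how the concept and word similarities are combined, so this assumes a lexicographic ordering: concept similarity decides, and word similarity only breaks ties. The scores for d1 and d2 are made-up numbers chosen to mirror the PRNP example.

```python
from typing import List, NamedTuple


class Scored(NamedTuple):
    doc_id: str
    concept_sim: float  # similarity computed over matched query concepts
    word_sim: float     # similarity computed over individual words


def rank_by_concept_then_word(docs: List[Scored]) -> List[Scored]:
    """Rank by concept similarity first; word similarity breaks ties."""
    return sorted(docs, key=lambda d: (d.concept_sim, d.word_sim), reverse=True)


# d1 matches only concept 2, many times; d2 matches both concepts a few times.
d1 = Scored("d1", concept_sim=1.0, word_sim=9.0)
d2 = Scored("d2", concept_sim=2.0, word_sim=3.0)

ranking = rank_by_concept_then_word([d1, d2])
print([d.doc_id for d in ranking])  # ['d2', 'd1']
```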

Stage 2: paragraph retrieval - query expansion

Synonyms
Hyponyms (more specific terms)
Pseudo-feedback
Related terms

Stage 2: paragraph retrieval - query expansion using biomedical knowledge

Related terms (co-occur frequently and are semantically related)

q: "How do interactions between HNF4 and COUP-TF1 suppress liver function?"

There exist relationships between the semantic type of a related term and the semantic type of each query concept in the UMLS Semantic Network.

Related terms for HNF4 and COUP-TF I: Liver, Hepatocytes, Hepatoblastoma, Gluconeogenesis, Hepatitis B virus
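
The two conditions for a related term (frequent co-occurrence and a semantic-type relationship with each query concept) can be sketched as a filter. Everything concrete here is a toy assumption: the relation table is not copied from the UMLS Semantic Network, the co-occurrence counts are invented, and `select_related_terms` and its threshold are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Toy table of related semantic-type pairs (NOT taken from UMLS).
RELATED_TYPES: Set[Tuple[str, str]] = {
    ("Gene or Genome", "Body Part, Organ, or Organ Component"),
    ("Gene or Genome", "Cell"),
    ("Gene or Genome", "Neoplastic Process"),
}


def types_related(a: str, b: str) -> bool:
    """Treat the toy relation table as symmetric."""
    return (a, b) in RELATED_TYPES or (b, a) in RELATED_TYPES


def select_related_terms(
    candidates: Iterable[Tuple[str, str, int]],
    query_types: List[str],
    min_cooccurrence: int,
) -> List[str]:
    """Keep candidates that co-occur often and relate to every query concept type."""
    selected = []
    for term, semantic_type, count in candidates:
        if count < min_cooccurrence:
            continue  # fails the co-occurrence condition
        if all(types_related(semantic_type, q) for q in query_types):
            selected.append(term)
    return selected


# (term, semantic type, invented co-occurrence count)
candidates = [
    ("Liver", "Body Part, Organ, or Organ Component", 120),
    ("Hepatocytes", "Cell", 80),
    ("Temperature", "Quantitative Concept", 200),
    ("Hepatoblastoma", "Neoplastic Process", 2),
]
print(select_related_terms(candidates, ["Gene or Genome"], min_cooccurrence=5))
# ['Liver', 'Hepatocytes']
```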

Stage 2: paragraph retrieval - avoid incorrect match of abbreviations

Given a query with both an abbreviation of a gene symbol and its full form, a document will match the term only if both the abbreviation and the full form are matched. For example,

q: role of APC (adenomatous polyposis coli) in colon cancer?

d: “…Much work has been undertaken in recent decades with the aim of producing projections of future cancer incidence and mortality rates from observed rates by using age-period-cohort (APC) models…”

Here d contains “APC” but not “adenomatous polyposis coli”, so d does not match the gene.

Notice that gene symbols are usually abbreviations, which are very ambiguous in the biomedical literature.
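
The matching rule can be sketched as a simple conjunction. The function name `matches_gene` and the string tests (whole-word, case-sensitive match for the abbreviation; case-insensitive substring for the full form) are assumptions, not details given on the slide.

```python
import re


def matches_gene(text: str, abbreviation: str, full_form: str) -> bool:
    """Accept a document only if it contains both the abbreviation and the full form."""
    # Whole-word, case-sensitive match for the abbreviation (e.g. "APC").
    has_abbreviation = re.search(r"\b" + re.escape(abbreviation) + r"\b", text) is not None
    # Case-insensitive substring match for the full form.
    has_full_form = full_form.lower() in text.lower()
    return has_abbreviation and has_full_form


d = ("Much work has been undertaken in recent decades with the aim of producing "
     "projections of future cancer incidence and mortality rates from observed "
     "rates by using age-period-cohort (APC) models")
print(matches_gene(d, "APC", "adenomatous polyposis coli"))  # False
```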

Stage 3: passage extraction and ranking

The goal of this stage is to take the output of stage 2 (i.e., 2,000 most relevant paragraphs) and identify the 1,000 most relevant passages (i.e., one or more consecutive sentences within paragraphs).

Stage 3: passage extraction and ranking - extraction

The criterion for the optimal passage in a paragraph is given by:

“Given various windows of different sizes, choose the one which has the maximum number of query concepts and the smallest size.”
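
The criterion can be sketched as an exhaustive search over windows of consecutive sentences. Two things are assumptions here: window size is measured in characters, and a concept counts as present when its name appears as a case-insensitive substring (the real system would also match synonyms). `best_passage` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def best_passage(sentences: List[str], concepts: List[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) sentence indices of the window with the most
    query concepts; ties are broken by the smallest character length."""
    best_span = None
    best_key = None
    for start in range(len(sentences)):
        for end in range(start + 1, len(sentences) + 1):
            window = " ".join(sentences[start:end])
            lowered = window.lower()
            hits = sum(1 for concept in concepts if concept.lower() in lowered)
            key = (hits, -len(window))  # more concepts first, then shorter
            if best_key is None or key > best_key:
                best_key, best_span = key, (start, end)
    return best_span


paragraph = [
    "PRNP encodes the prion protein.",
    "The weather was mild.",
    "Mad cow disease is linked to PRNP mutations.",
    "Cattle were tested.",
]
print(best_passage(paragraph, ["PRNP", "mad cow disease"]))  # (2, 3)
```

The search is quadratic in the number of sentences, which is acceptable for single paragraphs.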

Stage 3: passage extraction and ranking - ranking

The ranking of passages is similar to the ranking of paragraphs. For each passage, we computed its concept similarity and word similarity with the query. Then the concept retrieval model is applied for the ranking.

Experiment results

3 runs:

UICgen1: the top 1,000 most relevant paragraphs were returned as the passages.

UICgen2: the top 1,000 optimal passages according to the criterion were returned (some bugs).

UICgen3: same as UICgen2, except the bugs were removed.

Experiment results

Document
Run      MAP     # best  # > Median
UICgen1  0.5439  3       25
UICgen2  0.5268  2       25
UICgen3  0.5320  3       25

Passage
Run      MAP     # best  # > Median
UICgen1  0.0750  0       25
UICgen2  0.1243  0       25
UICgen3  0.1479  7       25

Aspect
Run      MAP     # best  # > Median
UICgen1  0.4411  7       25
UICgen2  0.3478  1       23
UICgen3  0.3492  1       24

References

Buttcher S, Clarke CLA, Cormack GV. Domain-specific synonym expansion and validation for biomedical information retrieval (MultiText experiments for TREC 2004). The Thirteenth Text REtrieval Conference (TREC 2004) Proceedings, 2004, Gaithersburg, MD.

Huang X, Zhong M, Si L. York University at TREC 2005: Genomics Track. The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005, Gaithersburg, MD.

Zhou W, Torvik VI, Smalheiser NR. ADAM: Another Database of Abbreviations in MEDLINE. Bioinformatics 2006; 22(22): 2813-2818.

Liu S, Liu F, Yu C, Meng WY. An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases. Proceedings of the 27th Annual International ACM SIGIR Conference, pp. 266-272, Sheffield, UK, July 2004.

Liu S, Yu C. UIC at TREC 2005: Robust Track. The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005, Gaithersburg, MD.

Questions

Thanks!