DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval

Zhiwen Tang, Grace Hui Yang

InfoSense, Department of Computer Science

Georgetown University

AAAI 2019 @ Honolulu, Hawaii

Neural Information Retrieval (Neu-IR)


• Ad-hoc Information Retrieval: to satisfy a user’s information need by retrieving and ranking documents from a document collection. E.g. Google, Lucene

[Figure: a query is matched to documents by relevance]

• Neu-IR: Retrieval systems based on deep neural networks

What is Relevance: A Visualization Perspective


scrollbar-based visualization, by Byrd, DL’99

Fisheye interface, by Hornbæk and Frøkjær, CHI’01

TileBars, by Hearst, CHI’95

• Visualizations offer efficient, direct, and informative feedback to a search engine user.

• What makes a visualization practically valuable to humans might also be valuable to a deep neural network, because of the network's resemblance to human neurons

• Can we prove that?

TileBars

• TileBars visualizes how query terms are distributed within a document and produces a compact image that explicitly shows the relevance (the matches) to the user.

• Proposed by Marti A. Hearst in the 90s. (Hearst, CHI’95)

• Segment documents by topic

• Visualize the query term hits within each segment

• Display the query terms vertically and the document segments horizontally

• The darker the cell, the more relevant the text

• It is called Word-to-Segment Matching
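The grid described above can be sketched in a few lines of Python. This is only an illustration, not Hearst's implementation: the function names `tilebars_grid` and `render`, the hand-split segments, and the ASCII shading are all assumptions made for the example.

```python
from collections import Counter

def tilebars_grid(query_terms, segments):
    """Return grid[i][j] = number of hits of query term i in segment j."""
    counts = [Counter(word.lower() for word in seg.split()) for seg in segments]
    return [[c[term.lower()] for c in counts] for term in query_terms]

def render(grid, query_terms, shades=" .:#"):
    """Draw one row per query term; denser characters mean more hits."""
    lines = []
    for term, row in zip(query_terms, grid):
        cells = "".join(shades[min(hits, len(shades) - 1)] for hits in row)
        lines.append(f"{term:>10} |{cells}|")
    return "\n".join(lines)

# Hypothetical segments, split by hand rather than by topic segmentation.
segments = [
    "california tax attorney blog tax avoidance or tax evasion",
    "employment tax evasion schemes for california employers",
    "fast food franchise opportunity and small business franchise",
]
query = ["california", "franchise", "tax", "board"]
grid = tilebars_grid(query, segments)
print(render(grid, query))
```

Query terms run down the rows and segments run across the columns, matching the layout in the bullets above.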


Matching Patterns


Relevant

Relevant

Irrelevant

Irrelevant

en0005-66-32430: January 2008 :: California Tax Attorney Blog California Tax Attorney Blog Published by California Tax Lawyer Mitchell A. Port Blog Home Firm Website Practice Areas Contact Us Home > January 2008

January 30, 2008 Tax Avoidance Or Tax Evasion? Employment Tax Evasion Schemes California employers: be

en0010-68-18656: Significant Changes to U.S. Taxation of Expatriating Citizens and Long-Term Residents. [21K] Printed (pdf) version. [172K] October 2005 United States and Sweden Sign New Income Tax Treaty Protocol. [8K] Printed (pdf) version. [127K] November 2004 IRS Proposes Section 482 Regulations on Intangible Property and

en0004-27-03654: Franchise business opportunity - Fast food franchise opportunity, Food franchise opportunity,International franchise opportunity, Commercial janitorial franchise opportunity, Small business franchise,Quiznos franchise for sale - Franchise business opportunity Franchise opportunity in Canada Carpet flooring

en0004-27-03654: Franchise business opportunity - Fast food franchise opportunity, Food franchise opportunity,International franchise opportunity, Commercial janitorial franchise opportunity, Small business franchise,Quiznos franchise for sale - Franchise business opportunity Franchise opportunity in Canada Carpet flooring

Columns: Documents | Word-to-Word matching | Word-to-Segment matching

Query: California Franchise Tax Board

Word-to-Word matching: difficult to distinguish relevant from irrelevant

Word-to-Segment matching: easy to observe consecutive matching blocks

Given a query, common relevance matching patterns include (Skorochod’ko 1971; Hearst 1997):

• Chained

• Ringed

• Monolith

• Piecewise

• Hierarchical

[Figure panels: Chained, Ringed, Monolith, Piecewise]

Case Study: Spam & Documents


Excerpt from a web document:

Fast food franchise opportunity Restaurant franchise opportunity franchise opportunity uk Top franchise opportunity Small business franchise opportunity Canadian franchise opportunity New franchise opportunity Retail franchise opportunity Home based franchise opportunity Home based franchise opportunity business franchise opportunity starbucks Start up franchise opportunity Top ten franchise opportunity National franchise business opportunity show Fitness franchise opportunity franchise business opportunity show Entrepreneur franchise opportunity franchise opportunity in canada franchise business opportunity for sale franchise buying opportunity franchise opportunity canada Car wash franchise opportunity Us franchise opportunity Business directory franchise opportunity New franchise business opportunity Carpet flooring franchise opportunity Business franchise in opportunity Low cost franchise opportunity Child franchise opportunity Learning franchise opportunity Food franchise opportunity Coffee franchise opportunity International franchise opportunity Business franchise massage opportunity Small franchise opportunity

• Web documents often contain scripts, advertisements, and spam.

• Using word-to-word matching, these irrelevant documents may appear to be even “more” relevant to a query:

• E.g. For query “California Franchise Tax Board”, we can see that “franchise” appears repeatedly in a long spam passage.

• Luckily, using word-to-segment matching, a spam passage only contributes to one cell

• Easier to distinguish them from the relevant ones
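A toy comparison makes the point concrete. The sketch below is an assumption-laden illustration, not the paper's scoring: it counts raw word-to-word matches versus the number of non-empty (query term, segment) cells, and it assumes the spam passage is segmented as one block. Both function names and both toy documents are hypothetical.

```python
def word_to_word_hits(query_terms, doc_tokens):
    """One match for every (query term, document token) pair that is equal."""
    return sum(1 for tok in doc_tokens for q in query_terms if tok == q)

def word_to_segment_cells(query_terms, segments):
    """Number of (query term, segment) cells with at least one hit."""
    return sum(1 for seg in segments for q in query_terms if q in seg)

query = ["california", "franchise", "tax", "board"]

# Hypothetical spam passage: one long, topically uniform run of text.
spam = ("franchise opportunity " * 12).split()
spam_segments = [spam]  # assumption: the whole passage is a single segment

# Hypothetical relevant document with two short segments.
relevant_segments = [
    "california franchise tax board forms".split(),
    "tax board filing deadlines in california".split(),
]
relevant_tokens = [t for seg in relevant_segments for t in seg]

print(word_to_word_hits(query, spam), word_to_word_hits(query, relevant_tokens))
print(word_to_segment_cells(query, spam_segments),
      word_to_segment_cells(query, relevant_segments))
```

Under word-to-word counting the spam passage scores higher than the relevant document; under word-to-segment counting it fills a single cell.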

Case Study: Proximity Queries


Excerpt from a web document:

Sonoma Valley Hospital - Provides inpatient, outpatient and continuing care to the community. (Sonoma)

Specialty Healthcare Services Inc. - Long term acute care hospitals with locations throughout the United States. Headquartered in California.

St. Bernardine Medical Center - A member of the Catholic Healthcare West (CHW), a not-for-profit corporation sponsored by several religious communities.

St. Dominic's Hospital - Manteca, CA - One of six hospitals in the St. Josephs Regional Health System, a member of Catholic Healthcare West, a co-sponsored health ministry serving California, Arizona, and Nevada.

St. Elizabeth Community Hospital - Medical staff of more than 50 primary care physicians and specialists. Affiliated with Catholic Healthcare West. (Red Bluff)

St. Francis Medical Center - Serving the health care and social needs - body, mind, and spirit - of the communities of Southeast Los Angeles. The Center is founded upon and advances the healing ministry of Christ in the tradition of service established by St. Vincent de Paul, St. Louise de Marillac and St. Elizabeth Ann Seton.

St. Helena Hospital/St. Helena Center for Health - Located in Northern California's Napa Valley, a fully accredited, nonprofit community hospital offering a full range of acute-care and wellness services and programs.

St. Joseph's Behavioral Health Center - The Center is a licensed nonprofit psychiatric hospital serving central California. (Stockton)

St. Josephs Medical Center - Offers a full range of comprehensive services for people of all ages. (Stockton)

St. Jude Medical Center - Located in Fullerton. Information on a wide variety of medical services and community education programs.

• Proximity queries are AND queries that require all the query terms to appear together within a certain distance.

• e.g. Find phrase “sonoma county medical services” within 500 words

• Some proximity queries ask to span over a large window size; but most existing Neu-IR models can only handle them in a tight window, e.g. 2 or 3 words apart (Hui et al. 2017).

• Word-to-Segment matching can resolve this issue by merging topically coherent words into a single segment and obtaining a stronger relevance signal from it.
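For reference, the proximity constraint itself can be checked with a sliding window. The sketch below assumes an unordered AND query over a pre-tokenized document; the function name `within_window` and the toy sentence are hypothetical, and this is not how any particular search engine implements proximity operators.

```python
def within_window(query_terms, tokens, window):
    """True if some span of at most `window` tokens contains every query term."""
    needed = set(query_terms)
    counts = {}  # query term -> occurrences inside the current window
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink from the left while the span is wider than the window.
        while right - left + 1 > window:
            old = tokens[left]
            if old in needed:
                counts[old] -= 1
                if counts[old] == 0:
                    del counts[old]
            left += 1
        if len(counts) == len(needed):
            return True
    return False

tokens = ("sonoma valley hospital provides care to the county "
          "with medical staff and services").split()
query = ["sonoma", "county", "medical", "services"]
print(within_window(query, tokens, 13), within_window(query, tokens, 3))
```

The query terms here are spread over 13 tokens, so a tight window of 2 or 3 words misses the match even though a wide window finds it.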

The Proposed Work

• In the indexing phase for the search engine,

• Visualize matching between query and segments

• “Color” the grid with features that follow good information retrieval principles

• Then, in the retrieval phase,

• Use an end-to-end deep learning model to obtain the final ranking scores


Segmentation

• TextTiling by Hearst

• Hearst, Marti A. "Multi-paragraph segmentation of expository text." ACL’94.

• Query independent, sequentially laid topics


[Figure: smoothed similarity score plotted along the token sequence]

• 1. Token sequence generation

• A token sequence is like a fixed-length pseudo-sentence

• 2. Similarity computation

• The similarity for two neighboring sequences is calculated over the two windows to which they each belong.

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities

took place . The jury further said in term-end presentments that the City Executive Committee , which had over-all charge

of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was

conducted . The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of

possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .

`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number

of voters and the size of this city '' . The jury said it did find that many of Georgia's registration and

election laws `` are outmoded or inadequate and often ambiguous '' . It recommended that Fulton legislators act `` to have these laws studied

and revised to the end of modernizing


• 3. Boundary determination

• Boundaries are placed at positions where the similarity score drops sharply
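The three steps above can be sketched as follows. This is a deliberately simplified stand-in for TextTiling, not Hearst's algorithm or the NLTK implementation: it skips smoothing and stopword removal, measures depth against the highest score on each side rather than the nearest peak, and uses a fixed `cutoff`. The parameter names and defaults are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def text_tiling(tokens, seq_len=4, block=2, cutoff=0.5):
    # 1. Token sequence generation: fixed-length pseudo-sentences.
    seqs = [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]
    # 2. Similarity computation: at each gap, compare the block of
    #    sequences on the left with the block on the right.
    sims = []
    for gap in range(1, len(seqs)):
        left = Counter(w for s in seqs[max(0, gap - block):gap] for w in s)
        right = Counter(w for s in seqs[gap:gap + block] for w in s)
        sims.append(cosine(left, right))
    # 3. Boundary determination: a gap is a boundary when its score sits
    #    far below the best scores on both sides (simplified depth score).
    boundaries = []
    for i, s in enumerate(sims):
        depth = (max(sims[:i + 1]) - s) + (max(sims[i:]) - s)
        if depth > cutoff:
            boundaries.append(i + 1)  # boundary falls before sequence i + 1
    # Cut the token list at the chosen boundaries.
    segments, start = [], 0
    for b in boundaries:
        segments.append(tokens[start * seq_len:b * seq_len])
        start = b
    segments.append(tokens[start * seq_len:])
    return segments

tokens = ("tax board " * 4 + "franchise food " * 4).split()
print(text_tiling(tokens))
```

On this toy input the similarity between neighboring blocks falls to zero where the vocabulary changes, so the text is cut into a "tax board" segment and a "franchise food" segment.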

Standardize Matrix Dimensions

• All query-document interaction matrices are standardized into the same dimension

• Pad queries up to the maximum query length


• For short documents, add empty segments

• For long documents, merge the last segments

• A compact representation:

• More than 90% of documents contain 30 or fewer segments in ClueWeb09

• More than 80% of documents contain 30 or fewer segments in LETOR
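A minimal sketch of this standardization, under stated assumptions: each cell is a single number here (the real cells carry several channels), "merging" the trailing segments is modeled as summing them into the last kept column, and the names `standardize`, `n_q`, and `n_b` are chosen for the example.

```python
def standardize(matrix, n_q, n_b):
    """Resize a (query terms x segments) matrix to exactly n_q x n_b."""
    rows = []
    for row in matrix[:n_q]:
        row = list(row)
        if len(row) > n_b:
            # Long document: fold every overflow segment into the last kept one.
            # Assumption: "merge" is modeled as summing the cell values.
            row = row[:n_b - 1] + [sum(row[n_b - 1:])]
        # Short document: append empty segments.
        rows.append(row + [0] * (n_b - len(row)))
    # Short query: pad with all-zero rows up to the maximum query length.
    rows += [[0] * n_b for _ in range(n_q - len(rows))]
    return rows

long_doc = [[1, 2, 3, 4, 5], [0, 1, 0, 1, 0]]
short_doc = [[1, 2]]
print(standardize(long_doc, n_q=3, n_b=3))
print(standardize(short_doc, n_q=2, n_b=4))
```

Every query-document pair then yields a matrix of the same shape, which is what lets a fixed-size network consume them.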

[Figure: query-document interaction matrices before and after standardization]

Color the Cells

• Each cell represents an interaction between a query term and a document segment

• Each cell consists of three channels

• It is like the canonical RGB colors in multimedia and image research


• First channel: term frequency, $\mathrm{tf}(w_i, B_j)$

• Second channel: inverse document frequency, $\mathrm{idf}(w_i) \times \mathbb{I}_{B_j}(w_i)$

• Third channel: word similarity based on the distributional hypothesis, $\max_{t \in B_j} e^{-(v_{w_i} - v_t)^2}$

• In theory, more features can be included in the framework
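The three channels of one cell can be computed as in the sketch below. It assumes that segments are token lists, that `idf` and `vectors` are precomputed dictionaries, that the squared difference of two word vectors means squared Euclidean distance, and that an empty segment gets similarity 0. The function name and the toy vectors are hypothetical.

```python
import math

def cell_channels(term, segment, idf, vectors):
    """Return (tf, idf * indicator, max similarity) for one term and segment."""
    # Channel 1: how often the query term occurs in the segment.
    tf = segment.count(term)
    # Channel 2: idf of the term, kept only if the term occurs in the segment.
    idf_channel = idf.get(term, 0.0) * (1.0 if term in segment else 0.0)
    # Channel 3: similarity to the closest word of the segment in vector space.
    sim = 0.0
    if term in vectors:
        for t in segment:
            if t in vectors:
                sq_dist = sum((a - b) ** 2
                              for a, b in zip(vectors[term], vectors[t]))
                sim = max(sim, math.exp(-sq_dist))
    return tf, idf_channel, sim

# Toy two-dimensional word vectors and idf values, made up for illustration.
vectors = {"tax": [1.0, 0.0], "levy": [1.0, 1.0], "food": [0.0, 3.0]}
idf = {"tax": 2.0}
exact = cell_channels("tax", ["tax", "tax", "food"], idf, vectors)
soft = cell_channels("tax", ["levy", "food"], idf, vectors)
print(exact, soft)
```

The third channel stays non-zero when the segment contains a related word but not the query term itself, which the first two channels cannot express.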

DeepTileBars: The Neural Network

[Figure: DeepTileBars architecture. Labeled stages: word-to-segment matching, relevance detection, detection results, relevance aggregation, relevance decision. Output: relevance score.]

Bagging of Various Kernel Sizes

• Bagging of various kernel sizes

• Each segment corresponds to a topic

• Super topics are created by combining adjacent topics

• Relevance evaluation is performed at different granularity levels
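The effect of different kernel widths can be illustrated without a neural network. In the sketch below a plain sum over k adjacent segments stands in for a learned convolution kernel, which is an assumption made purely for illustration; the function name and the hit counts are hypothetical.

```python
def windowed_signal(row, k):
    """Aggregate k adjacent segment cells; yields len(row) - k + 1 values."""
    return [sum(row[i:i + k]) for i in range(len(row) - k + 1)]

row = [0, 2, 1, 0, 0, 3]  # hypothetical hit counts for one query term
views = {k: windowed_signal(row, k) for k in (1, 3, 5)}
for k, view in views.items():
    print(k, view)
```

Width 1 looks at single topics, while wider windows look at "super topics" built from adjacent segments, so the same document is evaluated at several granularities.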

[Figure panels: 1, 3, 5, and 10 segments per cell]

Query: California franchise tax board (TREC Web 2011 q116)

Document: clueweb09-enwp00-69-13554

Overview of DeepTileBars


In summary:

$z^0_k = \mathrm{CNN}_k(I), \quad k = 1, 2, 3, \ldots, l$

$z^1_k = \mathrm{LSTM}_k(z^0_k), \quad k = 1, 2, 3, \ldots, l$

$s = \mathrm{MLP}([z^1_1, z^1_2, \ldots, z^1_k, \ldots, z^1_l])$

Objective function:

$J(\Theta) = \sum_{(q, d^+, d^-)} -\log \frac{1}{1 + e^{-(s(d^+, \Theta) - s(d^-, \Theta))}}$

Burges, Chris, et al. "Learning to rank using gradient descent." ICML’05
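The pairwise objective can be evaluated directly from two relevance scores. The sketch below covers only the loss, not the CNN, LSTM, or MLP that produce the scores; the function names are assumptions, and the branch on the sign of the difference is a standard numerical-stability rewrite of the same expression.

```python
import math

def pairwise_loss(score_pos, score_neg):
    """-log(1 / (1 + exp(-(s_pos - s_neg)))), computed stably."""
    diff = score_pos - score_neg
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, log(1 + e^{-diff}) = -diff + log(1 + e^{diff}).
    return -diff + math.log1p(math.exp(diff))

def objective(score_pairs):
    """Sum the loss over (relevant score, irrelevant score) pairs."""
    return sum(pairwise_loss(pos, neg) for pos, neg in score_pairs)

print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

The loss shrinks as the relevant document's score rises above the irrelevant one's, and it equals log 2 when the two scores tie.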

Experimental Setup

• Tasks: Ad-hoc Retrieval

• Datasets

• Text REtrieval Conference (TREC) Web Track 2010-2012 (Clarke et al. 2012)

• Collection: ClueWeb09 CatB (50 million English webpages, crawled by CMU from January to February 2009)

• 150 queries + 38,948 judged documents

• LETOR-MQ2008 (Qin and Liu 2013)

• Collection: Gov2 (25 million webpages, crawled from .gov sites in early 2004)

• 784 queries + 15,211 judged documents

• Evaluation Metrics

• NDCG (Järvelin and Kekäläinen 2002), ERR (Chapelle et al. 2009)

• Precision

$\mathrm{DCG} = \sum_{i=1}^{n} \frac{rel_i}{\log_2(i + 1)}$

$\mathrm{ERR} = \sum_{i=1}^{n} \frac{1}{i} \prod_{j=1}^{i-1} (1 - R_j) R_i$
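Both metrics are short to compute. In the sketch below, `err` takes the per-rank values $R_i$ directly, since how relevance grades map to $R_i$ is not given on the slide, and `ndcg` builds its ideal ranking from the supplied list only, which assumes the list holds every judged document for the query. The function names are chosen for the example.

```python
import math

def dcg(rels):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(rels, k):
    """DCG@k divided by the DCG@k of the ideal reordering of `rels`."""
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal else 0.0

def err(stop_probs):
    """Expected reciprocal rank, given R_i for each rank i."""
    total, not_stopped = 0.0, 1.0
    for i, r in enumerate(stop_probs, start=1):
        total += not_stopped * r / i   # (1/i) * prod_{j<i}(1 - R_j) * R_i
        not_stopped *= 1.0 - r
    return total

print(dcg([3, 2, 0]), ndcg([0, 2, 3], k=3), err([0.5, 0.5]))
```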

Experiments: Baseline Systems

• Traditional IR approaches

• BM25 (Robertson and Zaragoza, 2009)

• Language modeling (Zhai and Lafferty, 2017)

• TREC Best Runs:

• Sophisticated term weighting (Dinçer and Karaoglan, 2010; Elsayed, 2010)

• Simple neural nets but with abundant feature engineering (Boytsov and Belova 2011; Al-akashi and Inkpen 2012)

• Neu-IR models

• DRMM (Guo et al. 2016)

• MatchPyramid (Pang et al. 2016)

• DeepRank (Pang et al. 2017)

• Duet (Mitra et al. 2017)

• HiNT (Fan et al. 2018)

• Variants of DeepTileBars: word-to-word vs. word-to-segment, different kernel sizes

Results: TREC 2010-2012 Web Tracks

System                          ERR@20  NDCG@20  P@20
TREC-Best                       0.188   0.236    0.382
BM25                            0.102   0.137    0.253
LM                              0.118   0.166    0.297
DRMM                            0.127   0.184    0.346
MatchPyramid                    0.113   0.125    0.228
DeepRank                        0.127   0.134    0.224
HiNT                            0.157   0.205    0.322
DeepTileBars(n_q * 1)           0.140   0.207    0.368
DeepTileBars(n_q * 3)           0.150   0.212    0.369
DeepTileBars(n_q * 5)           0.146   0.211    0.371
DeepTileBars(n_q * 7)           0.142   0.207    0.366
DeepTileBars(n_q * 9)           0.147   0.213    0.372
DeepTileBars(w2w, all kernels)  0.110   0.123    0.248
DeepTileBars(w2s, all kernels)  0.168   0.229    0.384

The bigger the number, the better the search engine effectiveness.

Results: LETOR-MQ 2008


System P@5 P@10 NDCG@5 NDCG@10

BM25 0.337 0.245 0.461 0.220

LM 0.323 0.236 0.441 0.206

DRMM 0.337 0.242 0.466 0.219

MatchPyramid 0.329 0.239 0.442 0.211

DeepRank 0.359 0.252 0.496 0.240

Duet 0.341 0.240 0.471 0.216

HiNT 0.367 0.255 0.501 0.244

DeepTileBars 0.427 0.320 0.553 0.256

The bigger the number, the better the search engine effectiveness.

Conclusions

• A new, lightweight Neu-IR model inspired by classical work in term distribution visualization

• We propose to use:

• Word-to-segment matching

• Bagging of multiple CNNs

• Why does it work?

• It is effectively a hierarchical model of document structure

• Topic → super topic → super-super topic → … → document

• It handles proximity queries and spam documents better

• A possible new direction for Neu-IR: visualize relevance signals by first turning text into images and then applying deep neural networks

Thank you!

• Contacts:

• [email protected] (Zhiwen Tang)

• [email protected] (Grace Hui Yang)

• Slides: http://infosense.cs.georgetown.edu/~zt79/aaai2019_v3.key

• Code: https://github.com/smt-HS/DeepTileBars-release

• Implementation of TextTiling in NLTK

• https://www.nltk.org/_modules/nltk/tokenize/texttiling.html

• InfoSense Website:

• http://infosense.cs.georgetown.edu/

