DeepTilebars: Visualizing Term Distribution for Neural Information...

DeepTilebars: Visualizing Term Distribution for Neural Information Retrieval Zhiwen Tang Grace Hui Yang InfoSense Department of Computer Science Georgetown University AAAI 2019 @ Honolulu, Hawaii


  • DeepTilebars: Visualizing Term Distribution for Neural Information Retrieval

    Zhiwen Tang, Grace Hui Yang

    InfoSense, Department of Computer Science

    Georgetown University

    AAAI 2019 @ Honolulu, Hawaii

  • Neural Information Retrieval (Neu-IR)


    • Ad-hoc Information Retrieval: to satisfy a user’s information need by retrieving and ranking documents from a document collection. E.g. Google, Lucene

    [Figure: a query is matched against documents to estimate relevance]

    • Neu-IR: Retrieval systems based on deep neural networks

  • What is Relevance: A Visualization Perspective


    scrollbar-based visualization, by Byrd, DL’99

    Fisheye interface, by Hornbæk and Frøkjær, CHI’01

    TileBars, by Hearst, CHI’95

    • Visualizations offer efficient, direct, and informative feedback to a search engine user.

    • What makes a visualization practically valuable to humans might also be valuable to a deep neural network, because the network resembles human neurons

    • Can we prove that?

  • TileBars

    • TileBars visualizes how query terms are distributed within a document and produces a compact image that explicitly shows the relevance (the matches) to the user.

    • Proposed by Marti A. Hearst in the 90s. (Hearst, CHI’95)

    • Segment documents by topic

    • Visualize the query term hits within each segment

    • Display the query terms vertically and the document segments horizontally

    • The darker the cell, the more relevant the text

    • It is called Word-to-Segment Matching
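
    The grid described above can be sketched in a few lines of Python. This is only an illustration, not Hearst's implementation: the function name `tilebar`, the four-level `SHADES` scale, and the toy segments are all assumptions, and real TileBars segments come from topical segmentation rather than hand-made lists.

```python
# Four shading levels from lightest to darkest (an arbitrary, assumed scale).
SHADES = ".-+#"

def tilebar(query_terms, segments):
    """Return one text row per query term and one character per segment.

    Darker characters mean more hits of that term in that segment.
    """
    rows = []
    for term in query_terms:
        counts = [segment.count(term) for segment in segments]
        # Clamp the count so that very frequent terms use the darkest shade.
        rows.append("".join(SHADES[min(c, len(SHADES) - 1)] for c in counts))
    return rows

# Toy document, already split into three segments (hypothetical data).
segments = [["tax", "board", "tax"], ["franchise"], ["weather"]]
grid = tilebar(["franchise", "tax"], segments)
print("\n".join(grid))
```

    Each row is a query term and each column is a document segment, which matches the vertical and horizontal layout listed on this slide.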


  • Matching Patterns


    Document labels, in the order shown: Relevant, Relevant, Irrelevant, Irrelevant

    en0005-66-32430: January 2008 :: California Tax Attorney Blog California Tax Attorney Blog Published by California Tax Lawyer Mitchell A. Port Blog Home Firm Website Practice Areas Contact Us Home > January 2008 January 30, 2008 Tax Avoidance Or Tax Evasion? Employment Tax Evasion Schemes California employers: be

    en0010-68-18656: Significant Changes to U.S. Taxation of Expatriating Citizens and Long-Term Residents. [21K] Printed (pdf) version. [172K] October 2005 United States and Sweden Sign New Income Tax Treaty Protocol. [8K] Printed (pdf) version. [127K] November 2004 IRS Proposes Section 482 Regulations on Intangible Property and

    en0004-27-03654: Franchise business opportunity - Fast food franchise opportunity, Food franchise opportunity,International franchise opportunity, Commercial janitorial franchise opportunity, Small business franchise,Quiznos franchise for sale - Franchise business opportunity Franchise opportunity in Canada Carpet flooring

    en0004-27-03654: Franchise business opportunity - Fast food franchise opportunity, Food franchise opportunity,International franchise opportunity, Commercial janitorial franchise opportunity, Small business franchise,Quiznos franchise for sale - Franchise business opportunity Franchise opportunity in Canada Carpet flooring

    [Figure: each document shown next to its word-to-word matching and its word-to-segment matching]

    Query: California Franchise Tax Board

    • Word-to-word matching: difficult to distinguish relevant from irrelevant

    • Word-to-segment matching: easy to observe consecutive matching blocks

    Given a query, common relevance matching patterns include (Skorochod’ko 1971; Hearst 1997):

    • Chained

    • Ringed

    • Monolith

    • Piecewise

    • Hierarchical

    [Figure: example TileBars for the Chained, Ringed, Monolith, and Piecewise patterns]

  • Case Study: Spam & Documents


    Excerpt from a web document:

    Fast food franchise opportunity Restaurant franchise opportunity franchise opportunity uk Top franchise opportunity Small business franchise opportunity Canadian franchise opportunity New franchise opportunity Retail franchise opportunity Home based franchise opportunity Home based franchise opportunity business franchise opportunity starbucks Start up franchise opportunity Top ten franchise opportunity National franchise business opportunity show Fitness franchise opportunity franchise business opportunity show Entrepreneur franchise opportunity franchise opportunity in canada franchise business opportunity for sale franchise buying opportunity franchise opportunity canada Car wash franchise opportunity Us franchise opportunity Business directory franchise opportunity New franchise business opportunity Carpet flooring franchise opportunity Business franchise in opportunity Low cost franchise opportunity Child franchise opportunity Learning franchise opportunity Food franchise opportunity Coffee franchise opportunity International franchise opportunity Business franchise massage opportunity Small franchise opportunity

    • Web documents often contain scripts, advertisements, and spam.

    • Using word-to-word matching, these irrelevant documents may appear to be even “more” relevant to a query:

    • E.g. For query “California Franchise Tax Board”, we can see that “franchise” appears repeatedly in a long spam passage.

    • Luckily, using word-to-segment matching, a spam passage only contributes to one cell

    • Easier to distinguish them from the relevant ones
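
    A small Python sketch of the contrast above, using made-up data. The names `word_hits`, `segment_hits`, and `flatten`, and both toy documents, are assumptions for illustration; the point is only that repeated spam inflates word-level counts but fills a single segment cell.

```python
def flatten(segments):
    """Concatenate the segments of a document into one token list."""
    return [token for segment in segments for token in segment]

def word_hits(tokens, term):
    """Word-to-word view: every occurrence of the term counts."""
    return sum(1 for token in tokens if token == term)

def segment_hits(segments, term):
    """Word-to-segment view: a segment counts once, however often it repeats the term."""
    return sum(1 for segment in segments if term in segment)

# Hypothetical spam document: one long passage, hence one segment.
spam_segments = [
    "fast food franchise opportunity restaurant franchise opportunity "
    "top franchise opportunity small franchise opportunity".split(),
]
# Hypothetical relevant document: three topical segments.
relevant_segments = [
    "california franchise tax board overview".split(),
    "the franchise tax board collects state income tax".split(),
    "contact the board about a franchise tax refund".split(),
]

for name, segs in [("spam", spam_segments), ("relevant", relevant_segments)]:
    print(name, word_hits(flatten(segs), "franchise"), segment_hits(segs, "franchise"))
```

    In this toy case the spam document wins on word-level hits (4 against 3) but loses on segment-level hits (1 against 3).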

  • Case Study: Proximity Queries


    Excerpt from a web document:

    Sonoma Valley Hospital - Provides inpatient, outpatient and continuing care to the community. (Sonoma)

    Specialty Healthcare Services Inc. - Long term acute care hospitals with locations throughout the United States. Headquartered in California.

    St. Bernardine Medical Center - A member of the Catholic Healthcare West (CHW), a not-for-profit corporation sponsored by several religious communities.

    St. Dominic's Hospital - Manteca, CA - One of six hospitals in the St. Josephs Regional Health System, a member of Catholic Healthcare West, a co-sponsored health ministry serving California, Arizona, and Nevada.

    St. Elizabeth Community Hospital - Medical staff of more than 50 primary care physicians and specialists. Affiliated with Catholic Healthcare West. (Red Bluff)

    St. Francis Medical Center - Serving the health care and social needs - body, mind, and spirit - of the communities of Southeast Los Angeles. The Center is founded upon and advances the healing ministry of Christ in the tradition of service established by St. Vincent de Paul, St. Louise de Marillac and St. Elizabeth Ann Seton.

    St. Helena Hospital/St. Helena Center for Health - Located in Northern California's Napa Valley, a fully accredited, nonprofit community hospital offering a full range of acute-care and wellness services and programs.

    St. Joseph's Behavioral Health Center - The Center is a licensed nonprofit psychiatric hospital serving central California. (Stockton)

    St. Josephs Medical Center - Offers a full range of comprehensive services for people of all ages. (Stockton)

    St. Jude Medical Center - Located in Fullerton. Information on a wide variety of medical services and community education programs.

    • Proximity queries are AND queries that require all the query terms to appear together within a certain distance.

    • e.g. Find phrase “sonoma county medical services” within 500 words

    • Some proximity queries span a large window, but most existing Neu-IR models can only handle a tight window, e.g. terms 2 or 3 words apart (Hui et al. 2017).

    • Word-to-Segment matching can resolve this issue by merging topically coherent words into a single segment and obtaining a stronger relevance signal from it.
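
    To make the definition of a proximity query concrete, here is a minimal Python sketch. The function name `satisfies_proximity` and the toy token list are assumptions; it is a brute-force check, not how any particular search engine evaluates such queries.

```python
def satisfies_proximity(tokens, terms, window):
    """Return True if some run of `window` consecutive tokens contains every term."""
    needed = set(terms)
    # When the document is shorter than the window, check it once as a whole.
    last_start = max(1, len(tokens) - window + 1)
    for start in range(last_start):
        if needed <= set(tokens[start:start + window]):
            return True
    return False

# Hypothetical excerpt; the query terms are spread over 12 tokens.
tokens = ("sonoma valley hospital provides care county offices "
          "list medical and dental services").split()
query = ["sonoma", "county", "medical", "services"]

print(satisfies_proximity(tokens, query, window=3))
print(satisfies_proximity(tokens, query, window=12))
```

    With a tight window the AND condition fails even though every term is present, which is the situation the slide describes for models limited to a few words of context.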

  • The Proposed Work

    • In the indexing phase for the search engine,

    • Visualize matching between query and segments

    • “Color” the grid with features that follow good information retrieval principles

    • Then, in the retrieval phase,

    • Use an end-to-end deep learning model to obtain the final ranking scores


  • Segmentation

    • TextTiling by Hearst

    • Hearst, Marti A. "Multi-paragraph segmentation of expository text." ACL’94.

    • Query independent, sequentially laid topics

    [Figure: smoothed similarity score plotted over the token sequence]

    • 1. Token sequence generation

    • A token sequence is like a fixed-length pseudo sentence

    • 2. Similarity computation

    • The similarity for two neighboring sequences is calculated over the two windows to which they each belong.

    [Figure: an example text split into token sequences, one per line]

    The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities

    took place . The jury further said in term-end presentments that the City Executive Committee , which had over-all charge

    of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was

    conducted . The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of

    possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .

    `` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number

    of voters and the size of this city '' . The jury said it did find that many of Georgia's registration and

    election laws `` are outmoded or inadequate and often ambiguous '' . It recommended that Fulton legislators act `` to have these laws studied

    and revised to the end of modernizing

    • 3. Boundary determination

    • Boundaries are discovered at positions where similarity scores dramatically drop
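
    The three steps can be sketched in plain Python as follows. This is a simplified illustration, not Hearst's algorithm as published or the NLTK implementation: smoothing is omitted, the boundary threshold `min_depth` is a fixed number chosen by hand, and all function names and the toy text are assumptions.

```python
import math
from collections import Counter

def token_sequences(tokens, w):
    """Step 1: cut the token stream into fixed-length pseudo sentences."""
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def gap_scores(seqs, k):
    """Step 2: at each gap, compare the k sequences before it with the k after it."""
    scores = []
    for gap in range(1, len(seqs)):
        left = Counter(t for s in seqs[max(0, gap - k):gap] for t in s)
        right = Counter(t for s in seqs[gap:gap + k] for t in s)
        scores.append(cosine(left, right))
    return scores

def boundaries(scores, min_depth):
    """Step 3: keep valleys whose drop from the neighboring peaks is large enough."""
    cuts = []
    for i, score in enumerate(scores):
        falls_in = i == 0 or scores[i - 1] > score
        rises_out = i == len(scores) - 1 or scores[i + 1] >= score
        if not (falls_in and rises_out):
            continue
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        depth = (scores[left] - score) + (scores[right] - score)
        if depth >= min_depth:
            cuts.append(i + 1)  # a boundary before sequence i + 1
    return cuts

# Toy text with an obvious topic shift halfway through (hypothetical data).
tokens = ["tax", "board", "tax", "form"] * 3 + ["hospital", "care", "nurse", "care"] * 3
seqs = token_sequences(tokens, w=4)
scores = gap_scores(seqs, k=2)
cuts = boundaries(scores, min_depth=0.5)
print(cuts)
```

    The similarity is highest inside each topic and falls to zero at the gap between the third and fourth sequence, so that gap is the only boundary reported.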

  • Standardize Matrix Dimensions

    • All query-document interaction matrices are standardized into the same dimension

    • Pad queries up to the maximum query length

    • For short documents, add empty segments

    • For long documents, merge the last segments

    • A compact representation:

    • More than 90% of documents contain 30 or fewer segments in Clueweb09

    • More than 80% of documents contain 30 or fewer segments in LETOR
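
    A minimal Python sketch of this standardization, under stated assumptions: the function name `standardize` is hypothetical, empty segments and padded query rows are filled with zeros, and "merging the last segments" is approximated by summing the overflowing cells into the final column. The actual model may merge the segment text and recompute the features instead.

```python
def standardize(matrix, n_q, n_b, empty=0.0):
    """Bring a query-by-segment matrix to exactly n_q rows and n_b columns.

    `matrix` holds one row per query term and is assumed to have at most n_q rows.
    """
    out = []
    for row in matrix:
        if len(row) > n_b:
            # Long document: fold everything past column n_b - 1 into the last cell.
            row = row[:n_b - 1] + [sum(row[n_b - 1:])]
        else:
            # Short document: append empty segments.
            row = row + [empty] * (n_b - len(row))
        out.append(row)
    # Short query: append empty rows up to the maximum query length.
    out.extend([empty] * n_b for _ in range(n_q - len(out)))
    return out

long_doc = standardize([[1, 2, 3, 4, 5]], n_q=2, n_b=3)
short_doc = standardize([[1]], n_q=1, n_b=3)
print(long_doc)
print(short_doc)
```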

    [Figure: query-by-segment matrices before and after standardization]

  • Color the Cells

    • Each cell represents an interaction between a query term and a document segment

    • Each cell consists of three channels

    • It is like the canonical RGB colors in multimedia and image research

    • First channel: term frequency, $tf(w_i, B_j)$

    • Second channel: inverse document frequency, $idf(w_i) \times \mathbb{I}_{B_j}(w_i)$

    • Third channel: word similarity based on the distributional hypothesis, $\max_{t \in B_j} e^{-(v_{w_i} - v_t)^2}$

    • In theory, more features can be included in the framework
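
    The three channels of one cell could be computed roughly as below. This is a sketch under assumptions: the function names, the toy `idf` table, and the two-dimensional word vectors are invented for illustration, and the squared difference between vectors is read as squared Euclidean distance.

```python
import math

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cell_channels(term, segment, idf, vectors):
    """Return (tf, idf * indicator, max embedding similarity) for one grid cell."""
    tf = segment.count(term)
    # The indicator zeroes the idf when the term is absent from the segment.
    idf_hit = idf[term] if term in segment else 0.0
    similarity = max(
        (math.exp(-squared_distance(vectors[term], vectors[t]))
         for t in segment if t in vectors),
        default=0.0,
    )
    return tf, idf_hit, similarity

# Hypothetical statistics and embeddings.
idf = {"tax": 2.5}
vectors = {"tax": [1.0, 0.0], "levy": [1.0, 0.0], "cat": [0.0, 3.0]}

exact = cell_channels("tax", ["tax", "cat", "tax"], idf, vectors)
synonym = cell_channels("tax", ["levy", "cat"], idf, vectors)
unrelated = cell_channels("tax", ["cat"], idf, vectors)
print(exact, synonym, unrelated)
```

    The second segment contains no exact match, so the first two channels are zero, but the third channel still lights up because a word with an identical vector is present.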

  • DeepTileBars: The Neural Network

    [Figure: DeepTileBars architecture. Stages, as labeled: Word-to-segment matching, Relevance Detection, Detection Results, Relevance Aggregation, Relevance Decision, ending in a relevance score]

  • Bagging of Various Kernel Sizes

    • Bagging of various kernel sizes

    • Each segment corresponds to a topic

    • Super topics are created by combining adjacent topics

    • Relevance evaluation is performed at different granularity levels


    [Figure: the same document rendered at 1, 3, 5, and 10 segments per cell]

    Query: California franchise tax board (TREC Web 2011 q116)

    Document: clueweb09-enwp00-69-13554
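
    The idea of viewing the same row at several granularities can be illustrated with a sliding window over adjacent segments. This is not the model itself: a CNN kernel learns its weights, whereas the hypothetical `super_topics` helper below uses a plain sum, and the toy row is invented.

```python
def super_topics(row, k):
    """Combine every run of k adjacent segments into one super-topic cell.

    Without padding, a row of n segments yields n - k + 1 cells.
    """
    return [sum(row[i:i + k]) for i in range(len(row) - k + 1)]

# Hits of one query term across six segments (hypothetical data).
row = [1, 0, 2, 0, 0, 3]
views = {k: super_topics(row, k) for k in (1, 3, 5)}
for k, cells in views.items():
    print(k, cells)
```

    Small windows keep the position of each match, while large windows summarize broad regions of the document, which is the motivation for combining several kernel sizes.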

  • Overview of DeepTileBars

    [Figure: DeepTileBars architecture, repeated from the earlier slide]

    In summary:

    $z_k^0 = \mathrm{CNN}_k(I), \quad k = 1, 2, 3, \ldots, l$

    $z_k^1 = \mathrm{LSTM}_k(z_k^0), \quad k = 1, 2, 3, \ldots, l$

    $s = \mathrm{MLP}([z_1^1, z_2^1, \ldots, z_k^1, \ldots, z_l^1])$

    Objective function:

    $J(\Theta) = \sum_{(q, d^+, d^-)} -\log \frac{1}{1 + e^{-(s(d^+, \Theta) - s(d^-, \Theta))}}$

    Burges, Chris, et al. "Learning to rank using gradient descent." ICML’05
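
    The objective can be evaluated numerically with a few lines of Python. This sketch only computes the loss for given scores; the function names are assumptions, and the scores would in practice come from the network rather than being typed in by hand.

```python
import math

def pairwise_loss(score_pos, score_neg):
    """-log(1 / (1 + exp(-(s+ - s-)))), rewritten as log(1 + exp(-(s+ - s-)))."""
    return math.log1p(math.exp(-(score_pos - score_neg)))

def objective(score_pairs):
    """Sum the pairwise loss over (relevant score, irrelevant score) pairs."""
    return sum(pairwise_loss(pos, neg) for pos, neg in score_pairs)

# Hypothetical scores: the first pair is ranked correctly, the second is not.
print(pairwise_loss(3.0, 0.0))
print(pairwise_loss(0.0, 3.0))
print(objective([(3.0, 0.0), (0.0, 3.0)]))
```

    The loss shrinks toward zero as the relevant document's score rises above the irrelevant one, and it equals log 2 when the two scores are tied.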

  • Experimental Setup

    • Tasks: Ad-hoc Retrieval

    • Datasets

    • Text REtrieval Conference (TREC) Web Track 2010-2012 (Clarke et al. 2012)

    • Collection: Clueweb09 CatB (50 million English webpages, crawled by CMU from January to February 2009)

    • 150 queries + 38,948 judged documents

    • LETOR-MQ2008 (Qin and Liu 2013)

    • Collection: Gov2 (25 million webpages, crawled from .gov sites in early 2004)

    • 784 queries + 15,211 judged documents

    • Evaluation Metrics

    • NDCG (Järvelin and Kekäläinen 2002), ERR (Chapelle et al. 2009)

    • Precision

    $\mathrm{DCG} = \sum_{i=1}^{n} \frac{rel_i}{\log_2(i + 1)}$

    $\mathrm{ERR} = \sum_{i=1}^{n} \frac{1}{i} \prod_{j=1}^{i-1} (1 - R_j) R_i$
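
    Both formulas translate directly into Python. In this sketch the function names are assumptions, `ndcg` normalizes DCG by the DCG of the ideal ordering, and `err` takes the per-rank satisfaction probabilities R_i as input, since the slide does not say how they are derived from relevance grades.

```python
import math

def dcg(rels):
    """Discounted cumulative gain over a ranked list of relevance grades."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(rels):
    """DCG divided by the DCG of the best possible ordering of the same grades."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

def err(probs):
    """Expected reciprocal rank, given the probability R_i that rank i satisfies the user."""
    total, still_looking = 0.0, 1.0
    for i, r in enumerate(probs, start=1):
        total += still_looking * r / i
        still_looking *= 1.0 - r
    return total

print(dcg([3, 2]), ndcg([2, 3]), err([0.5, 0.5]))
```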

  • Experiments: Baseline Systems

    • Traditional IR approaches

    • BM25 (Robertson and Zaragoza, 2009)

    • Language modeling (Zhai and Lafferty, 2017)

    • TREC Best Runs:

    • Sophisticated term weighting (Dinçer and Karaoglan, 2010; Elsayed, 2010)

    • Simple neural nets but with abundant feature engineering (Boytsov and Belova 2011; Al-akashi and Inkpen 2012)

    • Neu-IR models

    • DRMM (Guo et al. 2016)

    • MatchPyramid (Pang et al. 2016)

    • DeepRank (Pang et al. 2017)

    • Duet (Mitra et al. 2017)

    • HiNT (Fan et al. 2018)

    • Variants of DeepTileBars: word-to-word vs. word-to-segment, different kernel sizes

  • Results: TREC 2010-2012 Web Tracks

    System ERR@20 NDCG@20 P@20

    TREC-Best 0.188 0.236 0.382

    BM25 0.102 0.137 0.253

    LM 0.118 0.166 0.297

    DRMM 0.127 0.184 0.346

    MatchPyramid 0.113 0.125 0.228

    DeepRank 0.127 0.134 0.224

    HiNT 0.157 0.205 0.322

    DeepTileBars(n_q * 1) 0.140 0.207 0.368

    DeepTileBars(n_q * 3) 0.150 0.212 0.369

    DeepTileBars(n_q * 5) 0.146 0.211 0.371

    DeepTileBars(n_q * 7) 0.142 0.207 0.366

    DeepTileBars(n_q * 9) 0.147 0.213 0.372

    DeepTileBars(w2w, all kernels) 0.110 0.123 0.248

    DeepTileBars(w2s, all kernels) 0.168 0.229 0.384

    The bigger the number, the better the search engine effectiveness.

  • Results: LETOR-MQ 2008


    System P@5 P@10 NDCG@5 NDCG@10

    BM25 0.337 0.245 0.461 0.220

    LM 0.323 0.236 0.441 0.206

    DRMM 0.337 0.242 0.466 0.219

    MatchPyramid 0.329 0.239 0.442 0.211

    DeepRank 0.359 0.252 0.496 0.240

    Duet 0.341 0.240 0.471 0.216

    HiNT 0.367 0.255 0.501 0.244

    DeepTileBars 0.427 0.320 0.553 0.256

    The bigger the number, the better the search engine effectiveness.

  • Conclusions

    • A new, light-weight Neu-IR model inspired by classical work in term distribution visualization

    • We propose to use:

    • Word-to-segment matching

    • Bagging of multiple CNNs

    • Why does it work?

    • It is practically a hierarchical modeling of document structure

    • Topic - super topic - super super topic - ... - document

    • Better handles proximity queries & spam documents

    • A possible new direction for Neu-IR: visualize relevance signals by first turning texts into images, then using deep neural networks

  • Thank you!

    • Contacts:

    [email protected] (Zhiwen Tang)

    [email protected] (Grace Hui Yang)

    • Slides: http://infosense.cs.georgetown.edu/~zt79/aaai2019_v3.key

    • Code: https://github.com/smt-HS/DeepTileBars-release

    • Implementation of TextTiling in NLTK

    • https://www.nltk.org/_modules/nltk/tokenize/texttiling.html

    • InfoSense Website:

    • http://infosense.cs.georgetown.edu/
