Transcript of Question Answering over Implicitly Structured Web Content

Page 1: Question Answering over Implicitly Structured Web Content

Question Answering over Implicitly Structured Web Content

Eugene Agichtein* Emory University

Chris Burges Microsoft Research

Eric Brill Microsoft Research

* Research done while at Microsoft Research

Page 2: Question Answering over Implicitly Structured Web Content

Questions are Problematic for Web Search

What was the name of president Fillmore’s cat?

Who invented crocs? …

Page 3: Question Answering over Implicitly Structured Web Content

Web search: What was the name of president Fillmore’s cat?

Page 4: Question Answering over Implicitly Structured Web Content

Web Question Answering

Why are questions problematic for web search engines?

Search engines treat questions as keyword queries, ignoring both the semantic relationships between words and the explicitly stated information need

Poor performance for long (> 5 terms) queries

The problem is exacerbated when common keywords are included

Page 5: Question Answering over Implicitly Structured Web Content

… and millions more such tables and lists …

Page 6: Question Answering over Implicitly Structured Web Content

Implicitly Structured Web Content: HTML Tables, Lists

Product descriptions; lists of favorite things, “top 10” lists, etc.

HTML syntax (sometimes) reflects semantics: authors imply semantic relationships and entity types by grouping, so information about ambiguous entities can be inferred from others in the same column

Millions of HTML tables and lists exist on the “surface” web alone, with no common schema; keyword queries are the primary access method

How can this structured content be exploited for good (e.g., for question answering) at web scale? (See the sketch below.)
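
To make the column-grouping idea concrete, here is a small illustrative sketch (not from the original system) that collects HTML table cells by column using only the Python standard library, so that entities sharing a column can be treated as instances of the same implicit type:

```python
# Illustrative sketch only: group HTML table cells by column index, so that
# entities in the same column can be treated as the same implicit type.
from html.parser import HTMLParser

class TableColumns(HTMLParser):
    def __init__(self):
        super().__init__()
        self.columns = {}     # column index -> list of cell strings
        self._col = -1        # current column within the current row
        self._in_cell = False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._col = -1    # new row: reset the column counter
        elif tag in ("td", "th"):
            self._col += 1
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.columns.setdefault(self._col, []).append(data.strip())

parser = TableColumns()
parser.feed("<table><tr><th>President</th><th>Pet</th></tr>"
            "<tr><td>Fillmore</td><td>cat</td></tr></table>")
print(parser.columns)  # {0: ['President', 'Fillmore'], 1: ['Pet', 'cat']}
```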

Page 7: Question Answering over Implicitly Structured Web Content

Related Work

Web Question Answering: AskMSR (TREC 2001); Aranea (TREC 2003); Mulder (WWW 2001); A No-Frills Architecture for Lightweight Answer Retrieval (WWW 2007)

Web-scale Information Extraction: QXtract (ICDE 2003), which learns keyword queries to retrieve content; KnowItAll (WWW 2004), minimal supervision at larger scale; TextRunner (IJCAI 2007), a single-pass scan with disambiguation at query time; Towards Domain-Independent Information Extraction from Web Tables (WWW 2007)

Page 8: Question Answering over Implicitly Structured Web Content

Our System TQA: Overview

1. Index all promising HTML tables

2. Translate a question into select/project query

3. Select table rows, project candidate answers

4. Rank candidate answers

5. Return top K answers
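
The deck itself contains no code; the following toy sketch (an in-memory stand-in for the index, and frequency-based ranking as a placeholder for step 4) only illustrates how the five stages fit together:

```python
# Toy end-to-end illustration of the five TQA stages; everything here is a
# simplification for exposition, not the actual implementation.
from collections import Counter

TABLES = [                       # 1. stand-in for the indexed tables
    ["President", "Pet"],
    ["Fillmore", "cat"],
    ["Lincoln", "goat"],
]

def answer(question_keywords, answer_col, k=3):
    # 2-3. "select" rows containing any query keyword, then "project"
    # the column expected to hold the answer
    candidates = [row[answer_col] for row in TABLES
                  if any(kw.lower() in " ".join(row).lower()
                         for kw in question_keywords)]
    # 4. rank candidates by frequency (an AskMSR-style baseline)
    ranked = [c for c, _ in Counter(candidates).most_common()]
    return ranked[:k]            # 5. return the top K answers

print(answer(["fillmore"], answer_col=1))  # ['cat']
```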

Page 9: Question Answering over Implicitly Structured Web Content

TQA: Indexing

Crawl the web and identify “promising” tables (heuristic, could be improved; see the sketch below)

Extract metadata for each table: context, document content, document metadata

Index the extracted metadata
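
The slide only notes that table identification is heuristic; as a guess at the kind of filter involved, the sketch below keeps tables that look relational rather than page-layout scaffolding (the thresholds are invented):

```python
# Hypothetical "promising table" filter: keep tables that look relational.
def is_promising(table_rows, min_rows=2, min_cols=2, max_avg_cell_len=50):
    """table_rows: list of rows, each a list of cell strings."""
    if len(table_rows) < min_rows:
        return False
    widths = {len(row) for row in table_rows}
    if len(widths) != 1 or widths.pop() < min_cols:
        return False  # ragged or too-narrow tables are likely page layout
    # layout tables often hold long prose; relational cells tend to be short
    cells = [cell for row in table_rows for cell in row]
    return sum(len(cell) for cell in cells) / len(cells) < max_avg_cell_len
```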

Page 10: Question Answering over Implicitly Structured Web Content

Table Metadata

Combines information about the source document and the table’s context
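
One plausible shape for the per-table record, using only the fields named on the previous slide (the exact schema is not given in the deck):

```python
# Assumed metadata record; field names follow the previous slide's list.
from dataclasses import dataclass

@dataclass
class TableMetadata:
    context: str           # text surrounding the table in the page
    document_content: str  # content of the hosting document
    document_url: str      # document metadata: URL, title, ...
    document_title: str
```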

Page 11: Question Answering over Implicitly Structured Web Content

TQA Question Processing
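
The body of this slide (likely a diagram) did not survive extraction. As a minimal guess at step 2 of the overview, translating a question into a select/project query might look like the following, with an invented stopword list and a crude wh-word heuristic for the expected answer type:

```python
# Hypothetical question-to-query translation (step 2 of the overview).
STOPWORDS = {"what", "was", "the", "name", "of", "who", "is", "a", "an"}

def to_query(question):
    words = [w.strip("?.,").lower() for w in question.split()]
    # "select" part: content keywords to match against table rows
    keywords = [w for w in words if w and w not in STOPWORDS]
    # "project" part: expected answer type from the wh-word (very crude)
    answer_type = "PERSON" if words[:1] == ["who"] else "UNKNOWN"
    return {"select": keywords, "project": answer_type}

print(to_query("Who invented crocs?"))
# {'select': ['invented', 'crocs'], 'project': 'PERSON'}
```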

Page 12: Question Answering over Implicitly Structured Web Content

TQA: Querying Overview

Page 13: Question Answering over Implicitly Structured Web Content

Features for Ranking Candidate Answers

Page 14: Question Answering over Implicitly Structured Web Content

Ranking Answer Candidates

Frequency-based (AskMSR)

Heuristic weight assignment (improved AskMSR)

Neither is robust or general

Page 15: Question Answering over Implicitly Structured Web Content

Ranking Answer Candidates (cont.)

Solution: machine learning-based ranking

Naïve Bayes: Score(answer_i) = p(relevant(answer_i) | F_i), the probability that candidate i is relevant given its features F_i

RankNet (Burges et al. 2005): scalable neural net implementation, optimized for ranking (predicting an ordering of items, not scores for each item). Trains on pairs, where the first point is to be ranked higher than or equal to the second; uses a cross-entropy cost and gradient descent to set the weights.
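
RankNet itself is a neural network; the following minimal sketch shows only the pairwise cross-entropy idea, with a linear scorer and plain gradient descent standing in for the original implementation:

```python
# Minimal pairwise-ranking sketch in the spirit of RankNet: cross-entropy
# cost over pairs, gradient descent on a linear scorer's weights.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pairwise_ranker(pairs, n_features, lr=0.1, epochs=100):
    """pairs: (f_hi, f_lo) feature-vector pairs, f_hi to be ranked at or
    above f_lo. Returns weights w; score a candidate as w @ features."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for f_hi, f_lo in pairs:
            p = sigmoid(w @ (f_hi - f_lo))   # P(hi ranked above lo)
            # gradient step on the cross-entropy cost -log(p)
            w += lr * (1.0 - p) * (f_hi - f_lo)
    return w

# e.g. learn that feature 0 signals the better candidate
w = train_pairwise_ranker([(np.array([1.0, 0.0]), np.array([0.0, 1.0]))], 2)
```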

Page 16: Question Answering over Implicitly Structured Web Content

Some Implementation Details

Lucene, distributed indices (20M tables per index)

NLP tools: MS-internal named entity tagger (many free ones exist); Porter stemmer

Relatively lightweight architecture: the client (question processing) runs on a desktop machine; the table index server is a dual-processor machine with 8 GB RAM running WinNT

Page 17: Question Answering over Implicitly Structured Web Content

Experimental Setup

Queries: TREC QA 2002, 2003 questions

Corpus: 100M web pages (a “random” subset of an MSN Search crawl, from 2005)

Evaluation: TREC QA factoid patterns, i.e. “minimal” regular expressions that match only right answers; not comprehensive (based on the judgment pool)

Page 18: Question Answering over Implicitly Structured Web Content

Evaluation Metrics

MRR @ K (mean reciprocal rank): 1 / min{ i in 1..K : rel(answer_i) }, i.e. the reciprocal rank of the first correct answer within the top K (0 if none), averaged over all questions

Recall @ K: the fraction of the questions for which a system returned a correct answer ranked at or above K
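
Both metrics are easy to state in code; this sketch assumes each question's result is a ranked list of booleans (True where the answer matches a TREC pattern):

```python
# Sketch of the two metrics over per-question ranked relevance lists.
def mrr_at_k(results, k):
    """Mean over questions of 1/rank of the first correct answer in the
    top k (0 if none)."""
    total = 0.0
    for ranked in results:
        for rank, correct in enumerate(ranked[:k], start=1):
            if correct:
                total += 1.0 / rank
                break
    return total / len(results)

def recall_at_k(results, k):
    """Fraction of questions with a correct answer at rank <= k."""
    return sum(1 for ranked in results if any(ranked[:k])) / len(results)

print(mrr_at_k([[False, True], [True]], k=2))      # (0.5 + 1.0) / 2 = 0.75
print(recall_at_k([[False, True], [False]], k=2))  # 0.5
```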

Page 19: Question Answering over Implicitly Structured Web Content

Results (1): Accuracy vs. Corpus Size

Page 20: Question Answering over Implicitly Structured Web Content

Results (2): Comparing Ranking Methods

If the output is consumed by another system rather than a person, a large K is acceptable

Page 21: Question Answering over Implicitly Structured Web Content

Results (3): Accuracy on Hard Questions

TQA can retrieve the answer within the top 100 even when the best QA system is not able to return any answer

Page 22: Question Answering over Implicitly Structured Web Content

Result Summary

Indexing more than 150M tables is required before respectable accuracy is achieved

Performance was around the median on the TREC 2002 and 2003 benchmarks

Can be helpful for questions that are difficult for traditional QA systems

Page 23: Question Answering over Implicitly Structured Web Content

Promising Directions for Future Work

Crawl-time: aggressive pruning/classification

Index-time: integration of related tables

Query-time: taxonomy integration / hypernymy

User behavior modeling: past clickthrough to rerank candidate tables and answers; query reformulation

Page 24: Question Answering over Implicitly Structured Web Content

Conclusions

Implicitly structured web content can be useful for web question answering

We demonstrated the scalability of a lightweight table-based web QA approach

Much room remains for improvement and future research

Page 25: Question Answering over Implicitly Structured Web Content

Thank you!

Questions?

E-mail: [email protected]

User Interactions for Web Question Answering: http://www.mathcs.emory.edu/~eugene/uqa/

E. Agichtein, E. Brill, S. Dumais, Mining user behavior to improve web search ranking, SIGIR 2006

E. Agichtein, User Behavior Mining and Information Extraction: Towards closing the gap, IEEE Data Engineering Bulletin, Dec. 2006

E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, Finding High-Quality Content in Social Media with Applications to Community-based Question Answering, to appear in WSDM 2008