A System for Query-Specific Document Summarization

Ramakrishna Varadarajan, Vagelis Hristidis. Florida International University, School of Computing and Information Sciences, Miami.

Slide 1

Roadmap

• Need for query-specific summaries

• Our approach

– Building a document graph

– Definition of summary

– Rank summaries

• Efficient computation of summaries

• Evaluation of summarization process

– Quality

– Performance

• Related work

• Conclusions

Florida International University (FIU)

Slide 3

Need for Query-Specific Summaries

• Locating relevant information is hard.

• Summaries are helpful because they:

– Provide a quick preview of the document.

– Allow users to quickly decide relevance.

– Save the user’s browsing time.

• Success of Web search engines – query-specific snippets are important.

• Two categories of summaries:

– Query-Independent – most prior work.

– Query-Specific – applicable to Web search engines.

Slide 4

Motivation

Query-Specific Summaries

Slide 5

Motivation

Drawbacks

• Association between query keywords is unclear.

• Naïve approach for summarization.

• Ignores semantic relations between keywords in the document.

Summarization research to date

• Mostly query-independent.

• Not applicable to Web search.

Slide 7

Our Approach

• Document → graph

• We call it the Document Graph.

Three Steps

Step 1: Preprocess

• Build a document graph, G.

Step 2: Summary Generation

• Given a query Q and a document graph G,

Summaries → spanning trees that cover all keywords

Step 3: Rank spanning trees.

Slide 8

Building Document Graphs

• Parse the document.

• Split it into text fragments (using delimiters or tags).

• Text fragments are represented as nodes.

• Add an edge between two nodes if they are semantically related.

• Edges: semantic links

• Edge weights: degree of association
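
The first two steps above (parse, then split into fragments that become nodes) can be sketched as follows. This is a minimal illustration, not the system's actual parser: the function names `split_into_fragments` and `make_nodes` and the toy document are assumptions made for this example, and only the plain-text newline delimiter is handled.

```python
def split_into_fragments(text, delimiter="\n"):
    """Split a plain-text document into non-empty text fragments."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]


def make_nodes(fragments):
    """Assign node ids v0, v1, ... to the fragments, in document order."""
    return {f"v{i}": fragment for i, fragment in enumerate(fragments)}


# Toy document (invented for illustration), one paragraph per line.
toy_document = (
    "First paragraph about topic A.\n"
    "\n"
    "Second paragraph about topic B.\n"
    "Third paragraph."
)
nodes = make_nodes(split_into_fragments(toy_document))
```

Empty lines are dropped, so every node holds actual text; edges between the nodes are added in a separate pass.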

Slide 9

Example

[Figure: Sample Document and its Document Graph, with nodes v0–v16 and weighted edges; edge weights range from 0.015 to 0.091.]

• Parsing delimiter – NewLine.

• Text Fragments – Paragraphs.

• 17 text fragments (v0…v16).

• 17 nodes in Document Graph.

Slide 10

Input parameters for Document Graph construction

– Parsing delimiters

• For plain text – newline or period

• For HTML – tags (<p>, <br>, <ul>, <ol>, <table>, … etc.)

– Threshold for edge weights

• Trade-off between quality and performance.

• Edges with weights below the threshold are not added.

– Maximum fragment size

• Limit on node size

Slide 11

Computing edges of Document Graphs

• For every pair of nodes:

– Common words are used (stop words are ignored).

– A thesaurus and stemmer are used (relying on Oracle Intermedia Text services).

– If EScore(e) ≥ threshold, an edge is added.

• Special case

– Adjacent text fragments.

– They share close proximity.

– Weight = max(EScore(e), threshold).
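
The edge rule on this slide, including the special case for adjacent fragments, can be sketched as below. It is an illustration under assumptions: nodes are numbered 0..n-1 in document order, `score` is a hypothetical callable standing in for EScore, and the toy scores are invented.

```python
from itertools import combinations


def build_edges(num_nodes, score, threshold):
    """Return {(u, v): weight} for nodes numbered 0..num_nodes-1 in document order.

    Adjacent fragments always get an edge with weight max(score, threshold);
    other pairs get an edge only if their score reaches the threshold.
    """
    edges = {}
    for u, v in combinations(range(num_nodes), 2):
        s = score(u, v)
        if v - u == 1:
            edges[(u, v)] = max(s, threshold)
        elif s >= threshold:
            edges[(u, v)] = s
    return edges


# Toy scores (invented); pairs not listed score 0.0.
toy_scores = {(0, 2): 0.03, (1, 2): 0.02}
edges = build_edges(4, lambda u, v: toy_scores.get((u, v), 0.0), threshold=0.015)
```

With this rule the graph stays connected along the document order even when neighbouring fragments share no words.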

Slide 12

Edge Scoring

• EScore

A tf*idf adaptation.

– Query-independent.

– For an edge e(u,v):

\[ EScore(e) = \frac{\sum_{w \in t(u) \cap t(v)} \left( tf(t(u), w) + tf(t(v), w) \right) \cdot idf(w)}{size(t(u)) + size(t(v))} \]

w – common word,

t(v) – text fragment corresponding to node v,

size(t(v)) – number of words in text fragment t(v).
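
As a rough illustration of the EScore definition on this slide (sum of (tf in t(u) + tf in t(v)) · idf over the common words, divided by the two fragment sizes), a minimal sketch follows. The function name `escore`, the pre-tokenised input, and the caller-supplied `idf` table are assumptions for this example; stemming and thesaurus lookup are not modelled.

```python
from collections import Counter


def escore(words_u, words_v, idf, stop_words=frozenset()):
    """Score the edge between two fragments given as lists of words."""
    size = len(words_u) + len(words_v)
    if size == 0:
        return 0.0
    tf_u = Counter(w for w in words_u if w not in stop_words)
    tf_v = Counter(w for w in words_v if w not in stop_words)
    common = tf_u.keys() & tf_v.keys()
    total = sum((tf_u[w] + tf_v[w]) * idf[w] for w in common)
    return total / size


# Toy idf values (invented for illustration).
toy_idf = {"brain": 2.0, "chip": 1.5, "research": 1.0, "implant": 3.0}
example = escore(["brain", "chip", "research"], ["brain", "implant"], toy_idf)
```

Here the only common word is "brain", so the score is (1 + 1) · 2.0 / (3 + 2). Rare words have a large idf and therefore push the edge weight up.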

Slide 13

Example (cont’d)

[Figure: Sample Document and its Document Graph, repeated from the earlier example.]

Common words:

• BrainGate,

• Cyberkinetics

Reasons for high weight

• Rare words (idf is large).

Slide 14

Computing Query-Specific Summaries

• Given a query Q and a Document Graph G:

Summary → Minimal Total Spanning Tree.

Minimal Total Spanning Tree

• Total – every keyword is contained in at least one node (AND semantics)

• Minimal – avoids redundancy (useless leaves are eliminated)

Summarization Problem

Given – Document Graph G and a query Q

Find – Top (best) Minimal Total Spanning Tree (Summary)
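
The two conditions can be checked mechanically. The sketch below assumes a candidate tree is given as a set of node ids plus a list of edges, and that `node_words` maps each node to the set of words in its fragment; these names and the toy data are illustrative only. "Minimal" is read here as "no leaf can be dropped without losing a keyword".

```python
def is_total(tree_nodes, node_words, query):
    """True if every query keyword occurs in at least one node of the tree."""
    covered = set().union(*(node_words[v] for v in tree_nodes))
    return set(query) <= covered


def is_minimal(tree_nodes, tree_edges, node_words, query):
    """True if removing any single leaf would leave some keyword uncovered."""
    degree = {v: 0 for v in tree_nodes}
    for u, v in tree_edges:
        degree[u] += 1
        degree[v] += 1
    leaves = [v for v in tree_nodes if degree[v] <= 1]
    return all(
        not is_total(tree_nodes - {leaf}, node_words, query) for leaf in leaves
    )


# Toy data (invented for illustration).
node_words = {"v0": {"brain", "chip"}, "v10": {"research"}, "v3": {"monkey"}}
query = ["brain", "chip", "research"]
small_tree = ({"v0", "v10"}, [("v0", "v10")])
padded_tree = ({"v0", "v10", "v3"}, [("v0", "v10"), ("v10", "v3")])
```

In the toy data, `small_tree` is total and minimal, while `padded_tree` is total but not minimal because the leaf v3 contributes no keyword.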

Slide 15

Example

[Figure: Sample Document and its Document Graph, repeated from the earlier example. The top summary tree joins v10 and v0 by an edge of weight 0.017; the values 0.046 and 0.008 also appear in the figure.]

Top Summary for “Brain Chip Research”

Score = 67.74

Brain chip offers hope for paralyzed. Donoghue’s initial research published in the science journal Nature in 2002 consisted of attaching an implant to a monkey’s brain that enabled it to play a simple pinball computer game remotely.

Slide 16

Summary Scoring Function

Requirements

Properties of Good Summaries :

• Highly relevant nodes (fragments) improve Score.

• Loose semantic Links degrade Score.

• Large spanning trees get a degraded Score.

• Based on Query-dependent & Query-Independent factors.

Summary Scoring

• This function satisfies these requirements.

• The best summary has the minimum score.

\[ Score(T) = a \sum_{\text{edge } e \in T} \frac{1}{EScore(e)} + b \, \frac{1}{\sum_{\text{node } v \in T} NScore(v)} \]

a and b are calibrating parameters (a = 1 and b = 0.5).
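
A small sketch of the scoring function, assuming it has the form Score(T) = a · Σ 1/EScore(e) over the tree's edges plus b · 1/Σ NScore(v) over the tree's nodes, with lower scores being better. The function name and the toy numbers are made up for illustration.

```python
def summary_score(edge_scores, node_scores, a=1.0, b=0.5):
    """Score a summary tree; lower is better.

    edge_scores: EScore of each edge in the tree.
    node_scores: NScore of each node in the tree (must not sum to zero).
    """
    edge_term = sum(1.0 / s for s in edge_scores)
    node_term = 1.0 / sum(node_scores)
    return a * edge_term + b * node_term


# Toy values (invented): same nodes, one tight link versus one loose link.
tight = summary_score([0.5], [0.25, 0.25])   # 1/0.5 + 0.5 * 1/0.5
loose = summary_score([0.25], [0.25, 0.25])  # 1/0.25 + 0.5 * 1/0.5
```

The loose link yields the larger (worse) score, and every extra edge adds another positive term, which matches the requirement that loose links and large trees degrade the score.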

Slide 17

Summary Node Scoring

• Node Scoring

–Widely used Okapi weighting.

–Query Dependent.

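– NScore(v) =

\[ \sum_{t \in Q, d} \ln\frac{N - df + 0.5}{df + 0.5} \cdot \frac{(k_1 + 1) \cdot tf}{k_1 \left( (1 - b) + b \, \frac{dl}{avdl} \right) + tf} \cdot \frac{(k_3 + 1) \cdot qtf}{k_3 + qtf} \]

N – number of documents in the collection.

tf – term frequency.

df – document frequency.

dl – document length.

avdl – average document length.

qtf – term frequency in the query.

k1, b, k3 – constants.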
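
For illustration, the Okapi weighting can be computed as in the sketch below, summing over the query terms that occur in the fragment. The function name, the word-list inputs, and the constant values k1 = 1.2, b = 0.75, k3 = 1000 are assumptions for this example (commonly used Okapi settings); the slides do not state which values the system uses.

```python
import math


def nscore(fragment_words, query_words, df, num_docs, avdl,
           k1=1.2, b=0.75, k3=1000.0):
    """Okapi-style score of one text fragment for a query.

    df maps each term to its document frequency; num_docs is N.
    """
    dl = len(fragment_words)
    score = 0.0
    for term in set(query_words):
        tf = fragment_words.count(term)
        if tf == 0:
            continue  # only terms present in both query and fragment count
        qtf = query_words.count(term)
        idf_part = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5))
        doc_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
        query_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf_part * doc_part * query_part
    return score


# Toy collection statistics (invented for illustration).
example = nscore(["brain", "chip", "brain"], ["brain"],
                 df={"brain": 1}, num_docs=10, avdl=3.0)
```

A fragment that contains none of the query terms scores 0, so it contributes nothing to the node term of a summary.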

Slide 19

ALGORITHMS

• Adaptations of BANKS [ICDE02] Algorithms

• Input : Document Graph G and Query Q

• Output : Minimal Total Spanning trees (Summaries)

• Enumeration Algorithm.

• Expanding Search Algorithm.

Pre-computation:

– A full-text index.

– All-pairs shortest paths for each document graph

(the weight of edge e is 1/EScore(e)).
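
The shortest-path pre-computation can be sketched with the Floyd–Warshall algorithm. The slides do not say which all-pairs algorithm is used, so that choice, the function name, and the toy graph are assumptions for this example; the only detail taken from the slide is that an edge with score EScore(e) has length 1/EScore(e).

```python
import math


def all_pairs_shortest_paths(nodes, escore_edges):
    """Floyd-Warshall over an undirected graph with edge length 1/EScore.

    nodes: list of node ids; escore_edges: {(u, v): EScore}.
    Returns dist[u][v], with math.inf for unreachable pairs.
    """
    dist = {u: {v: (0.0 if u == v else math.inf) for v in nodes} for u in nodes}
    for (u, v), s in escore_edges.items():
        length = 1.0 / s
        dist[u][v] = min(dist[u][v], length)
        dist[v][u] = min(dist[v][u], length)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Toy graph (invented): the direct v0-v2 link is weak, the detour via v1 is strong.
toy_edges = {("v0", "v1"): 0.5, ("v1", "v2"): 0.5, ("v0", "v2"): 0.1}
dist = all_pairs_shortest_paths(["v0", "v1", "v2"], toy_edges)
```

Because strongly related fragments are close under this length, the path v0, v1, v2 (length 4) beats the direct but weak edge (length 10).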

Slide 21

User Surveys

• To evaluate the Quality of Summaries

• Subjects : 15 Students from FIU (all levels & various majors).

• Users evaluate summaries based on their Quality.

• Rating: 1 (least descriptive) to 5 (most descriptive)

• Surveys

–Comparison with Google & MSN Desktop.

–Comparison with DUC 2005 datasets.

Slide 22

Comparison with Google & MSN Desktop Engines

           Google Desktop    MSN Desktop       Our Approach
Queries    D1      D2        D1      D2        D1      D2
1          2.33    3.67      2.33    3.67      4.87    3.67
2          2.00    3.33      2.00    3.00      4.33    3.33
3          3.00    2.67      0.67    3.00      4.93    4.00
4          1.67    2.67      1.67    3.00      4.67    4.00
5          2.00    1.67      3.00    1.00      4.00    3.67

Queries    Document D1                      Document D2
1          Microsoft worm protection        IT Research awards
2          Anti-virus protection            Algorithms development research
3          Recovering worm deleted files    Software projects
4          Worm affected agencies           Large research grants
5          Deleted computer software        Computer network security project

Slide 23

Performance Experiments

[Figure: processing time (msec) vs. number of keywords in query (2–5), Multi-result Enumeration vs. Multi-result Expanding]

[Figure: processing time (msec) vs. number of keywords in the query (2–5), Top-1 Enumeration vs. Top-1 Expanding]

Average times to calculate node weights

Number of keywords    2       3       4        5
Time (msec)           5.31    9.37    11.50    17.33

Average ranks of Top-1 Algorithms

Number of keywords                   2      3      4      5
Top-1 Enumeration Algorithm          1.4    1.8    2.1    2.78
Top-1 Expanding Search Algorithm     1.1    1.3    1.4    1.8

News articles from the science section of cnn.com

Slide 25

Related Work

Document Summarization

• Mostly query-independent

• Summarizing Web pages

– Berger et al. [SIGIR 2000] synthesize summaries.

– Paris et al. [CIKM 2000] use anchor text (ignoring content).

• Splitting Web pages into blocks

– Song et al. [WWW 2004]: block importance models (learning algorithms)

– Cai et al. [SIGIR 2004]: block-level link analysis

• Documents modeled as graphs

– LexRank: sentence centrality using link analysis.

– TextRank: “representative” sentences using link analysis.

Keyword Search in Data Graphs

– BANKS [ICDE 2002]: group Steiner tree problem

– DISCOVER, DBXplorer.

– XRANK [2003]: search in XML documents.

Slide 26

Conclusions

• Method for Query-Specific Summarization.

• Exploiting inherent structure of documents for the purpose of Summarization.

• Enhanced User Satisfaction – User Surveys.

A prototype of the system is available at:

http://dbir.cs.fiu.edu/summarization

Slide 27

Thank You !!!

Questions ???

Slide 28

Enumeration Algorithm

Slide 29

Expanding Search Algorithm

[Figure: expanding search example with labels w1, w3, v2, v3, v4 and edge weights 0.01, 0.02, 0.02]

Slide 30

Comparison with DUC peers

Query 1 (International Organized Crime), DUC Topic ID: d301i

Doc. ID        DUC Peer    Our Approach
FT941-3237     2.33        4.66
FT944-8297     2.50        3.33
FT931-3563     2.83        3.00
FT943-16477    4.00        4.17
FT943-16238    3.67        3.67

Query 2 (Women in Parliaments), DUC Topic ID: d321f

Doc. ID        DUC Peer    Our Approach
FT921-7786     4.00        2.50
FT922-190      2.00        4.00
FT921-937      2.00        4.33
FT922-13353    2.83        4.17
FT921-74       2.33        3.67

Slide 31

DEMO
