On Burstiness-Aware Search for Document Sequences

Theodoros Lappas Benjamin Arai

Manolis Platakis Dimitrios Kotsakos

Dimitrios Gunopulos

SIGKDD 2009

SIGKDD 2009 Theodoros Lappas

Outline

The Problem: How to effectively search through large document sequences (e.g. newspapers)

Previous Work

Using Bursty Terms to identify Events

Modeling Burstiness using Discrepancy Theory

Our Search Framework

Experiments

The Problem

Given a large sequence of documents (e.g. a daily newspaper) and a query of terms, find documents that discuss major events relevant to the query.

Consider the San Francisco Call, a daily newspaper from the 1900s.

We are given the query <theater, disaster>

Two candidate events, relevant to the query:

The disastrous fire of 1903 in the Iroquois Theater in Chicago

A disastrous performance given by an actor in a local theater

Clearly the first event is far more influential: articles on this event should be ranked higher!

Previous Work

Burstiness explored in different domains

Burst Detection - Kleinberg 2002

Stream clustering - He et al. 2007

Graph Evolution - Kumar et al. 2003

Event Detection - Fung et al. 2005

Nothing on Burstiness-aware Search:

Standard Information Retrieval techniques do not consider the underlying events discussed in the collection.

Event Detection Techniques do not consider user input.

Burstiness

Bursty periods: periods of “unusually” high frequency

Unusual? Deviating from an expected baseline.

Major Events are discussed in numerous articles for an extended timeframe.

The event’s keywords exhibit high frequency bursts during the timeframe

Figure: frequency of the term “earthquake” in the SF Call (1908-1909).

Modeling Burstiness using Discrepancy Theory

Discrepancy: Used to express and quantify the deviation from the norm

In our case: find intervals on the timeline where the observed frequency differs the most from the expected frequency

Maximal Interval: an interval that neither contains nor is contained in an interval of higher score.

MAX-1: Linear-Time Algorithm for Maximal Interval Extraction.
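
To make the discrepancy idea concrete, here is a minimal Python sketch. It assumes a uniform baseline (each of the m time steps is expected to hold 1/m of the term's occurrences) and scores each step as observed share minus expected share; that scoring rule and the function names are illustrative assumptions. It finds only the single highest-scoring interval, using Kadane's maximum-sum scan as a stand-in; it is not the MAX-1 algorithm itself.

```python
from typing import List, Tuple


def discrepancy_scores(freqs: List[int]) -> List[float]:
    """Per-step score: observed share of occurrences minus the uniform expected share."""
    total, m = sum(freqs), len(freqs)
    if total == 0 or m == 0:
        return [0.0] * m
    return [f / total - 1.0 / m for f in freqs]


def best_interval(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, score) of the highest-scoring contiguous interval (inclusive)."""
    if not scores:
        raise ValueError("scores must not be empty")
    best = (0, 0, scores[0])
    cur_start, cur_sum = 0, 0.0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help: start a new interval here.
            cur_start, cur_sum = i, s
        else:
            cur_sum += s
        if cur_sum > best[2]:
            best = (cur_start, i, cur_sum)
    return best


# Toy weekly counts for one term: the burst sits at steps 3 and 4.
freqs = [1, 0, 2, 9, 8, 1, 0, 1]
start, end, score = best_interval(discrepancy_scores(freqs))
```

The interval score is the sum of the per-step scores, so a long interval is rewarded only while the term keeps appearing more often than the baseline predicts.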

Baseline - Discussion

Baseline can be dynamic :

– frequency sequence(s) from previous year(s)

– Time Series Decomposition to extract Seasonal, Trend and Irregular Components
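
A rough Python sketch of the first option, a baseline built from previous years. It assumes the yearly frequency sequences are already aligned step by step (e.g. week 1 with week 1) and simply averages them; the averaging rule and both function names are illustrative assumptions. Time series decomposition is not covered here.

```python
from typing import List, Sequence


def baseline_from_previous_years(history: Sequence[Sequence[float]]) -> List[float]:
    """Average the aligned frequency sequences of earlier years, step by step."""
    if not history:
        raise ValueError("need at least one previous year")
    length = len(history[0])
    if any(len(year) != length for year in history):
        raise ValueError("all years must have the same number of time steps")
    return [sum(year[i] for year in history) / len(history) for i in range(length)]


def deviations(observed: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Observed minus expected frequency at each time step."""
    if len(observed) != len(baseline):
        raise ValueError("observed and baseline must have the same length")
    return [o - b for o, b in zip(observed, baseline)]


# Two earlier years give the expectation; the current year is compared against it.
baseline = baseline_from_previous_years([[2, 4, 2], [4, 2, 2]])
scores = deviations([3, 10, 1], baseline)
```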

A Diagram of our framework

Phase 1: Preprocessing

Input: a raw document sequence.

Output: the set of terms to be monitored.

Preprocessing Methods:

Stemming, Synonym matching, etc.

Stopwords Removal

Frequency Pruning for rare words
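
The preprocessing steps above could look roughly like this in Python. The stopword list, the regular-expression tokenizer, and the `min_doc_freq` threshold are all illustrative assumptions; stemming and synonym matching are left out because they need an external resource.

```python
import re
from collections import Counter
from typing import Iterable, List, Set

# Toy stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "in", "and", "at", "is"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def monitored_terms(docs: Iterable[str], min_doc_freq: int = 2) -> Set[str]:
    """Terms that survive stopword removal and appear in at least min_doc_freq documents."""
    doc_freq: Counter = Counter()
    for doc in docs:
        # Count each term once per document.
        doc_freq.update(set(tokenize(doc)) - STOPWORDS)
    return {term for term, df in doc_freq.items() if df >= min_doc_freq}


docs = ["Fire at the theater", "Theater fire kills hundreds", "The mayor speaks"]
terms = monitored_terms(docs)
```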

Phase 2 – Retrieval of Bursty Intervals

Input: A term

Output: Set of non-overlapping intervals + their burstiness scores

1) Create the frequency sequence for the term.

2) Extract bursty intervals using the MAX-1 algorithm
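
The two steps can be sketched in Python as follows. Assumptions: documents are grouped by time step, a term's frequency is the number of documents containing it, and each step is scored against a uniform baseline. The extraction below recursively takes the maximum-sum interval and then searches to its left and right, which yields non-overlapping intervals but takes quadratic time in the worst case; it is a simple stand-in, not the linear-time MAX-1.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int, float]  # (start, end, score), inclusive indices


def frequency_sequence(docs_per_step: Sequence[Sequence[str]], term: str) -> List[int]:
    """For each time step, count the documents that contain the term."""
    return [sum(term in doc.lower().split() for doc in docs) for docs in docs_per_step]


def _max_segment(scores: List[float], lo: int, hi: int) -> Interval:
    """Kadane's scan over scores[lo:hi]; returns the best (start, end, score)."""
    best = (lo, lo, scores[lo])
    cur_start, cur_sum = lo, 0.0
    for i in range(lo, hi):
        if cur_sum <= 0:
            cur_start, cur_sum = i, scores[i]
        else:
            cur_sum += scores[i]
        if cur_sum > best[2]:
            best = (cur_start, i, cur_sum)
    return best


def bursty_intervals(freqs: Sequence[int]) -> List[Interval]:
    """Non-overlapping positive-score intervals, sorted by start position."""
    total, m = sum(freqs), len(freqs)
    if total == 0:
        return []
    scores = [f / total - 1.0 / m for f in freqs]
    found: List[Interval] = []
    stack = [(0, m)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        start, end, score = _max_segment(scores, lo, hi)
        if score <= 0:
            continue  # nothing bursty left in this range
        found.append((start, end, score))
        stack.append((lo, start))      # search to the left of the burst
        stack.append((end + 1, hi))    # search to the right of the burst
    return sorted(found)
```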

Phase 3 – Interval Indexing

Input: Set of bursty intervals for each term

Output: An Index of Intervals

Simple, easily updatable structure

Need to support multi-term queries
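
One possible shape for such an index, sketched in Python: a map from each term to its bursty intervals, kept sorted by descending burstiness score so that top-k evaluation can scan the best intervals first. The class name, the method names, and the sort order are assumptions, not details given on the slides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[int, int, float]  # (start, end, score)


class InvertedIntervalIndex:
    """Hypothetical term -> bursty-interval lists, highest score first."""

    def __init__(self) -> None:
        self._lists: Dict[str, List[Interval]] = defaultdict(list)

    def add(self, term: str, start: int, end: int, score: float) -> None:
        """Insert one interval; re-sorting keeps updates simple."""
        intervals = self._lists[term]
        intervals.append((start, end, score))
        intervals.sort(key=lambda iv: iv[2], reverse=True)

    def lookup(self, term: str) -> List[Interval]:
        """Return a copy of the term's list, or an empty list for unknown terms."""
        return list(self._lists.get(term, []))


index = InvertedIntervalIndex()
index.add("theater", 10, 20, 0.4)
index.add("theater", 50, 52, 0.7)
index.add("disaster", 12, 18, 0.5)
```

A multi-term query then reads one list per query term.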

Inverted Interval Index

Up Next: Query Evaluation

Phase 4: Top-k Evaluation for Multi-Term Queries

Customized Version of the Threshold Algorithm (TA) for top-k Evaluation.

Standard Version:

– Terms-to-Documents

– Each document either appears in a term’s list or not

Our Version (TA*):

– Terms-to-Intervals

– A bursty interval of a term t1 may overlap multiple intervals of a term t2.
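
The overlap issue can be illustrated with a naive Python evaluator. It intersects every interval of one term with every interval of the next and sums their scores, so one interval of t1 can yield several candidates. This exhaustive version is only a reference point, not TA*: it has no threshold-based early termination, and combining scores by summation over the intersection is an assumption.

```python
import heapq
from typing import Dict, List, Optional, Sequence, Tuple

Interval = Tuple[int, int, float]  # (start, end, score), inclusive


def overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Intersection of two intervals with summed score, or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    if start > end:
        return None
    return (start, end, a[2] + b[2])


def top_k_intervals(index: Dict[str, List[Interval]],
                    query: Sequence[str], k: int) -> List[Interval]:
    """Exhaustively combine the query terms' intervals and keep the k best."""
    if not query or any(term not in index for term in query):
        return []
    candidates = list(index[query[0]])
    for term in query[1:]:
        combined = []
        for a in candidates:
            for b in index[term]:
                c = overlap(a, b)
                if c is not None:
                    combined.append(c)
        candidates = combined
    return heapq.nlargest(k, candidates, key=lambda iv: iv[2])


index = {
    "theater": [(10, 20, 0.5), (50, 60, 0.25)],
    "disaster": [(15, 25, 0.5), (18, 19, 0.125), (55, 58, 0.125)],
}
best = top_k_intervals(index, ["theater", "disaster"], k=2)
```

Here the theater interval (10, 20) overlaps two disaster intervals and so contributes two candidates.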

Up Next: Experiments

Empirical Evaluation

San Francisco Call: a daily newspaper, issues published 1900-1909, ~400,000 articles.

List of Major Events from 1900-1909 (from Wikipedia) + query for each event.

Major Events List

Experiment 1 - Query Expansion

1) Submit respective query for each event in Major Events List.

2) Get top interval

3) Report the 10 terms that appear in the most document titles within the interval
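
Step 3 can be sketched in Python as below. Assumptions: each document is a (date, title) pair with dates as comparable values (integers here), titles are tokenized with a simple regular expression, and an optional stopword set is filtered out. The function name and parameters are hypothetical.

```python
import re
from collections import Counter
from typing import FrozenSet, Iterable, List, Tuple


def expansion_terms(titled_docs: Iterable[Tuple[int, str]],
                    start: int, end: int, n: int = 10,
                    stopwords: FrozenSet[str] = frozenset()) -> List[str]:
    """The n terms that occur in the most document titles dated within [start, end]."""
    title_counts: Counter = Counter()
    for date, title in titled_docs:
        if start <= date <= end:
            # A term counts once per title, however often it repeats.
            title_counts.update(set(re.findall(r"[a-z]+", title.lower())) - stopwords)
    return [term for term, _ in title_counts.most_common(n)]


docs = [
    (3, "King Umberto shot"),
    (4, "Umberto funeral in Rome"),
    (5, "Rome mourns Umberto"),
    (9, "Harvest news"),
]
top_terms = expansion_terms(docs, start=3, end=5, n=2)
```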

Example 1

Event: King Umberto I of Italy is assassinated by Italian-born anarchist Gaetano Bressi.

Query: “king assassination”

Top 10 terms: Umberto, july, state, anarchist, italy, unit, Rome, Bressi, general, police

Example 2

Event: Louis Bleriot is the first man to fly across the English Channel in an aircraft.

Query: “English channel”

Top 10 terms: flight, july, miles, cross, aviator, attempt, return, Bleriot, condition, machine

Experiment 2 – Burst Detection

1) Submit respective query for each event in Major Events List.

2) Get top reported interval

3) Compare with actual event date

We use MAX-1, MAX-2 to extract bursty intervals.

MAX-2 :

– Re-run MAX-1 on each interval

– Obtain nested structure
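
A small Python sketch of the nesting idea. It assumes that "re-running" means recomputing the uniform baseline inside the first-level interval and extracting the best interval again; that reading, the uniform baseline, and the use of Kadane's scan in place of MAX-1 are all assumptions.

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int, float]  # (start, end, score), inclusive


def best_interval(freqs: Sequence[int]) -> Interval:
    """Highest-scoring interval under a uniform baseline (Kadane's scan)."""
    total, m = sum(freqs), len(freqs)
    if total == 0:
        raise ValueError("term never occurs in this range")
    scores = [f / total - 1.0 / m for f in freqs]
    best = (0, 0, scores[0])
    cur_start, cur_sum = 0, 0.0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            cur_start, cur_sum = i, s
        else:
            cur_sum += s
        if cur_sum > best[2]:
            best = (cur_start, i, cur_sum)
    return best


def refine(freqs: Sequence[int], start: int, end: int) -> Interval:
    """Second level: re-run the extraction inside [start, end] with a local baseline."""
    s, e, score = best_interval(freqs[start:end + 1])
    return (start + s, start + e, score)


# A mildly elevated stretch (steps 4-8) with a sharp peak at step 6.
freqs = [0, 0, 0, 0, 4, 4, 30, 4, 4, 0, 0, 0, 0, 0, 0, 0]
outer = best_interval(freqs)
inner = refine(freqs, outer[0], outer[1])
```

The first pass returns the whole elevated stretch; the second pass narrows it to the peak, giving the nested structure.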

Examples

Event: A fire at the Iroquois Theater in Chicago kills 600.

Query: <theater, disaster>

ACTUAL: Dec 30 1903

MAX-1: 22 Dec - 20 Aug

MAX-2: 31 Dec - 26 Jan

Event: A fire aboard the steamboat General Slocum in New York City’s East River kills 1,021.

Query: <steamboat, disaster>

ACTUAL: Jun 15 1904

MAX-1: 14 May - 4 Sep

MAX-2: 16 Jun - 20 Jun

Conclusion

The first efficient end-to-end framework for burstiness-aware search in document sequences.

Future Work:

– Evaluate on even larger Corpora

– Evaluate on more types of text

Thank you!!!