DOCUMENT UPDATE SUMMARIZATION USING INCREMENTAL HIERARCHICAL CLUSTERING
CIKM’10 (Dingding Wang, Tao Li)
Advisor: Koh, Jia-Ling Presenter: Nonhlanhla Shongwe
1
Preview
• Introduction
• Incremental Hierarchical Clustering Based Document Update Summarization
• Incremental Hierarchical Sentence Clustering (IHSC)
  o The COBWEB algorithm
  o COBWEB for text
• Algorithm
• Evaluation measures
• Experiments and results
2
Introduction
• Document summarization has been receiving much attention due to:
• The increasing number of documents on the internet
• The need to help readers extract the information they are interested in efficiently
• Most document summarization techniques operate in batch mode
3
Introduction (cont’d)
• The two most widely used summarization methods
• First: clustering-based
• A term-sentence matrix is formed from the documents
• Sentences are grouped into different clusters
• A score is attached to each sentence using average cosine similarity
• The sentence with the highest score in each cluster forms part of the summary
4
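The clustering-based steps above can be sketched as follows. The clustering itself is assumed to have been done already; the scoring uses plain term-frequency vectors and average cosine similarity, as an illustrative simplification:

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two term-frequency dicts
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(clusters):
    # clusters: list of lists of sentences; pick, per cluster, the sentence
    # with the highest average cosine similarity to its cluster-mates
    summary = []
    for cluster in clusters:
        vecs = [Counter(s.lower().split()) for s in cluster]
        def avg_sim(i):
            others = [cosine(vecs[i], v) for j, v in enumerate(vecs) if j != i]
            return sum(others) / len(others) if others else 0.0
        best = max(range(len(cluster)), key=avg_sim)
        summary.append(cluster[best])
    return summary
```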
Introduction (cont’d)
• Second: graph-ranking-based
• Constructs a sentence graph, where each node is a sentence in the document collection
• An edge is formed between a sentence pair if:
• The similarity between the pair of sentences is above a threshold
• They belong to the same document
• Sentences are selected to form the summary by voting from their neighbors
5
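A minimal sketch of the graph-ranking idea. The slide does not fix the similarity measure, so a simple word-overlap (Jaccard) similarity is assumed here, and ranking is done by degree-style voting rather than full eigenvector centrality:

```python
def jaccard(a, b):
    # word-overlap similarity (an illustrative choice of similarity function)
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def graph_rank(sentences, sim=jaccard, threshold=0.1):
    # build an edge between every sentence pair whose similarity exceeds the
    # threshold, then score each sentence by the votes (summed edge weights)
    # it receives from its neighbors
    n = len(sentences)
    score = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            s = sim(sentences[i], sentences[j])
            if s > threshold:
                score[i] += s
                score[j] += s
    order = sorted(range(n), key=lambda i: -score[i])
    return [sentences[i] for i in order]
```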
Introduction (cont’d)
• With the rapid growth of documents,
• There is a need to update existing summaries when new documents arrive
• Traditional methods are not suitable for this task
• Most methods work in batch mode:
• All the documents need to be processed again once new documents arrive, which causes inefficiency
6
Introduction (cont’d)
• This paper aims:
• To integrate document summarization techniques into an incremental hierarchical clustering framework
• To re-organize sentence clusters immediately after new documents arrive, so that the corresponding summaries can be updated efficiently
7
INCREMENTAL HIERARCHICAL CLUSTERING BASED DOCUMENT UPDATE SUMMARIZATION
1. Framework
2. Preprocessing
3. Incremental Hierarchical Sentence Clustering (IHSC)
   I. The COBWEB algorithm
   II. COBWEB for text
4. Representative Sentence Selection for Each Node of the Hierarchy
5. The Algorithm
8
Framework
9
Preprocessing
• Data preprocessing: given a collection of documents
1. Decompose the documents into sentences
2. Remove stop words
3. Perform word stemming
4. Construct the sentence matrix, where each element is a term frequency
10
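The four preprocessing steps can be sketched as below. The stop-word list and the suffix-stripping stemmer are placeholders; a real system would use a full stop list and a proper stemmer such as Porter’s:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}  # tiny sample list

def stem(word):
    # placeholder stemmer: strips a few common suffixes
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(documents):
    # 1. split each document into sentences, 2. drop stop words,
    # 3. stem, 4. build a term-frequency vector per sentence
    sentences, vectors = [], []
    for doc in documents:
        for sent in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if not sent:
                continue
            terms = [stem(w) for w in re.findall(r"[a-z]+", sent.lower())
                     if w not in STOP_WORDS]
            sentences.append(sent)
            vectors.append(Counter(terms))
    return sentences, vectors
```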
Incremental Hierarchical Sentence Clustering (IHSC)
• For the update summarization system, an Incremental Hierarchical Clustering (IHC) method is used
• Benefits of the IHC method:
• It can efficiently process dynamic document collections as new documents are added
• A hierarchy is built to facilitate browsing by users
• The number of clusters does not need to be pre-defined
11
The COBWEB algorithmThe COBWEB algorithm
• Used COBWEB, most popular incremental hierarchical clustering algorithms• Based on the heuristic measures called
Category Utility (CU)
• Clusters • Probability of a document belong to a cluster• Total number of clusters K
12
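The CU formula itself did not survive extraction. The standard definition of category utility from Fisher’s COBWEB, which this measure follows, is:

```latex
CU(C_1,\dots,C_K) \;=\; \frac{1}{K}\sum_{k=1}^{K} P(C_k)
\left[\sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2 \;-\; \sum_i \sum_j P(A_i = V_{ij})^2\right]
```

Here $A_i$ and $V_{ij}$ are the attributes and their values defined on the next slide.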
The COBWEB algorithm (cont’d)
• Ai = the ith attribute of the items being clustered
• Vij = the jth value of the ith attribute
• For example: A1 ∈ {male, female}, A2 ∈ {Red, Green, Blue}; V12 = female, V22 = Green
• Probability-matching guessing strategy: the expected number of times the value of a multinomial variable Ai can be correctly guessed to be Vij for an item in cluster k
• A good cluster, in which the attributes of the items take similar values, will have a high CU value
• COBWEB maximizes the sum of CU scores over all possible assignments of a document to a cluster
13
The COBWEB algorithm (cont’d)
• The COBWEB algorithm can perform four operations:
• Insert: add the sentence into an existing cluster
• Create: create a new cluster
• Merge: combine two clusters into a single cluster
• Split: divide an existing cluster into several clusters
14
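A simplified sketch of how COBWEB chooses among these operations at a node: score each hypothetical partition of a node’s children with a supplied category-utility function `cu` and keep the best. Split is analogous and omitted, and the merge-host-selection heuristic here is an assumption, not the paper’s exact procedure:

```python
def best_operation(children, item, cu):
    # children: list of clusters (each a list of items); cu scores a partition.
    # Evaluate Insert, Create, and Merge by their CU score and return the best.
    candidates = []
    # Insert: try adding the item to each existing cluster in turn
    for i in range(len(children)):
        part = [c + [item] if j == i else c for j, c in enumerate(children)]
        candidates.append((cu(part), ("insert", i), part))
    # Create: put the item in a new singleton cluster
    part = children + [[item]]
    candidates.append((cu(part), ("create", None), part))
    # Merge: combine the two best hosts for the item into one cluster
    if len(children) >= 2:
        a, b = sorted(range(len(children)),
                      key=lambda i: -cu([children[i] + [item]]))[:2]
        merged = children[a] + children[b] + [item]
        part = [c for j, c in enumerate(children) if j not in (a, b)] + [merged]
        candidates.append((cu(part), ("merge", (a, b)), part))
    score, op, part = max(candidates, key=lambda t: t[0])
    return op, part
```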
The COBWEB algorithm (cont’d)
Example:
15
COBWEB for text
• The COBWEB algorithm with a normal attribute distribution is not suitable for text data
• Documents are represented in the “bag of words” model, where terms are the attributes
• The best method is to calculate CU using Katz’s distribution
16
COBWEB for text (cont’d)
• Katz’s model: assume word i occurs k times in a document; then
• Pr(k = 0) = 1 − (df / N), where df = document frequency and N = total number of documents
• p = (cf − df) / cf = Pr(the word repeats | the word occurs), where cf = collection frequency
• Therefore (1 − p) is the probability that an occurring word occurs only once
17
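Assembled from the quantities above, these match the K-mixture form of Katz’s model; the full distribution is stated here as a reconstruction, since the slide’s equations did not survive extraction:

```latex
\Pr(k) \;=\; (1-\alpha)\,\delta_{k,0} \;+\; \alpha\,(1-p)\,p^{k},
\qquad \Pr(0) \;=\; 1 - \frac{df}{N},
\qquad p \;=\; \frac{cf - df}{cf}
```

where $\delta_{k,0} = 1$ if $k = 0$ and $0$ otherwise.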
COBWEB for text (cont’d)
• Substituting with p: for k = 0, δk,0 = 1; adding both terms gives Pr(0) = 1 − αp
• Therefore α = (1 − Pr(0)) / p
18
COBWEB for text (cont’d)
• Where the attribute value f = Vij, this gives the contribution of attribute i towards the category utility of cluster k
19
Representative Sentence Selection for Each Node of the Hierarchy
• For the update summarization system:
• Select the most representative sentences to summarize each node and its subtrees
• Once a new sentence arrives, the sentence hierarchy is changed by one of the four operations
20
Representative Sentence Selection (cont’d)
• Case 1: Insert a sentence into cluster k
• Recalculate the representative sentence Rk of cluster k
• Where
• K: the number of sentences in the cluster
• Sim(): the similarity function between sentence pairs (cosine similarity)
• α: a weighting parameter, set to α = 0.6
21
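The Case 1 selection formula on this slide did not survive extraction. A plausible reconstruction, stated as an assumption, scores each sentence by α times its average similarity to the rest of the cluster plus (1 − α) times its similarity to the query:

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two term-frequency dicts
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def representative(cluster, query, alpha=0.6):
    # Assumed scoring: alpha * average similarity to the cluster's other
    # sentences + (1 - alpha) * similarity to the query; the sentence with
    # the highest score becomes Rk.
    vecs = [Counter(s.lower().split()) for s in cluster]
    qv = Counter(query.lower().split())
    def score(i):
        sims = [cosine(vecs[i], v) for j, v in enumerate(vecs) if j != i]
        avg = sum(sims) / len(sims) if sims else 0.0
        return alpha * avg + (1 - alpha) * cosine(vecs[i], qv)
    return cluster[max(range(len(cluster)), key=score)]
```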
Representative Sentence Selection (cont’d)
• Case 2: Create a new cluster k
• The newly arrived sentence represents the new cluster: Rk = snew
• Case 3: Merge two clusters (cluster a and cluster b) into a new cluster (cluster c)
• The sentence with the higher similarity to the query is selected as the representative sentence of the new merged node
22
Representative Sentence Selection (cont’d)
• Case 4: Split a cluster into a set of clusters
• (cluster a into cluster 1, cluster 2, …, cluster n)
• Remove node a
• Substitute it with the roots of its sub-trees
• The corresponding representative sentences are the representative sentences of the original sub-tree roots
23
The Algorithm
• Input: a query/topic the user is interested in, and a sequence of documents/sentences
1. Read one sentence and check whether it is relevant to the given topic, i.e., checkrelevance(sentence, topic)
24
The Algorithm (cont’d)
2. If relevant: initialize the hierarchy tree with the sentence as the root
Otherwise: discard it, read in the next sentence, and repeat Step 1 until a root node is formed
3. Repeat:
25
The Algorithm (cont’d)
4. Read in the next sentence, starting from the root node
• If the node is a leaf, go to Step 5; otherwise choose one of the following operations with the highest CU score:
1. Insert into a node and conduct Case 1 summarization
2. Create a node and conduct Case 2 summarization
3. Merge nodes and conduct Case 3 summarization
4. Split a node and conduct Case 4 summarization
5. If a leaf node is reached, create a new leaf node, merge the old leaf and the new leaf into a node, and conduct Case 2 and Case 3 summarization
26
The Algorithm (cont’d)
6. Until the stopping condition is satisfied
7. Cut the hierarchy tree at one layer to obtain a summary of the corresponding length
• Output: a sentence hierarchy and the updated summary
27
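The overall loop can be sketched as a flat (non-hierarchical) simplification: filter sentences by topic relevance, then incrementally Insert each one into the closest cluster or Create a new one. The helper names and both thresholds are hypothetical, and the real algorithm drives these choices with CU over a hierarchy rather than fixed similarity cut-offs:

```python
import math
from collections import Counter

def vec(s):
    return Counter(s.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def update_summarize(sentences, topic, rel_threshold=0.1, join_threshold=0.3):
    # Flat simplification of the pipeline: relevance check, then incremental
    # clustering; the first sentence of each cluster serves as its
    # representative (the real system recalculates Rk per Case 1).
    tv = vec(topic)
    clusters = []  # each cluster: list of term-frequency vectors
    reps = []      # representative sentence per cluster
    for s in sentences:
        sv = vec(s)
        if cosine(sv, tv) < rel_threshold:
            continue  # relevance check failed: discard the sentence
        sims = [max(cosine(sv, v) for v in c) for c in clusters]
        if sims and max(sims) >= join_threshold:
            clusters[sims.index(max(sims))].append(sv)  # Insert
        else:
            clusters.append([sv])                        # Create
            reps.append(s)
    return reps  # the updated summary: one representative per cluster
```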
EXPERIMENTS
Data Description
Baselines
Evaluation Measures
Experimental Results
28
Data Description
• Hurricane Wilma Releases (Hurricane)
• 1700 documents divided into 3 phases
• TAC 2008 Update Summarization Track (TAC08)
• A benchmark dataset for update summarization
• 48 topics, with 20 newswire articles in each topic
29
Baselines
• The following widely used multi-document summarization methods were implemented as the baseline systems

| Baseline | Description |
| --- | --- |
| Random | Selects sentences randomly for each document collection |
| Centroid | Extracts sentences according to centroid value, positional value, and first-sentence overlap |
| LexPageRank | Constructs a sentence connectivity graph based on cosine similarity, then selects important sentences based on the concept of eigenvector centrality |
| LSA | Performs latent semantic analysis on the term-by-sentence matrix to select the sentences having the greatest combined weights across all important topics |

30
Evaluation Measures
• The ROUGE toolkit is used to compare system summaries with the human summaries

| Method | Description |
| --- | --- |
| ROUGE-1 | Uses unigrams |
| ROUGE-2 | Uses bigrams |
| ROUGE-L | Uses the longest common subsequence (LCS) |
| ROUGE-SU | Uses skip-bigrams plus unigrams |

• Countmatch(gramn): the maximum number of n-grams co-occurring in a candidate summary and the reference summaries
• Count(gramn): the number of n-grams in the reference summaries

31
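From the two counts above, the standard ROUGE-N score is computed as:

```latex
\mathrm{ROUGE\text{-}N} \;=\;
\frac{\sum_{S \in \{\text{references}\}} \sum_{gram_n \in S} Count_{match}(gram_n)}
     {\sum_{S \in \{\text{references}\}} \sum_{gram_n \in S} Count(gram_n)}
```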
Experimental Results
32
Experimental Results (cont’d)
33
Experimental Results (cont’d)
34
Conclusion
• Traditional methods operate in batch mode and are not suitable for incrementally updating summaries
• Incremental Hierarchical Clustering Based Document Update Summarization:
• Incremental Hierarchical Sentence Clustering (IHSC)
• Uses an algorithm called COBWEB for text
• Can perform the Insert, Create, Merge, and Split operations
• IHSC outperforms the traditional methods and is more efficient
35
THANK YOU!
36