1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1...

19
1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Mili c-Frayling 1 Jozef Stefan Institute Ljubljana, Slovenia 2 Microsoft Research Cambridge, UK Presenter: Yi-Ting Chen

Transcript of 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1...

Page 1: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

1

Learning Sub-structures of Document Semantic Graphs for Document Summarization

1Jure Leskovec, 1Marko Grobelnik, 2Natasa Milic-Frayling1Jozef Stefan Institute Ljubljana, Slovenia

2Microsoft Research Cambridge, UK

Presenter: Yi-Ting Chen

Page 2: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

2

Reference

• Jure Leskovec, Marko Grobelnik, Natasa Milic-Frayling, “Learning Sub-structures of Document Semantic Graphs for Document Summarization”, LinkKDD 2004, August 2004, Seattle, WA, USA.

Page 3: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

3

Outline

• Introduction• Semantic Graph Generation• Learning Semantic Sub-Graphs Using Support Vector

Machines & Experimental• Conclusions

Page 4: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

4

Introduction (1/2)

• In this paper, identification of textual elements for using in extracting summaries are primarily concerned– The assumption that capturing semantic structure of a document

is essential for summarization– Then, to create semantic representations of the document,

visualized as semantic graphs, and learn the model to extract sub-structures that could be used in document extracts

• The basic elements of our semantic graphs are logical form triples, subject-predicate-object

• Experiments show that characteristics of the semantic graphs are most prominent attributes in the learnt SVM model

Page 5: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

5

Introduction (2/2)

• Machine learning algorithms are applied to capture characteristics of human extracted summary sentences

• Elementary syntactic structures from individual sentences in the form of logical form triples are extracted

• Then using semantic properties of nodes in the triples to build semantic graphs for both documents and corresponding summaries

• This paper use logical form triples as basic features and apply Support Vector Machines to learn the summarization model

Page 6: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

6

Semantic Graph Generation (1/7)

• To generating semantic representations of documents and summaries involves five phases:– Deep syntactic analysis– Co-reference resolution– Pronominal reference resolution– Semantic normalization– Semantic graph analysis

Page 7: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

7

Semantic Graph Generation (2/7)

• An example that outlines the main process and output of different stages involved in generating the semantic graph

To perform deep syntactic analysis to the raw text

The analysis applying several methods: co-reference resolution, pronominal anaphora resolution, and semantic normalization

Resulting triples are then linked based on common concepts into a semantic graph

Page 8: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

8

Semantic Graph Generation (3/7)

• Linguistic Analysis– For linguistic analysis of text we use Microsoft’s NLPWin natural l

anguage processing tool– NLPWin first segments the text into individual sentences, convert

s sentence text into a parse tree (Figure2) and then produces a sentence logical form that reflects the meaning (Figure3)

The logical form (triples): “Jure”←”send”→”Marko”

For each node, semantic tags that are assigned by the NLPWin software

Page 9: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

9

Semantic Graph Generation (4/7)

• Co-reference Resolution For Named Entities– In document it is common that terms with different surface forms

refer to the same entity– The algorithm, for example, correctly finds that “Hillary Rodham

Clinton”, “Hillary Clinton”, “Hillary Rodham”, and “Mrs. Clinton” all refer to the same entity

• Pronomial Anaphora Resolution– To focus on resolving only five basic types of pronouns including

all different forms: “he” (his, him, himself), “she”, “I”, “they”, and “who”

– For a given pronoun, to search sequentially through sentences, backward and forward and check the type of named entity in order to find suitable candidates

– Each candidate is scored and among the scored candidate entities to chose the one with the best score

Page 10: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

10

Semantic Graph Generation (5/7)

• Pronomial Anaphora Resolution– The set contained 1506 pronouns of which 1024 (68%) were fro

m the above selected pronoun types– Average accuracy over the 5 selected pronoun types is 82.1%

Page 11: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

11

Semantic Graph Generation (6/7)

• Semantic Normalization– In order to increase the coherence and compactness of the grap

hs, WordNet can be used to establish synonymous relationships among node

– Synonymy information of concepts enables us to find equivalence relations between distinct triples

– For example, Watcher ← follow →moon

spectator← watch →moon

• Construction of a Semantic Graph– To merge the logical form triples on subject and object nodes whi

ch belong to the same normalized semantic class and thus produce semantic graph

– For each node, to calculate in/out degree of a node, PageRank weight, Hubs and Authorities weights and many more

Page 12: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

12

Semantic Graph Generation

Page 13: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

13

Learning Semantic Sub-Graphs Using Support Vector Machines & Experimental

• DUC Dataset– Document Understanding Conference (DUC) 2002 (300

newspaper on 30 different topics)– For each article have a 100 word human written abstract and for

half of them also human extracted sentences– An average article in the data set contains about 1000 words or

50 sentences– About 7.5 sentences are selected into the summary– After applying linguistic processing, on average 82 logical triples

per document with 15 of them contained in extracted summary sentence

• Feature Set – The resulting vector for logical form triples contain about 466

binary and real-valued attributes– 72 of these components have non-zero values, on average

Page 14: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

14

Learning Semantic Sub-Graphs Using Support Vector Machines & Experimental

• Experiment results– Impact of the Training Data Size– Sentence classifiers are trained and tested through 10-fold

cross-validation experiments– Sizes of training data: 10, 20, 50, and 100 documents– A negative impact of small training sets on the generalization of

the model

Page 15: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

15

Learning Semantic Sub-Graphs Using Support Vector Machines & Experimental

• Experiment results– Impact of Different Feature Attributes

Page 16: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

16

Learning Semantic Sub-Graphs Using Support Vector Machines & Experimental

• Experiment results– Impact of Different Feature Attributes– Semantic graph attributes are consistently ranked high among all

the attributes used in the model– Syntactic and semantic properties of the text are important for

summarization

Page 17: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

17

Learning Semantic Sub-Graphs Using Support Vector Machines & Experimental

• Experiment results– Topic Specific Summarization– While the data size for topic specific summarization is small and

thus does not allow generalization

Page 18: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

18

Learning Semantic Sub-Graphs Using Support Vector Machines & Experimental

• Sentence Extraction

Page 19: 1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.

19

Conclusions

• This paper presented a novel approach of document summarization by generating semantic representation of the document and applying machine learning to extract a sub-graph that correspond to the semantic structure of a human extracted document summary

• Compared to human extracted summaries they achieve on average recall of 75% and precision of 30%