The kyutech corpus and topic segmentation using a combined method

Transcript of The kyutech corpus and topic segmentation using a combined method

Page 1: The kyutech corpus and topic segmentation using a combined method

The Kyutech corpus and topic segmentation using a combined method

Takashi Yamamura, Kazutaka Shimada and Shintaro Kawahara

Kyushu Institute of Technology

Page 2: The kyutech corpus and topic segmentation using a combined method

Today’s Topic

▶ 1. Release the Kyutech corpus
  Japanese conversation corpus about a decision-making task
  The first Japanese corpus for summarization
  Freely available to anyone on the web

▶ 2. Evaluate three topic segmentation methods
  Topic segmentation has an important role in meeting summarization.
  Previous studies : LCSeg and TopicTiling
  The combined method based on LCSeg and TopicTiling

Introduction

Page 3: The kyutech corpus and topic segmentation using a combined method

Multi-party Conversation Understanding

▶ Summarization of multi-party conversation
  Useful for understanding the content of a conversation

▶ There are some meeting corpora in English.
  The AMI Corpus (Carletta, 2007)
  The ICSI Corpus (Janin et al., 2003)

▶ Problem
  No Japanese meeting corpus for summarization

Release “the Kyutech corpus”

Background

Page 7: The kyutech corpus and topic segmentation using a combined method

Outline

▶ The Kyutech corpus
  9 conversations
    • 4 scenarios with different settings
  Transcription
    • Transcription of the conversation
  Topic annotation
    • Annotation of topic tags for each utterance
  Reference summary generation
    • Abstractive hand summaries

The Kyutech corpus

Reference Summary

Speaker | Topic | Utterance
A       | Topic | U1
B       | Topic | U2
D       | Topic | U3
C       | Topic | U4
D       | Topic | U5

Summary

Page 8: The kyutech corpus and topic segmentation using a combined method

Task

▶ A decision-making task with four participants
  Choose a new restaurant for a virtual shopping mall
  Discussion based on the document
    • Candidate and existing restaurants in the shopping mall
    • Statistical information about the mall (e.g. target customers)

The Kyutech corpus

[Figures: three candidate restaurants; gender distribution of hourly target customers]

Page 10: The kyutech corpus and topic segmentation using a combined method

Transcription

▶ The transcription of the conversations
  Utterances separated by a 0.2-sec interval
  Annotated with some tags (e.g. filler, falter, question)

The Kyutech corpus

Speaker | Start     | End       | Utterance
D       | 00:24.490 | 00:25.530 | (F ahh), in this condition +
D       | 00:26.585 | 00:27.615 | which one is suitable (Q) /
C       | 00:29.985 | 00:31.195 | I think the ramen is better /
A       | 00:31.815 | 00:33.965 | me too /

(F) : Filler
(Q) : Question

Page 11: The kyutech corpus and topic segmentation using a combined method

Transcription

▶ The transcription of the conversations
  Utterances separated by a 0.2-sec interval
  Annotated with some tags (e.g. filler, falter, question)
  Added tags for sentence-level identification

The Kyutech corpus

Sentence | Speaker | Start     | End       | Utterance
S1       | D       | 00:24.490 | 00:25.530 | (F ahh), in this condition +
S1       | D       | 00:26.585 | 00:27.615 | which one is suitable (Q) /
S2       | C       | 00:29.985 | 00:31.195 | I think the ramen is better /
S3       | A       | 00:31.815 | 00:33.965 | me too /

+ : Links to the next utterance
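The sentence-level grouping on this slide can be sketched in a few lines of Python. This is only an illustration, not the corpus tooling: the `Utterance` class and `group_sentences` function are hypothetical names, and the sketch assumes that a trailing "+" links an utterance to the next one (as the slide states) and that a trailing "/" closes a sentence (an assumption read off the example rows).

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: str
    end: str
    text: str  # transcript text, including the trailing "+" or "/" marker


def group_sentences(utterances):
    """Join utterances linked by a trailing '+' into sentence-level units."""
    sentences, current = [], []
    for utt in utterances:
        text = utt.text.strip()
        # Assumed convention: "+" continues the sentence, "/" ends it.
        marker = text[-1] if text and text[-1] in "+/" else ""
        body = text[:-1].strip() if marker else text
        current.append((utt.speaker, body))
        if marker != "+":
            sentences.append(current)
            current = []
    if current:  # a dangling "+" at the end still forms a unit
        sentences.append(current)
    return sentences


rows = [
    Utterance("D", "00:24.490", "00:25.530", "(F ahh), in this condition +"),
    Utterance("D", "00:26.585", "00:27.615", "which one is suitable (Q) /"),
    Utterance("C", "00:29.985", "00:31.195", "I think the ramen is better /"),
    Utterance("A", "00:31.815", "00:33.965", "me too /"),
]
sentences = group_sentences(rows)
```

With the four example rows, the two utterances by D end up in one unit and the utterances by C and A form one unit each, matching S1 to S3 on the slide.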

Page 13: The kyutech corpus and topic segmentation using a combined method

Topic Annotation

▶ Annotation of topic tags for each utterance
  It is important to consider topics in summarization.
  Topic tags
    • Express a topic of an utterance
    • 28 topic tags created by 4 annotators, including the authors

The Kyutech corpus

CandX | Closed | Exist4 | ClEx      | Area   | Atomos | Access
CandY | Exist1 | Exist5 | Mall      | People | Time   | Meeting
CandZ | Exist2 | Exist6 | OtherMall | Price  | Seat   | Chat
CandS | Exist3 | Exists | Location  | Menu   | Sell   | Vague

Tag groups marked on this slide : the candidate restaurants; the existing or closed restaurants

Page 14: The kyutech corpus and topic segmentation using a combined method

Topic Annotation

The Kyutech corpus

CandX | Closed | Exist4 | ClEx      | Area   | Atomos | Access
CandY | Exist1 | Exist5 | Mall      | People | Time   | Meeting
CandZ | Exist2 | Exist6 | OtherMall | Price  | Seat   | Chat
CandS | Exist3 | Exists | Location  | Menu   | Sell   | Vague

Tag groups marked on this slide : the shopping mall; the details of the restaurant

Page 15: The kyutech corpus and topic segmentation using a combined method

Topic Annotation

The Kyutech corpus

CandX | Closed | Exist4 | ClEx      | Area   | Atomos | Access
CandY | Exist1 | Exist5 | Mall      | People | Time   | Meeting
CandZ | Exist2 | Exist6 | OtherMall | Price  | Seat   | Chat
CandS | Exist3 | Exists | Location  | Menu   | Sell   | Vague

Tag groups marked on this slide :
- The proceedings and final decision
- Not related to the task
- Others and unknown

Page 18: The kyutech corpus and topic segmentation using a combined method

Annotation of Topic Tags

▶ Multiple topic tags for each utterance
  Main tag
    • Essential topic tag : main topic of an utterance
  Additional tag
    • Optional topic tag : more detailed topic within the main topic

Topic Annotation

Discussion about the menu of the existing restaurant

ID | Main   | Addition | Utterance
A  | Exist1 |          | what do you think of “Kaibutsu” (Q) /
C  | Exist1 | Menu     | it has a wide variety on the menu /

Discussion about the existing restaurants from the point of view of the menu

ID | Main | Addition | Utterance
B  | Menu | Exist1   | in the point of view of menu, “Kaibutsu” looks good /
D  | Menu | Exist2   | I wonder “FamilyPlate” looks good, also /

Exist1 : the existing restaurant 1
Exist2 : the existing restaurant 2

Page 19: The kyutech corpus and topic segmentation using a combined method

Process

▶ Topic annotation process
  Step 1 : annotation by 2 annotators
    • Investigate the result of topic annotation by 2 annotators
  Step 2 : final judgment of each tag by 3 authors

The Kyutech corpus

Page 21: The kyutech corpus and topic segmentation using a combined method

Step 1 : Annotation by 2 annotators

▶ Main tag and Additional tag for each utterance
  Each annotator selects at least one suitable topic tag.

▶ The agreement score between 2 annotators
  The rate of utterances for which the tags of the 2 annotators include the same tag
    • 0.879

Topic Annotation

ID | Annotator1 Main | Annotator1 Add | Annotator2 Main | Annotator2 Add | Utterance
A  | Exist4          | Sell           | Exist4          | Sell           | …... "FamilyPlate" made the biggest sale in the restaurants +
D  | Exist4          | Sell           | Exist4          |                | (L uhn) /
A  | Exist4          | Sell           | Meeting         |                | and the restaurant is … +
A  | Exist4          | Sell           | Meeting         |                | the reason, what is the reason (Q) /
D  | Exist4          | Menu           | People          |                | many menus and branches (? Maybe) /
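One plausible reading of this agreement score is the fraction of utterances on which the two annotators share at least one tag. The sketch below follows that reading only; the function name `agreement_rate` is hypothetical, and the exact definition behind the reported 0.879 is not spelled out on the slide.

```python
def agreement_rate(tags_a, tags_b):
    """Fraction of utterances where both annotators chose at least one common tag.

    tags_a and tags_b are lists of tag sets, one set per utterance.
    This overlap-based definition is an assumption, not the authors' stated formula.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must label the same utterances")
    if not tags_a:
        return 0.0
    shared = sum(1 for a, b in zip(tags_a, tags_b) if set(a) & set(b))
    return shared / len(tags_a)


# The five example rows from the slide (main and additional tags merged per annotator).
annotator1 = [{"Exist4", "Sell"}, {"Exist4", "Sell"}, {"Exist4", "Sell"},
              {"Exist4", "Sell"}, {"Exist4", "Menu"}]
annotator2 = [{"Exist4", "Sell"}, {"Exist4"}, {"Meeting"}, {"Meeting"}, {"People"}]
rate = agreement_rate(annotator1, annotator2)
```

On these five rows only the first two share a tag, so the rate is 0.4; the slide's example shows disagreements on purpose, which is why it is far below the corpus-wide figure.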

Page 22: The kyutech corpus and topic segmentation using a combined method

Step 2 : Final judgment of each tag

▶ Determination of the final tags by the authors
  Based on each topic tag from the annotators
  Extension : Main tag and 2 Additional tags

▶ The agreement score
  The rate at which the final tags contain one or more tags from the annotators
    • 0.965

Topic Annotation

ID | Annotator1 Main | Annotator1 Addition | Annotator2 Main | Annotator2 Addition | Final Main | Final Addition 1 | Final Addition 2
D  | Exist4          | Menu                | Exist4          |                     | Exist4     | Menu             |
A  | Exist4          | Menu                | People          |                     | Exist4     | Menu             |
A  | Exist4          | People              | People          |                     | Exist4     | Menu             | People

Final tag : modified topic tags

Page 23: The kyutech corpus and topic segmentation using a combined method

Reference summary

▶ Reference summary of the conversation
  Size : 250 to 500 characters
  Based on the guideline of the AMI corpus
    • Understandable for somebody who was not present during the meeting

The Kyutech corpus

Page 24: The kyutech corpus and topic segmentation using a combined method

Data

▶ Open resources
  9 conversations (total utterances : 4,509)
  Transcription
  Topic annotation
  Reference summaries

▶ Currently unpublished resources
  Questionnaire
  Audio-visual data recording the conversation
    • A four-direction camera and a video camera

The Kyutech corpus

http://www.pluto.ai.kyutech.ac.jp/~shimada/resources.html

Page 26: The kyutech corpus and topic segmentation using a combined method

Outline

▶ Topic segmentation
  Divide the conversation into topic segments
  The first process in conversation summarization
    • It is possible to generate a summary covering all topics.
      - (Banerjee et al., 2015), (Oya et al., 2014)
  Previous studies : LCSeg and TopicTiling

Background

[Diagram: The Kyutech corpus → Topic Segmentation → Topic Segments → Summary Generation → Final Summaries : Summary (Topic1), Summary (Topic2), ..., Summary (TopicN)]

Page 27: The kyutech corpus and topic segmentation using a combined method

LCSeg

▶ Lexical Cohesion Segmentation (Galley et al., 2003)
  Text segmentation method based on lexical cohesion
  Compute cohesion between sentences with lexical chains
    • Lexical chain : chain of the same word

Topic Segmentation

ID | Utterance
C  | I guess “Kaibutsu” is suitable as the new restaurants /
A  | I'm with you, “Kaibutsu” is better /
A  | “Kaibutsu” has a wide variety on the menu /
C  | right, there are many menus /
= = = = = = = SEGMENT = = = = = = =
B  | I guess so, but I suppose Chinese food is better /
D  | I'd prefer to Chinese food /

Segment at the break of lexical chains
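The idea of segmenting where repeated words stop carrying over can be illustrated with a much simplified sketch. It is not the full LCSeg algorithm (which also scores and weights lexical chains); it only compares word counts in a small window on each side of every gap and picks the gap with the lowest cosine similarity. The function names, the window size and the hand-picked content words are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(c1, c2):
    """Cosine similarity between two word-count Counters."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def gap_cohesion(utterances, window=2):
    """Cohesion score for each gap between utterance i-1 and utterance i."""
    scores = []
    for gap in range(1, len(utterances)):
        left = Counter(w for u in utterances[max(0, gap - window):gap] for w in u)
        right = Counter(w for u in utterances[gap:gap + window] for w in u)
        scores.append(cosine(left, right))
    return scores


# Content words picked by hand from the slide's six example utterances.
utts = [
    ["kaibutsu", "suitable", "restaurant"],
    ["kaibutsu", "better"],
    ["kaibutsu", "variety", "menu"],
    ["many", "menu"],
    ["chinese", "food", "better"],
    ["prefer", "chinese", "food"],
]
scores = gap_cohesion(utts)
# Index of the first utterance of the new segment (lowest-cohesion gap).
boundary = min(range(len(scores)), key=scores.__getitem__) + 1
```

In this toy input the lowest cohesion falls between the fourth and fifth utterances, where the "Kaibutsu" and "menu" repetitions end and "Chinese food" begins, which is the break marked SEGMENT on the slide.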

Page 28: The kyutech corpus and topic segmentation using a combined method

TopicTiling

▶ TopicTiling (Riedl and Biemann, 2012)
  Text segmentation method using the LDA topic model
  Topic model
    • Assume that one document has multiple topics
  Latent Dirichlet Allocation (LDA)
    • Estimate the word distributions representing topics

Topic Segmentation

Sentence vectors over topic1, topic2, topic3, ・・・ , topicN :
  = {0.05, 0.10, 0.02, ・・・ , 0.04}
  = {0.20, 0.01, 0.20, ・・・ , 0.07}
  :
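A topic-vector comparison of this kind can be sketched as follows. The sketch assumes that an LDA model has already assigned a topic ID to every word; it then builds one topic-count vector per sentence and compares neighbouring sentences by cosine similarity. The topic IDs below are invented toy values, not output from the Kyutech experiments, and `topic_vector` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def topic_vector(word_topics, num_topics):
    """Count how often each topic ID occurs among the words of one sentence."""
    vec = [0] * num_topics
    for topic_id in word_topics:
        vec[topic_id] += 1
    return vec


# Hypothetical per-word topic IDs for four sentences (4 topics in total).
sentence_topics = [[0, 0, 1], [0, 1, 0], [2, 2, 3], [3, 2, 2]]
vectors = [topic_vector(s, 4) for s in sentence_topics]
sims = [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
# The gap with the lowest similarity is the most likely topic boundary.
weakest_gap = min(range(len(sims)), key=sims.__getitem__)
```

Here the first two and the last two sentences have identical topic profiles, so the only low-similarity gap is the one in the middle.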

Page 29: The kyutech corpus and topic segmentation using a combined method

Combined Method

▶ Combine LCSeg and TopicTiling
  Merge the characteristics of the methods (word, topic model)
  Use the cohesion between sentences of each method
  Compute a new score with a weight factor
    • wf : a trade-off parameter

Topic Segmentation

$$\cos_C(A, B) = \mathit{wf} \times \cos_L(A, B) + (1 - \mathit{wf}) \times \cos_T(A, B)$$

cos_C : Combined, cos_L : LCSeg, cos_T : TopicTiling

Larger wf : increase in the weight of LCSeg
Smaller wf : increase in the weight of TopicTiling
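The weighted combination on this slide is a plain linear interpolation, which a short sketch makes concrete. The function name `combined_cohesion` is hypothetical, and restricting wf to the range 0 to 1 is an assumption that follows from calling it a trade-off parameter.

```python
def combined_cohesion(cos_l, cos_t, wf):
    """Interpolate the LCSeg score (cos_l) and the TopicTiling score (cos_t).

    wf = 1.0 uses only the LCSeg score, wf = 0.0 only the TopicTiling score.
    """
    if not 0.0 <= wf <= 1.0:
        raise ValueError("wf is assumed to lie between 0 and 1")
    return wf * cos_l + (1.0 - wf) * cos_t


# Toy scores for one sentence pair (A, B); the values are illustrative only.
balanced = combined_cohesion(0.8, 0.2, 0.5)
lcseg_only = combined_cohesion(0.8, 0.2, 1.0)
topictiling_only = combined_cohesion(0.8, 0.2, 0.0)
```

Sweeping wf over a grid on held-out development data would be one way to pick its value, which fits the slide's later mention of a development conversation.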

Page 30: The kyutech corpus and topic segmentation using a combined method

Experiment

▶ Data set
  The Kyutech corpus : 8 conversations
    • Excluding one conversation used as the development data

▶ Two criteria
  The F-measure of the complete and partial matching

Topic Segmentation

ID | Main
A  | Meeting
A  | Meeting
A  | Sell
D  | Sell
A  | Sell
D  | Sell
A  | Sell

[Figure labels: The complete matching; The partial matching]
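A boundary-level F-measure can be sketched as below. The slide does not define the complete and partial matching criteria, so this sketch does not attempt to reproduce them: it assumes that a reference boundary lies wherever the main tag changes and scores a prediction only when it hits exactly that position. Both function names are hypothetical.

```python
def boundaries_from_tags(main_tags):
    """Positions i where utterance i starts a new segment (main tag changes)."""
    return {i for i in range(1, len(main_tags)) if main_tags[i] != main_tags[i - 1]}


def f_measure(predicted, reference):
    """Harmonic mean of precision and recall over exact boundary positions."""
    if not predicted or not reference:
        return 0.0
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Main tags of the seven example utterances on the slide.
main = ["Meeting", "Meeting", "Sell", "Sell", "Sell", "Sell", "Sell"]
reference = boundaries_from_tags(main)
# A made-up system output with one correct and one spurious boundary.
score = f_measure({2, 5}, reference)
```

With one correct boundary out of two predictions and one reference boundary, precision is 0.5 and recall is 1.0, giving an F-measure of about 0.667.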

Page 31: The kyutech corpus and topic segmentation using a combined method

Experimental Result

▶ LCSeg > Combined > TopicTiling
  TopicTiling-based methods had low accuracy
    • The Kyutech corpus is too small for statistical methods to work well

Topic Segmentation

[Figure axes: the number of topics in LDA; the value of the weight factor]

Page 32: The kyutech corpus and topic segmentation using a combined method

Summary and Future Work

▶ Summary
  Release the Kyutech corpus
    • The first Japanese conversation corpus for summarization
  Evaluate three topic segmentation methods
    • Combine two different text segmentation methods

▶ Future work
  Scaling up the Kyutech corpus
  Other annotations for summarization
    • Dialogue acts (communicative functions (Bunt, 2000))
  Abstractive summarization using the segmented topics

Conclusion