Roy_p71

August 29, 2022 IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data IBM Research Adding Sentence Boundaries to Conversational Speech Transcriptions using Noisily Labelled Examples Tetsuya Nasukawa, IBM Tokyo Research Lab Diwakar Punjani, IBM India Research Lab Shourya Roy , IBM India Research Lab L V Subramaniam , IBM India Research Lab Hironori Takeuchi, IBM Tokyo Research Lab Presented by : Shourya Roy

Transcript of Roy_p71

Page 1: Roy_p71

April 12, 2023 IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data

IBM Research

Adding Sentence Boundaries to Conversational Speech Transcriptions using Noisily Labelled Examples

Tetsuya Nasukawa, IBM Tokyo Research Lab
Diwakar Punjani, IBM India Research Lab
Shourya Roy, IBM India Research Lab
L V Subramaniam, IBM India Research Lab
Hironori Takeuchi, IBM Tokyo Research Lab

Presented by : Shourya Roy

Page 2: Roy_p71


What are We Trying to do?

Automatically identify sentence boundaries in noisy transcriptions of conversational data
Transcriptions can be manual or automatic (ASR)
The technique can work without any manual supervision
Accuracy improves with manual supervision
Detects only periods, not commas or semicolons

Page 3: Roy_p71


Importance: One Motivating Example from Real Life

Importance of analysis of transcriptions
  A huge amount of telephonic conversational data is produced in various domains such as CRM and BPO
  Analyzing it is important for improving customer satisfaction, agent productivity, and market reputation
  Applying NLP techniques to transcriptions is an obvious approach

Importance of sentence boundary detection for transcription analysis
  Transcriptions are noisy and do not contain any punctuation marks
  POS taggers and syntactic parsers perform poorly in the absence of sentence boundaries

Page 4: Roy_p71


Why Non Trivial

Noise in the dataset
Spontaneous nature of conversation
Variation in style of speaking
Boundary density varies from call to call
  Removing the calls with very low boundary density improves the scores by approx. 10%
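The low-density filtering mentioned above can be sketched as follows. This is only an illustration: the slides do not define boundary density, so the definition used here (marked boundaries per word) and the `min_density` threshold are assumptions, as is the dictionary layout of a call.

```python
def boundary_density(num_boundaries, num_words):
    """Assumed definition: marked sentence boundaries per word in a call."""
    return num_boundaries / num_words if num_words else 0.0


def drop_low_density_calls(calls, min_density=0.05):
    """Keep only calls whose boundary density reaches a (hypothetical) threshold."""
    return [
        call for call in calls
        if boundary_density(call["boundaries"], call["words"]) >= min_density
    ]


calls = [
    {"id": "call-a", "boundaries": 10, "words": 100},  # density 0.1
    {"id": "call-b", "boundaries": 1, "words": 200},   # density 0.005
]
kept = drop_low_density_calls(calls)
print([call["id"] for call in kept])
```

A suitable threshold would have to be chosen empirically for a given corpus.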

Page 5: Roy_p71


Existing Solutions

SBD on conversational data: not much prior work
Existing approaches are based on pause (silence) information

Page 6: Roy_p71


Example: Manual Transcription

64.88 67.59 A: i've i've barely been out of the country. i wouldn't {breath}
65.10 67.16 B: {lipsmack} {breath}
67.64 71.26 A: i think my most memorable trip was when i was in high school.
70.57 71.81 B: {breath} uh-huh.
71.69 74.29 A: i went to %uh ^London and ^Paris.
74.29 75.01 B: %oh that's cool.
74.82 76.80 A: and that's about as exotic as it ever got.
76.75 77.76 B: {breath} was it fun?
77.49 79.95 A: %uh other than that, i haven't been west of ^Texas
80.04 80.44 B: %hm.
81.31 83.63 B: {breath} it looks like you are a east *coaster born and raised.
84.02 86.14 A: yeah. how about yourself? where are you?
86.74 87.38 B: {breath} i'm in ^Philly
87.72 90.78 A: you're in ^Philly, i guess? i wonder if everybody here is in ^Philly? probably. {breath}
88.57 89.01 B: yeah.
90.82 94.68 B: yeah, i think so because it's a ~U ^Penn thing. they probably just did it locally. plus
94.80 96.69 B: %uh are you using an ^Omnipoint phone?
96.82 97.23 A: uh-huh

Slide callouts: Timing, Speaker, Meta Info, Names of Places
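A minimal sketch of how one line of such a manual transcription could be split into the fields called out on the slide (timing, speaker, meta info, names). The marker interpretation is an assumption drawn only from the example: `{...}` is treated as non-speech meta info, `%word` as a filler, `^Word` as a proper name, and `*` and `~` are simply stripped. `Utterance` and `parse_line` are hypothetical names.

```python
import re
from dataclasses import dataclass, field

# start time, end time, speaker label, rest of the line
LINE_RE = re.compile(r"^(\d+\.\d+)\s+(\d+\.\d+)\s+([AB]):\s*(.*)$")


@dataclass
class Utterance:
    start: float
    end: float
    speaker: str
    text: str                      # words with annotation markers removed
    names: list = field(default_factory=list)


def parse_line(line):
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized transcript line: {line!r}")
    start, end, speaker, raw = match.groups()
    names = re.findall(r"\^(\w+)", raw)      # assumed: ^ marks proper names
    text = re.sub(r"\{[^}]*\}", " ", raw)    # assumed: {..} is non-speech meta info
    text = re.sub(r"%\w+", " ", text)        # assumed: %word is a filler
    text = re.sub(r"[\^*~]", "", text)       # strip remaining marker characters
    text = " ".join(text.split())            # normalize whitespace
    return Utterance(float(start), float(end), speaker, text, names)


utt = parse_line("71.69 74.29 A: i went to %uh ^London and ^Paris.")
print(utt.speaker, utt.text, utt.names)
```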

Page 7: Roy_p71


Example : Automatic Transcription

then go to properties ok now once when you go to properties up if you scroll down there that he's having internet protocol ok you have to no i'm sorry just any scroll down that you're having a net firewall so that's no we have to check if there's a check next to it ok if it's not checked you have to get a check that ok and if if you do not so if you are calling you having a check all you have to do is i can check the net firewalls so this ok and you have to go ahead and reboot the system

Page 8: Roy_p71


Example

then go to properties ok now once when you go to properties up if you scroll down there that he's having internet protocol ok you have to no i'm sorry just any scroll down that you're having a net firewall so that's no we have to check if there's a check next to it ok if it's not checked you have to get a check that ok and if if you do not so if you are calling you having a check all you have to do is i can check the net firewalls so this ok and you have to go ahead and reboot the system

Page 9: Roy_p71


Summary of Proposed Technique

From (possibly imprecisely) marked sentence boundaries in conversational data, identify n-grams that are more likely to occur at sentence boundaries than inside a sentence

Mark sentence boundaries before head n-grams and after tail n-grams in the test data

Page 10: Roy_p71


Technique

Preprocessing of data
  Pause-filling words, repetitions, and unclear words are removed
Identify frequent head and tail n-grams from training data, i.e. n-grams that occur at the beginning and end of sentences
Filter out n-grams that also occur a significant number of times in the middle of sentences
  Threshold on the head/tail : middle-of-sentence ratio
Handle interruption and continuation across turns separately
  Words indicating an incomplete turn, e.g. get, and
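The training steps above can be sketched roughly as follows. This is an illustrative reading of the slide, not the authors' implementation: the filler list `FILLERS`, the parameters `n`, `min_count`, and `min_ratio`, and the add-one smoothing in the ratio are all assumptions.

```python
from collections import Counter

FILLERS = {"uh", "um", "hm"}  # hypothetical pause-filling words


def preprocess(tokens):
    """Drop filler words and immediate word repetitions."""
    cleaned = []
    for tok in tokens:
        if tok in FILLERS:
            continue
        if cleaned and cleaned[-1] == tok:
            continue
        cleaned.append(tok)
    return cleaned


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def learn_head_tail(sentences, n=2, min_count=2, min_ratio=2.0):
    """Return (head n-grams, tail n-grams) learned from sentence-split training data."""
    head, tail, middle = Counter(), Counter(), Counter()
    for sentence in sentences:
        grams = ngrams(preprocess(sentence), n)
        if not grams:
            continue
        head[grams[0]] += 1       # n-gram that starts the sentence
        tail[grams[-1]] += 1      # n-gram that ends the sentence
        middle.update(grams[1:-1])

    def keep(counter):
        # Frequent at the boundary, and rare in mid-sentence positions.
        return {
            gram for gram, count in counter.items()
            if count >= min_count and count / (middle[gram] + 1) >= min_ratio
        }

    return keep(head), keep(tail)


training = [
    ["ok", "now", "go", "to", "properties"],
    ["ok", "now", "scroll", "down"],
    ["you", "have", "to", "go", "to", "properties"],
    ["ok", "now", "you", "have", "to", "reboot"],
]
heads, tails = learn_head_tail(training)
print(heads, tails)
```

Because the training boundaries may be noisy, the frequency and ratio thresholds are what keep accidental boundary n-grams out of the head and tail sets.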

Page 11: Roy_p71


Technique (Contd.)

In the test set, mark a boundary before every head n-gram and after every tail n-gram
In the case of boundaries marked based on silence information on ASR data, add new sentence boundaries
If the turn does not end with a word from the set of words indicating an incomplete turn, mark a boundary at the end of the turn
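A minimal sketch of the marking step for a single turn, under stated assumptions: the head and tail sets, the `INCOMPLETE` word set, and the function names are hypothetical, and silence-based boundaries are modeled simply as an optional starting set of positions to which new boundaries are added.

```python
INCOMPLETE = frozenset({"get", "and"})  # example words from the slide


def mark_boundaries(tokens, heads, tails, n=2, silence_cuts=()):
    """Return sorted indices i such that a sentence boundary follows tokens[i]."""
    cuts = set(silence_cuts)                 # start from silence-based boundaries, if any
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in heads and i > 0:
            cuts.add(i - 1)                  # boundary before a head n-gram
        if gram in tails:
            cuts.add(i + n - 1)              # boundary after a tail n-gram
    if tokens and tokens[-1] not in INCOMPLETE:
        cuts.add(len(tokens) - 1)            # close the turn unless it looks incomplete
    return sorted(cuts)


def render(tokens, cuts):
    """Insert a period after every token index listed in cuts."""
    cutset = set(cuts)
    return " ".join(tok + "." if i in cutset else tok for i, tok in enumerate(tokens))


turn = "then go to properties ok now scroll down ok now reboot the system".split()
example_heads = {("ok", "now")}
example_tails = {("to", "properties")}
cuts = mark_boundaries(turn, example_heads, example_tails)
print(render(turn, cuts))
# then go to properties. ok now scroll down. ok now reboot the system.
```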

Page 12: Roy_p71


Nature of Data

Manual Transcriptions
  Switchboard corpus and the Call-home corpus of transcribed phone conversations from LDC
Automatic Transcriptions
  ASR-transcribed calls from IBM helpdesk
  Manually inserted punctuation
  Automatically inserted punctuation based on silence

Data Statistics

Page 13: Roy_p71


Results

Method                                    Precision   Recall   F1     Word Error Rate (WER)
Only Silence                              0.54        0.28     0.37   0.96
Only Head/Tail                            0.78        0.55     0.65   0.60
Head/Tail + Silence                       0.66        0.72     0.68   0.66
Head/Tail + Silence - False Boundaries    0.72        0.69     0.70   0.58

Result of punctuation insertion for helpdesk data

Slide annotations (arrow labels): Increasing, Decreasing
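As a quick sanity check on the results table, F1 can be recomputed from precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The sketch below assumes that definition applies here; since the reported precision and recall are themselves rounded, the recomputed values agree with the reported F1 only to within about 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the helpdesk results table
reported = {
    "Only Silence": (0.54, 0.28, 0.37),
    "Only Head/Tail": (0.78, 0.55, 0.65),
    "Head/Tail + Silence": (0.66, 0.72, 0.68),
    "Head/Tail + Silence - False Boundaries": (0.72, 0.69, 0.70),
}

for method, (p, r, f1) in reported.items():
    print(f"{method}: recomputed {f1_score(p, r):.3f}, reported {f1:.2f}")
```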

Page 14: Roy_p71


Improvement in PoS Tagging

PoS Tagging Accuracy on Helpdesk Data

An example of PoS tagging improving with sentence boundary detection

Ideally, 'i' should be tagged as a pronoun, and 'yeah' and 'oh' as interjections

Page 15: Roy_p71


Improvement in PoS Tagging (Contd.)

Extracted top 10 Noun Phrases from Switchboard Data Set

Page 16: Roy_p71


Summary

Sentence boundary detection is a fundamental operation that must be performed before applying state-of-the-art NLP techniques to (automatic) transcriptions of conversations

We proposed a technique to train a sentence boundary detector with minimal manual supervision

It would be interesting to see how much improvement this yields in an actual extraction task!

Page 17: Roy_p71


Questions?