Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo...

41
THAI BROADCAST NEWS CORPUS CONSTRUCTION AND EVALUATION Markpong Jongtaveesataporn Chai Wutiwiwatchai Koji Iwano Sadaoki Furui Tokyo Institute of Technology, Japan NECTEC, Thailand

Transcript of Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo...

Page 1: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

THAI BROADCAST NEWS CORPUS CONSTRUCTION AND EVALUATION

Markpong Jongtaveesataporn †

Chai Wutiwiwatchai ‡

Koji Iwano †

Sadaoki Furui †

† Tokyo Institute of Technology, Japan ‡NECTEC, Thailand

Page 2: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Background on Thai speech recognition research

2

1987

Isolated syllable recogniti

on

1995

Isolated word

recognition

Connected sub-word

recognition

1999

Small task continuous

speech recognition

2003

LVCSR

2005

Broadcast news

transcription system

2007

Difficulty

Thienlikit et al., 2004• Newspaper read-speech recognition

Page 3: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Development of Thai Broadcast News Transcription System• Research on broadcast news transcription

system for Thai falls behind other languages• English: 1995 (Stern, 1997)• Japanese: 1997 (Matsuoka et al., 1997)• Mandarin: 1998 (Guo et al., 1998)• Italian: 2000 (Federico et al., 2000)

• We need to speed up our research activities to catch up with others

3

Targets

1. Development of Thai broadcast news corpus• Speech corpus: training and testing data• Text corpus: language modeling

2. Development of a prototype system

Page 4: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Speech corpus

Structure information of broadcast news was annotated Section, Speaker’s turn, Segments

Property tags were annotated to each speaker’s turn Speaker’s name, if known Speaker’s gender: male / female Speaking mode: planned / spontaneous Background noise: clean / music / noise

Only speech from announcers speaking in the studio was transcribed

Transcription and annotation was created by one transcriber and checked by another transcriber

4

Page 5: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Episode : one broadcast news session

Structure of broadcast news

5

Section 1 : one news topicSection 1 : one news topic

Section 2

Section 3

Page 6: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Episode : one broadcast news session

Section 1 : one news topic

Structure of broadcast news

5

Speaker’s turn : speaker ASpeaker’s turn : speaker A

Speaker’s turn : speaker B

Speaker’s turn : speaker A

Page 7: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Episode : one broadcast news session

Structure of broadcast news

7

Section 1 : one news topic

Speaker’s turn : speaker A

Segment : one sentence or clause

Segment : one sentence or clause

Segment : one sentence or clause

Page 8: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Speech corpus

Structure information of broadcast news was annotated Section, Speaker’s turn, Segments

Property tags were annotated to each speaker’s turn Speaker’s name, if known Speaker’s gender: male / female Speaking mode: planned / spontaneous Background noise: clean / music / noise

Only speech from announcers speaking in the studio was transcribed

Transcription and annotation was created by one transcriber and checked by another transcriber

8

Page 9: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Episode : one broadcast news session

Example of structure information

9

Section 1 :

Speaker’s turn :

Segment : sentence A

Segment : sentence B

Segment : sentence C

Sports

Mr. A, male, planned speech, clean speech

Page 10: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Speech corpus

Structure information of broadcast news was annotated Section, Speaker’s turn, Segments

Property tags were annotated to each speaker’s turn Speaker’s name, if known Speaker’s gender: male / female Speaking mode: planned / spontaneous Background noise: clean / music / noise

Only speech from announcers speaking in the studio was transcribed

Transcription and annotation was created by one transcriber and checked by another transcriber

10

Page 11: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Text corpus

No structure information was annotated

Additional information Speaking mode: planned / spontaneous

11

Page 12: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Problems of Thai transcription text No space between words Definition of word is very ambiguous No good morphological analyzer Difficulties in transcription and checking process

Manually word-segmented transcription was made Instruction was created for transcribers

Automatically segmented transcription

12

Future

target

Page 13: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Broadcast news collection

News programs from one public TV station in Thailand were recorded

Total of 105 news episodes Speech corpus : 35 news episodes 17

hours Text corpus : 70 news episodes

13

Page 14: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Analysis of speech corpus

14

Back-ground

Mode

Gender female male

planned

sponta-neous

noise clean music

Page 15: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Information of speech & text corpora

Attribute Speech corpusText

corpus

No. of sentences

13k 32k

No. of words 224k 573k

No. of unique words

10k 14k

No. of phonemes

899k -

No. of speakers8 female,

4 male-

15

Page 16: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Data used in experiments Test set data

Randomly selected from the speech corpus 3,000 utterances

Acoustic model training data for the baseline system Phonetically balanced sentence speech corpora

LOTUS (Kasuriya et al., 2003) and the corpus developed internally

Read speech corpora 40.3 hours (68 male and 68 female)

Acoustic model adaptation data Selected from the speech corpus No overlap between adaptation data and test set

data

Language model training data Text corpus + transcript from speech corpus

excluded test set

16

Page 17: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Experimental condition

Acoustic model Gender-dependent acoustic model 12 MFCCs, delta, and delta energy Triphones, 1000 tied-states, 8 Gaussian mixtures

Language model Tri-grams

Dictionary size: about 18k words TITech WFST speech recognition system

(Dixon et al., 2007) was used as a speech decoder

17

Page 18: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Acoustic model adaptation

Supervised adaptation using MLLR F-condition adaptation

F0 : clean, planned F1 : clean, spontaneousF3 : music noise F4 : other noise

Adaptation data: 200 utterances regardless of speaker randomly selected from the speech corpus

Speaker adaptation Adaptation data: 200 utterances regardless of

F-condition randomly selected from the speech corpus

18

Page 19: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

F0 F1 F3 F4 Overall20

24

28

32

36

40

44

48

52

56

60

26.0

43.6

56.4

41.5 38.4

22.3

35.9 38.4

34.2 30.8

21.8

36.9 38.0

31.9 29.1

No adaptation F-condition adapt. Speaker adapt.

WER

(%

)WER results

19

Speaker adaptation yielded

better WER

F-condition

Proportion

Time #words

F0 35.3% 17160

F1 1.0% 629

F3 14.0% 7882

F4 49.7% 27542

Page 20: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Discussion

High WER Mismatch recording condition

The speech corpus was only used as testing and adaptation data

Small text corpus Inefficient language model

20

Page 21: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Conclusion

Construction of the first Thai broadcast news corpus and overview of the corpus analysis was presented

Speech corpus was annotated with structure information which is useful for further research purpose

An LVCSR system was setup and tested with the corpus

21

Page 22: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Future work

Applying our Thai language modeling technique (Jongtaveesataporn et al., 2007) Compound pseudo-morpheme (CPM) unit Pseudo-morpheme error rate (F0 condition)

Manually-segmented word unit system: 20.5% CPM unit system: 19.9%

Improving language model by using newspaper text

Collaboration with NECTEC: additional 50 hours of speech corpus

22

Page 23: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Thank you

23

Page 24: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Thank you

24

Page 25: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Thank you

25

Page 26: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Background

26

1987

Isolated syllable recogniti

on

1995

Isolated word

recognition

Connected sub-word

recognition

1999

Small task continuous

speech recognition

2003

LVCSR

2005

Broadcast

news LVCSR

2007

Difficulty

Thienlikit, 2004• Newspaper read-speech recognition

Page 27: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Development of Thai Broadcast News LVCSR System Development of an LVCSR system requires

speech and text corpora Existing speech corpora for Thai LVCSR

research NECTEC-ATR LOTUS (NECTEC) GlobalPhone (CMU)

27

Newspaper read-speech

1. Development of Thai broadcast news corpus• Speech corpus: training and testing

data• Text corpus: language modeling

2. Development of a prototype of LVCSR system

Page 28: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Experiments & Developed corpora Speech corpus

The size of the speech corpus is still rather small

It was used in three ways Test data Adaptation data A part of transcription text was used for

training LM

Text corpus It was used for training LM

28

Page 29: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Perplexity & OOV rates

F-conditio

n

Perplexity OOV rate

Male Female Male Female

F0 107.5 106.9 0.9 0.8

F1 126.4 100.1 0.9 0.6

F3 145.2 100.0 0.7 0.9

F4 141.6 157.6 1.5 1.9

Overall 126.9 125.6 1.2 1.3

29

Page 30: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Transcription processText corpus transcribing7 persons

Guideline

30

Speech corpus transcribing4 persons

Speech corpus checking2 persons

Lexical entries checking1 person

Speech corpus

Lexical entries checking1 person

Text corpus

Page 31: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Speech corpus

Transcription and annotation of about 17 hours of TV broadcast news

Tool: “Transcriber” (Barras et al., 2001)

Additional information speaker information: name, gender speaking mode: planned/spontaneous

speech Speech from announcers speaking in

the studio31

Page 32: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Transcription conventions

Guideline for the transcription process Segment segmentation Word segmentation Repeating word Thai/English abbreviation Number entity Special tags

32

Page 33: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Introduction

Thai speech processing research in TokyoTech Dialogue system [Whittiwiwattchai, 2003] LVCSR system

Dictation system [Tianlikid,2005] Broadcast news recognition system

33

Page 34: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Overview

Introduction Corpus description Recording and transcription

processes Corpus evaluation Conclusion

34

Page 35: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Thai language corpora

Large language corpora are crucial to a state-of-the-art natural language processing system

Thai speech resources for speech processing NECTEC-ATR LOTUS (NECTEC) GlobalPhone (CMU) TSynC-1 (NECTEC)

35

Newspaper read-speech

Unit-selection speech synthesis

Page 36: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

WER Result

F-conditionTime

proportion

WER (%)

Male Female

F0 28.1% 44.4 40.8

F1 1.5% 62.4 60.2

F3 11.5% 82.2 72.4

F4 58.9% 54.9 57.5

Overall 100% 56.8 45.5

36

Page 37: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Text corpus

Text transcribed from 35 hours of TV broadcast news

Additional information Speaking mode: planned/spontaneous

37

Page 38: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Transcription conventions (1) Sentence segmentation

No sentence marker in Thai language Ambiguous Grammatically, there are 3 types of

sentence Simple sentence Compound sentence Complex sentence

Sentence was defined as a simple sentence or clause with the help of delimited breaths

38

Composed from several of clauses or simple sentences

Page 39: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Transcription conventions (2) Word segmentation

No word boundary marker in Thai language

Lead to difficulties in transcription and data checking processes

Too ambiguous to define all rules A few rules of simple segmentation

patterns were defined Undefined patterns were left to the

decision of transcribers

39

Page 40: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Transcription conventions (3) Repeating word Thai/English abbreviation Number entity Special tags

Disfluencies, filled-pauses, exclamations Foreign words Some other events: uncertainly

transcribed part, etc.

40

Page 41: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Recorded programs

News programs from one public TV station in Thailand was recorded

Total of 105 news episodes Speech corpus

35 news episodes About 17 hours of speech data

Text corpus: 70 news episodes

41