Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo...

Post on 14-Dec-2015


THAI BROADCAST NEWS CORPUS CONSTRUCTION AND EVALUATION

Markpong Jongtaveesataporn †

Chai Wutiwiwatchai ‡

Koji Iwano †

Sadaoki Furui †

† Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand

Background on Thai speech recognition research

Timeline figure, 1987 to 2007 (year ticks on the slide: 1987, 1995, 1999, 2003, 2005, 2007), with tasks ordered by increasing difficulty:

• Isolated syllable recognition
• Isolated word recognition
• Connected sub-word recognition
• Small task continuous speech recognition
• LVCSR
• Broadcast news transcription system

Callout: Thienlikit et al., 2004: newspaper read-speech recognition

Development of Thai Broadcast News Transcription System

• Research on broadcast news transcription systems for Thai falls behind other languages
  • English: 1995 (Stern, 1997)
  • Japanese: 1997 (Matsuoka et al., 1997)
  • Mandarin: 1998 (Guo et al., 1998)
  • Italian: 2000 (Federico et al., 2000)
• We need to speed up our research activities to catch up with others

Targets

1. Development of Thai broadcast news corpus
   • Speech corpus: training and testing data
   • Text corpus: language modeling

2. Development of a prototype system

Speech corpus

• Structure information of broadcast news was annotated: section, speaker's turn, segments
• Property tags were annotated for each speaker's turn
  • Speaker's name, if known
  • Speaker's gender: male / female
  • Speaking mode: planned / spontaneous
  • Background noise: clean / music / noise
• Only speech from announcers speaking in the studio was transcribed
• Transcription and annotation were created by one transcriber and checked by another transcriber

Structure of broadcast news

Episode: one broadcast news session
  Section: one news topic (e.g. Section 1, Section 2, Section 3)
    Speaker's turn: one speaker (e.g. speaker A, speaker B, speaker A)
      Segment: one sentence or clause


Example of structure information

Episode: one broadcast news session
  Section 1: Sports
    Speaker's turn: Mr. A, male, planned speech, clean speech
      Segment: sentence A
      Segment: sentence B
      Segment: sentence C
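
The episode / section / speaker's turn / segment hierarchy and the property tags can be held in a small nested data structure. The following is only a sketch: the class and field names are hypothetical and are not taken from the corpus files or from the annotation tool.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Segment:
    text: str  # one sentence or clause


@dataclass
class SpeakerTurn:
    gender: str                    # "male" or "female"
    mode: str                      # "planned" or "spontaneous"
    background: str                # "clean", "music", or "noise"
    speaker: Optional[str] = None  # speaker's name, if known
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Section:
    topic: str  # one news topic
    turns: List[SpeakerTurn] = field(default_factory=list)


@dataclass
class Episode:
    sections: List[Section] = field(default_factory=list)

    def iter_segments(self) -> Iterator[Tuple[Section, SpeakerTurn, Segment]]:
        """Walk the hierarchy and yield every segment with its context."""
        for section in self.sections:
            for turn in section.turns:
                for segment in turn.segments:
                    yield section, turn, segment


# The example from the slide: one Sports section, one turn by Mr. A.
episode = Episode(sections=[
    Section(topic="Sports", turns=[
        SpeakerTurn(
            speaker="Mr. A", gender="male", mode="planned", background="clean",
            segments=[Segment("sentence A"), Segment("sentence B"), Segment("sentence C")],
        ),
    ]),
])

for section, turn, segment in episode.iter_segments():
    print(section.topic, turn.speaker, turn.mode, turn.background, segment.text)
```

Keeping the property tags on the turn rather than on each segment mirrors the annotation scheme, where tags are attached to each speaker's turn.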


Text corpus

No structure information was annotated

Additional information
• Speaking mode: planned / spontaneous

Problems of Thai transcription text
• No space between words
• Definition of a word is very ambiguous
• No good morphological analyzer
• Difficulties in the transcription and checking process

Manually word-segmented transcription was made
• Instructions were created for transcribers

Automatically segmented transcription (future target)

Broadcast news collection

News programs from one public TV station in Thailand were recorded

Total of 105 news episodes
• Speech corpus: 35 news episodes, 17 hours
• Text corpus: 70 news episodes

Analysis of speech corpus

[Figure: breakdown of the speech corpus by gender (female / male), speaking mode (planned / spontaneous), and background (noise / clean / music)]

Information of speech & text corpora

Attribute             Speech corpus      Text corpus
No. of sentences      13k                32k
No. of words          224k               573k
No. of unique words   10k                14k
No. of phonemes       899k               -
No. of speakers       8 female, 4 male   -

Data used in experiments

Test set data
• Randomly selected from the speech corpus
• 3,000 utterances

Acoustic model training data for the baseline system
• Phonetically balanced sentence speech corpora: LOTUS (Kasuriya et al., 2003) and the corpus developed internally
• Read speech corpora
• 40.3 hours (68 male and 68 female)

Acoustic model adaptation data
• Selected from the speech corpus
• No overlap between adaptation data and test set data

Language model training data
• Text corpus + transcripts from the speech corpus, excluding the test set
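
The no-overlap constraints above can be made explicit in a small partitioning routine. This is a sketch under stated assumptions: the function name, the utterance IDs, the fixed seed, and the corpus size of 13,000 (rounded from the "13k" sentence count) are all hypothetical, and the slides do not say how the random selection was actually implemented.

```python
import random


def split_speech_corpus(utterance_ids, n_test=3000, n_adapt=200, seed=0):
    """Randomly partition utterances so that the test set never overlaps
    with adaptation data or with transcripts used for language modeling."""
    rng = random.Random(seed)          # fixed seed, for a reproducible split
    shuffled = list(utterance_ids)
    rng.shuffle(shuffled)

    test = shuffled[:n_test]           # randomly selected test utterances
    remaining = shuffled[n_test:]      # everything that is not test data
    adapt = remaining[:n_adapt]        # adaptation data, disjoint from test
    lm_transcripts = remaining         # transcripts usable for LM training
    return test, adapt, lm_transcripts


# Hypothetical IDs; 13000 approximates the "13k" sentences in the speech corpus.
test, adapt, lm_transcripts = split_speech_corpus(range(13000))
print(len(test), len(adapt), len(lm_transcripts))
```

Drawing the adaptation data from what remains after the test set is removed is one simple way to guarantee that the two sets are disjoint.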

Experimental condition

Acoustic model
• Gender-dependent acoustic model
• 12 MFCCs, delta, and delta energy
• Triphones, 1000 tied-states, 8 Gaussian mixtures

Language model
• Tri-grams

Dictionary size: about 18k words

TITech WFST speech recognition system (Dixon et al., 2007) was used as a speech decoder

Acoustic model adaptation

Supervised adaptation using MLLR

F-condition adaptation
• F0: clean, planned
• F1: clean, spontaneous
• F3: music noise
• F4: other noise
• Adaptation data: 200 utterances, randomly selected from the speech corpus regardless of speaker

Speaker adaptation
• Adaptation data: 200 utterances, randomly selected from the speech corpus regardless of F-condition
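
The F-condition labels follow directly from the speaking-mode and background tags of a speaker's turn, so the selection of F-condition adaptation data can be sketched as below. The function names and the dictionary keys `mode` and `background` are hypothetical. Treating music and noise backgrounds as F3 and F4 regardless of speaking mode is an assumption read off the slide, not a documented rule.

```python
import random


def f_condition(mode, background):
    """Map a turn's property tags to an F-condition label."""
    if mode not in ("planned", "spontaneous"):
        raise ValueError(f"unknown speaking mode: {mode}")
    if background == "clean":
        return "F0" if mode == "planned" else "F1"
    if background == "music":
        return "F3"
    if background == "noise":
        return "F4"
    raise ValueError(f"unknown background: {background}")


def pick_adaptation_data(utterances, condition, n=200, seed=0):
    """Randomly pick up to n utterances of one F-condition, ignoring speaker."""
    pool = [u for u in utterances
            if f_condition(u["mode"], u["background"]) == condition]
    rng = random.Random(seed)
    return rng.sample(pool, min(n, len(pool)))


# Hypothetical utterance records carrying only the tags needed here.
utterances = (
    [{"id": i, "mode": "planned", "background": "clean"} for i in range(300)]
    + [{"id": 300 + i, "mode": "planned", "background": "music"} for i in range(50)]
)
f0_data = pick_adaptation_data(utterances, "F0")
f3_data = pick_adaptation_data(utterances, "F3")
print(len(f0_data), len(f3_data))
```

Speaker adaptation would use the same sampling step with the pool filtered by speaker instead of by F-condition.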

WER results (%)

Condition                F0     F1     F3     F4     Overall
No adaptation            26.0   43.6   56.4   41.5   38.4
F-condition adaptation   22.3   35.9   38.4   34.2   30.8
Speaker adaptation       21.8   36.9   38.0   31.9   29.1

Speaker adaptation yielded better WER

F-condition   Time proportion   #words
F0            35.3%             17160
F1            1.0%              629
F3            14.0%             7882
F4            49.7%             27542
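
For reference, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between reference and hypothesis, divided by the number of reference words. The sketch below shows that standard computation. The function names are hypothetical, and this is not the scoring tool used for the reported results.

```python
def word_errors(reference, hypothesis):
    """Minimum number of substitutions, deletions, and insertions needed
    to turn the reference word list into the hypothesis word list."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, 1):
        cur = [i]
        for j, hyp_word in enumerate(hypothesis, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer(pairs):
    """WER in percent over (reference, hypothesis) pairs, pooled over all words."""
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    n_ref_words = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / n_ref_words


# Toy example with one substitution (b -> x) and one deletion (d).
pairs = [("a b c d".split(), "a x c".split())]
print(wer(pairs))  # 50.0
```

Because errors are pooled over all reference words, an overall figure computed this way is weighted by word counts per condition rather than by time proportion.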

Discussion

High WER
• Mismatched recording conditions: the speech corpus was used only as testing and adaptation data
• Small text corpus: inefficient language model

Conclusion

• Construction of the first Thai broadcast news corpus and an overview of the corpus analysis were presented
• The speech corpus was annotated with structure information, which is useful for further research
• An LVCSR system was set up and tested with the corpus

Future work

• Applying our Thai language modeling technique (Jongtaveesataporn et al., 2007)
  • Compound pseudo-morpheme (CPM) unit
  • Pseudo-morpheme error rate (F0 condition)
    • Manually-segmented word unit system: 20.5%
    • CPM unit system: 19.9%
• Improving the language model by using newspaper text
• Collaboration with NECTEC: additional 50 hours of speech corpus

Thank you


Background

Timeline figure, 1987 to 2007 (year ticks on the slide: 1987, 1995, 1999, 2003, 2005, 2007), with tasks ordered by increasing difficulty:

• Isolated syllable recognition
• Isolated word recognition
• Connected sub-word recognition
• Small task continuous speech recognition
• LVCSR
• Broadcast news LVCSR

Callout: Thienlikit, 2004: newspaper read-speech recognition

Development of Thai Broadcast News LVCSR System

• Development of an LVCSR system requires speech and text corpora
• Existing speech corpora for Thai LVCSR research (callout: newspaper read-speech)
  • NECTEC-ATR
  • LOTUS (NECTEC)
  • GlobalPhone (CMU)

1. Development of Thai broadcast news corpus
   • Speech corpus: training and testing data
   • Text corpus: language modeling

2. Development of a prototype LVCSR system

Experiments & Developed corpora

Speech corpus
• The size of the speech corpus is still rather small
• It was used in three ways
  • Test data
  • Adaptation data
  • A part of the transcription text was used for training the LM

Text corpus
• It was used for training the LM

Perplexity & OOV rates

F-condition   Perplexity (male)   Perplexity (female)   OOV rate (male)   OOV rate (female)
F0            107.5               106.9                 0.9               0.8
F1            126.4               100.1                 0.9               0.6
F3            145.2               100.0                 0.7               0.9
F4            141.6               157.6                 1.5               1.9
Overall       126.9               125.6                 1.2               1.3
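
Both measures in the table have standard definitions: the OOV rate is the share of test-set word tokens missing from the recognition dictionary, and perplexity is the inverse geometric mean of the per-word probabilities assigned by the language model. The sketch below uses hypothetical function names and assumes base-10 log probabilities as input. It is not the toolkit that produced the figures above, and the slide does not state whether OOV words were included in its perplexity computation.

```python
import math


def oov_rate(test_words, vocabulary):
    """Percentage of test-set word tokens that are not in the vocabulary."""
    n_oov = sum(1 for word in test_words if word not in vocabulary)
    return 100.0 * n_oov / len(test_words)


def perplexity(log10_probs):
    """Perplexity from per-word base-10 log probabilities:
    10 ** (-(1/N) * sum of log10 P(w_i | history))."""
    average = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-average)


# Toy data: one of four tokens is out of vocabulary.
print(oov_rate(["a", "b", "c", "d"], {"a", "b", "c"}))  # 25.0

# A model that gives every word probability 0.01 has a perplexity of about 100.
print(perplexity([math.log10(0.01)] * 5))
```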

Transcription process

Guideline

Speech corpus
• Speech corpus transcribing: 4 persons
• Speech corpus checking: 2 persons
• Lexical entries checking: 1 person

Text corpus
• Text corpus transcribing: 7 persons
• Lexical entries checking: 1 person

Speech corpus

• Transcription and annotation of about 17 hours of TV broadcast news
• Tool: “Transcriber” (Barras et al., 2001)
• Additional information
  • Speaker information: name, gender
  • Speaking mode: planned / spontaneous speech
• Speech from announcers speaking in the studio

Transcription conventions

Guideline for the transcription process
• Segment segmentation
• Word segmentation
• Repeating word
• Thai/English abbreviation
• Number entity
• Special tags

Introduction

Thai speech processing research at Tokyo Tech
• Dialogue system [Whittiwiwattchai, 2003]
• LVCSR system
  • Dictation system [Tianlikid, 2005]
  • Broadcast news recognition system

Overview

• Introduction
• Corpus description
• Recording and transcription processes
• Corpus evaluation
• Conclusion

Thai language corpora

• Large language corpora are crucial to a state-of-the-art natural language processing system
• Thai speech resources for speech processing
  • NECTEC-ATR
  • LOTUS (NECTEC)
  • GlobalPhone (CMU)
  • TSynC-1 (NECTEC)

Callouts: newspaper read-speech; unit-selection speech synthesis

WER result

F-condition   Time proportion   WER (%) male   WER (%) female
F0            28.1%             44.4           40.8
F1            1.5%              62.4           60.2
F3            11.5%             82.2           72.4
F4            58.9%             54.9           57.5
Overall       100%              56.8           45.5

Text corpus

• Text transcribed from 35 hours of TV broadcast news
• Additional information
  • Speaking mode: planned / spontaneous

Transcription conventions (1): Sentence segmentation

• There is no sentence marker in the Thai language, so sentence boundaries are ambiguous
• Grammatically, there are 3 types of sentence
  • Simple sentence
  • Compound sentence
  • Complex sentence
• Callout: composed of several clauses or simple sentences
• A sentence was defined as a simple sentence or clause, delimited with the help of breaths

Transcription conventions (2): Word segmentation

• There is no word boundary marker in the Thai language
• This leads to difficulties in the transcription and data checking processes
• Too ambiguous to define all rules
  • A few rules for simple segmentation patterns were defined
  • Undefined patterns were left to the decision of transcribers
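
The word-boundary ambiguity can be illustrated with greedy longest matching against a lexicon, a common dictionary-based baseline for segmenting unspaced text. This sketch is not the procedure used for the corpus, which was segmented manually. The function name and the four-word lexicon are hypothetical, chosen only because the string ตากลม can be read either as ตา + กลม or as ตาก + ลม.

```python
def longest_match_segment(text, lexicon):
    """Greedy left-to-right segmentation: at each position take the longest
    lexicon entry that matches; unknown characters become single units."""
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                break
        else:
            candidate = text[i]  # no entry matched: emit one character
        words.append(candidate)
        i += len(candidate)
    return words


# Hypothetical toy lexicon; both segmentations of the input are made of valid words.
lexicon = {"ตา", "กลม", "ตาก", "ลม"}
print(longest_match_segment("ตากลม", lexicon))  # ['ตาก', 'ลม']
```

The greedy choice returns ตาก + ลม even where ตา + กลม is intended, which shows why a fixed rule set cannot settle every case and some patterns had to be left to the transcribers' judgment.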

Transcription conventions (3)

• Repeating word
• Thai/English abbreviation
• Number entity
• Special tags
  • Disfluencies, filled-pauses, exclamations
  • Foreign words
  • Some other events: uncertainly transcribed part, etc.

Recorded programs

• News programs from one public TV station in Thailand were recorded
• Total of 105 news episodes
  • Speech corpus: 35 news episodes, about 17 hours of speech data
  • Text corpus: 70 news episodes