Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo...
-
Upload
estefania-hanly -
Category
Documents
-
view
214 -
download
1
Transcript of Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo...
THAI BROADCAST NEWS CORPUS CONSTRUCTION AND EVALUATION
Markpong Jongtaveesataporn †
Chai Wutiwiwatchai ‡
Koji Iwano †
Sadaoki Furui †
† Tokyo Institute of Technology, Japan ‡NECTEC, Thailand
Background on Thai speech recognition research
2
1987
Isolated syllable recogniti
on
1995
Isolated word
recognition
Connected sub-word
recognition
1999
Small task continuous
speech recognition
2003
LVCSR
2005
Broadcast news
transcription system
2007
Difficulty
Thienlikit et al., 2004• Newspaper read-speech recognition
Development of Thai Broadcast News Transcription System• Research on broadcast news transcription
system for Thai falls behind other languages• English: 1995 (Stern, 1997)• Japanese: 1997 (Matsuoka et al., 1997)• Mandarin: 1998 (Guo et al., 1998)• Italian: 2000 (Federico et al., 2000)
• We need to speed up our research activities to catch up with others
3
Targets
1. Development of Thai broadcast news corpus• Speech corpus: training and testing data• Text corpus: language modeling
2. Development of a prototype system
Speech corpus
Structure information of broadcast news was annotated Section, Speaker’s turn, Segments
Property tags were annotated to each speaker’s turn Speaker’s name, if known Speaker’s gender: male / female Speaking mode: planned / spontaneous Background noise: clean / music / noise
Only speech from announcers speaking in the studio was transcribed
Transcription and annotation was created by one transcriber and checked by another transcriber
4
Episode : one broadcast news session
Structure of broadcast news
5
Section 1 : one news topicSection 1 : one news topic
Section 2
Section 3
Episode : one broadcast news session
Section 1 : one news topic
Structure of broadcast news
5
Speaker’s turn : speaker ASpeaker’s turn : speaker A
Speaker’s turn : speaker B
Speaker’s turn : speaker A
Episode : one broadcast news session
Structure of broadcast news
7
Section 1 : one news topic
Speaker’s turn : speaker A
Segment : one sentence or clause
Segment : one sentence or clause
Segment : one sentence or clause
Speech corpus
Structure information of broadcast news was annotated Section, Speaker’s turn, Segments
Property tags were annotated to each speaker’s turn Speaker’s name, if known Speaker’s gender: male / female Speaking mode: planned / spontaneous Background noise: clean / music / noise
Only speech from announcers speaking in the studio was transcribed
Transcription and annotation was created by one transcriber and checked by another transcriber
8
Episode : one broadcast news session
Example of structure information
9
Section 1 :
Speaker’s turn :
Segment : sentence A
Segment : sentence B
Segment : sentence C
Sports
Mr. A, male, planned speech, clean speech
Speech corpus
Structure information of broadcast news was annotated Section, Speaker’s turn, Segments
Property tags were annotated to each speaker’s turn Speaker’s name, if known Speaker’s gender: male / female Speaking mode: planned / spontaneous Background noise: clean / music / noise
Only speech from announcers speaking in the studio was transcribed
Transcription and annotation was created by one transcriber and checked by another transcriber
10
Text corpus
No structure information was annotated
Additional information Speaking mode: planned / spontaneous
11
Problems of Thai transcription text No space between words Definition of word is very ambiguous No good morphological analyzer Difficulties in transcription and checking process
Manually word-segmented transcription was made Instruction was created for transcribers
Automatically segmented transcription
12
Future
target
Broadcast news collection
News programs from one public TV station in Thailand were recorded
Total of 105 news episodes Speech corpus : 35 news episodes 17
hours Text corpus : 70 news episodes
13
Analysis of speech corpus
14
Back-ground
Mode
Gender female male
planned
sponta-neous
noise clean music
Information of speech & text corpora
Attribute Speech corpusText
corpus
No. of sentences
13k 32k
No. of words 224k 573k
No. of unique words
10k 14k
No. of phonemes
899k -
No. of speakers8 female,
4 male-
15
Data used in experiments Test set data
Randomly selected from the speech corpus 3,000 utterances
Acoustic model training data for the baseline system Phonetically balanced sentence speech corpora
LOTUS (Kasuriya et al., 2003) and the corpus developed internally
Read speech corpora 40.3 hours (68 male and 68 female)
Acoustic model adaptation data Selected from the speech corpus No overlap between adaptation data and test set
data
Language model training data Text corpus + transcript from speech corpus
excluded test set
16
Experimental condition
Acoustic model Gender-dependent acoustic model 12 MFCCs, delta, and delta energy Triphones, 1000 tied-states, 8 Gaussian mixtures
Language model Tri-grams
Dictionary size: about 18k words TITech WFST speech recognition system
(Dixon et al., 2007) was used as a speech decoder
17
Acoustic model adaptation
Supervised adaptation using MLLR F-condition adaptation
F0 : clean, planned F1 : clean, spontaneousF3 : music noise F4 : other noise
Adaptation data: 200 utterances regardless of speaker randomly selected from the speech corpus
Speaker adaptation Adaptation data: 200 utterances regardless of
F-condition randomly selected from the speech corpus
18
F0 F1 F3 F4 Overall20
24
28
32
36
40
44
48
52
56
60
26.0
43.6
56.4
41.5 38.4
22.3
35.9 38.4
34.2 30.8
21.8
36.9 38.0
31.9 29.1
No adaptation F-condition adapt. Speaker adapt.
WER
(%
)WER results
19
Speaker adaptation yielded
better WER
F-condition
Proportion
Time #words
F0 35.3% 17160
F1 1.0% 629
F3 14.0% 7882
F4 49.7% 27542
Discussion
High WER Mismatch recording condition
The speech corpus was only used as testing and adaptation data
Small text corpus Inefficient language model
20
Conclusion
Construction of the first Thai broadcast news corpus and overview of the corpus analysis was presented
Speech corpus was annotated with structure information which is useful for further research purpose
An LVCSR system was setup and tested with the corpus
21
Future work
Applying our Thai language modeling technique (Jongtaveesataporn et al., 2007) Compound pseudo-morpheme (CPM) unit Pseudo-morpheme error rate (F0 condition)
Manually-segmented word unit system: 20.5% CPM unit system: 19.9%
Improving language model by using newspaper text
Collaboration with NECTEC: additional 50 hours of speech corpus
22
Thank you
23
Thank you
24
Thank you
25
Background
26
1987
Isolated syllable recogniti
on
1995
Isolated word
recognition
Connected sub-word
recognition
1999
Small task continuous
speech recognition
2003
LVCSR
2005
Broadcast
news LVCSR
2007
Difficulty
Thienlikit, 2004• Newspaper read-speech recognition
Development of Thai Broadcast News LVCSR System Development of an LVCSR system requires
speech and text corpora Existing speech corpora for Thai LVCSR
research NECTEC-ATR LOTUS (NECTEC) GlobalPhone (CMU)
27
Newspaper read-speech
1. Development of Thai broadcast news corpus• Speech corpus: training and testing
data• Text corpus: language modeling
2. Development of a prototype of LVCSR system
Experiments & Developed corpora Speech corpus
The size of the speech corpus is still rather small
It was used in three ways Test data Adaptation data A part of transcription text was used for
training LM
Text corpus It was used for training LM
28
Perplexity & OOV rates
F-conditio
n
Perplexity OOV rate
Male Female Male Female
F0 107.5 106.9 0.9 0.8
F1 126.4 100.1 0.9 0.6
F3 145.2 100.0 0.7 0.9
F4 141.6 157.6 1.5 1.9
Overall 126.9 125.6 1.2 1.3
29
Transcription processText corpus transcribing7 persons
Guideline
30
Speech corpus transcribing4 persons
Speech corpus checking2 persons
Lexical entries checking1 person
Speech corpus
Lexical entries checking1 person
Text corpus
Speech corpus
Transcription and annotation of about 17 hours of TV broadcast news
Tool: “Transcriber” (Barras et al., 2001)
Additional information speaker information: name, gender speaking mode: planned/spontaneous
speech Speech from announcers speaking in
the studio31
Transcription conventions
Guideline for the transcription process Segment segmentation Word segmentation Repeating word Thai/English abbreviation Number entity Special tags
32
Introduction
Thai speech processing research in TokyoTech Dialogue system [Whittiwiwattchai, 2003] LVCSR system
Dictation system [Tianlikid,2005] Broadcast news recognition system
33
Overview
Introduction Corpus description Recording and transcription
processes Corpus evaluation Conclusion
34
Thai language corpora
Large language corpora are crucial to a state-of-the-art natural language processing system
Thai speech resources for speech processing NECTEC-ATR LOTUS (NECTEC) GlobalPhone (CMU) TSynC-1 (NECTEC)
35
Newspaper read-speech
Unit-selection speech synthesis
WER Result
F-conditionTime
proportion
WER (%)
Male Female
F0 28.1% 44.4 40.8
F1 1.5% 62.4 60.2
F3 11.5% 82.2 72.4
F4 58.9% 54.9 57.5
Overall 100% 56.8 45.5
36
Text corpus
Text transcribed from 35 hours of TV broadcast news
Additional information Speaking mode: planned/spontaneous
37
Transcription conventions (1) Sentence segmentation
No sentence marker in Thai language Ambiguous Grammatically, there are 3 types of
sentence Simple sentence Compound sentence Complex sentence
Sentence was defined as a simple sentence or clause with the help of delimited breaths
38
Composed from several of clauses or simple sentences
Transcription conventions (2) Word segmentation
No word boundary marker in Thai language
Lead to difficulties in transcription and data checking processes
Too ambiguous to define all rules A few rules of simple segmentation
patterns were defined Undefined patterns were left to the
decision of transcribers
39
Transcription conventions (3) Repeating word Thai/English abbreviation Number entity Special tags
Disfluencies, filled-pauses, exclamations Foreign words Some other events: uncertainly
transcribed part, etc.
40
Recorded programs
News programs from one public TV station in Thailand was recorded
Total of 105 news episodes Speech corpus
35 news episodes About 17 hours of speech data
Text corpus: 70 news episodes
41