The Sheffield Wargames Corpus - Day Two and Day Three

Yulan Liu1†, Charles Fox2, Madina Hasen1, Thomas Hain1‡

1 MINI, SpandH, The University of Sheffield, UK
2 The University of Leeds, UK

[email protected] [email protected]

Introduction
• Speech recognition of natural conversation in natural environments is of considerable current interest.
• It remains challenging in practice, particularly with far-field recording, due to overlapping speech, reverberation, background noise, speaker motion and informal speech patterns.
• Most existing speech corpora lack informal natural speech with movement, and only limited data is available that contains high-quality near-field and far-field recordings from real interactions among participants.
• The first recording of the Sheffield Wargames Corpus (SWC1), based on a social scenario where native English speakers play a table-top game named Warhammer, collected 8.0h of natural speech data for research on speech recognition, speaker tracking and diarisation.
• The Day 2 and Day 3 recordings collect 16.6h of annotated data (SWC2, SWC3), with 6.1h being female speech. All three recordings make a 24.6h annotated database in total.
• A Kaldi recipe is provided for standalone training using datasets defined over all three recordings. An in-domain LM is built with blog data, wiki data and conversational meeting data. Baseline results are reported for both standalone training and adaptation.

[Figure: recording room layout with wall microphones WALL-01 to WALL-04, grid microphones GRID-01 to GRID-08, table TBL1, and three cameras: West axis camera (C1), North PTZ axis camera (C2), East axis camera (C3).]

http://mini-vm20.dcs.shef.ac.uk/swc/SWC-home.html

SWC Statistics

                     SWC1    SWC2    SWC3    overall
#session             10      8       6       24
#game                4       4       3       11
#annotated speaker   9       11      8       22
gender               M       M       F&M     F&M
#unique mic          96      71      24      103
#shared mic          -       -       -       24
annotated speech     8.0h    10.5h   6.1h    24.6h
#speech utt.         14.0k   15.4k   10.2k   39.6k
duration per utt.    2.1s    2.5s    2.2s    2.2s
#word per utt.       6.6     7.9     5.5     6.8
vocabulary           4.4k    5.7k    2.9k    8.5k
video                √       √       -
location tracking    √       √       √       √

• Statistics of SWC1, SWC2 and SWC3.
• The vocabulary of SWC3 is much smaller than that of SWC1 and SWC2.
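As a rough cross-check, the per-day utterance counts and mean durations in the table imply approximately the reported total of annotated speech (the means are rounded to 0.1s, so exact agreement is not expected):

```python
# Consistency check: total annotated hours implied by
# (#utterances x mean utterance duration) per recording day.
utts = {"SWC1": 14_000, "SWC2": 15_400, "SWC3": 10_200}
mean_dur_s = {"SWC1": 2.1, "SWC2": 2.5, "SWC3": 2.2}

total_h = sum(utts[k] * mean_dur_s[k] for k in utts) / 3600
# ~25.1h, close to the reported 24.6h given rounded per-day means.
```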

[Plot: speaker location distribution, XY view (axes in metres), for speakers mn0001, mn0007, mn0011 and mn0013.]

Task and Dataset

Each recording file (session) is split into three strips (A, B, C) with an equal amount of annotated speech.

task                 set     strips          dur.    #utt.   #spk.
standalone-1 (SA1)   train   1, {2, 3}.A     13.5h   22.6k   22
                     dev     {2, 3}.B        5.5h    8.5k    18
                     eval    {2, 3}.C        5.6h    8.4k    18
standalone-2 (SA2)   train   1               8.0h    14.0k   9
                     dev     {2, 3}.A        5.5h    8.7k    18
                     eval    {2, 3}.B+C      11.1h   16.9k   18
adapt-1 (AD1)        dev     {1, 2, 3}.A+B   16.3h   26.2k   22
                     eval    {1, 2, 3}.C     8.2h    13.3k   22
adapt-2 (AD2)        dev     1               8.0h    14.0k   9
                     eval    2, 3            16.6h   25.6k   18
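The strip split can be sketched as a greedy cumulative-duration cut over the session's annotated utterances. This is a hypothetical re-implementation for illustration; `split_into_strips` and the greedy cut rule are assumptions, not the corpus tooling:

```python
def split_into_strips(utts, n_strips=3):
    """Split a session's utterances, given as (start_time, duration) pairs,
    into contiguous strips of approximately equal annotated-speech duration.
    Illustrative sketch only; not the tooling used to build the corpus."""
    utts = sorted(utts)
    total = sum(dur for _, dur in utts)
    strips, current, acc = [], [], 0.0
    for start, dur in utts:
        current.append((start, dur))
        acc += dur
        # Cut once the running total crosses the next 1/n_strips boundary.
        if len(strips) < n_strips - 1 and acc >= total * (len(strips) + 1) / n_strips:
            strips.append(current)
            current = []
    strips.append(current)  # remaining utterances form the last strip
    return strips
```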

Baseline Systems

Microphone channels
• IHM: individual headset microphone
• SDM: single distant microphone
• MDM: multiple distant microphones
  – 8-channel weighted delay-and-sum beamforming using BeamformIt
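The core idea behind weighted delay-and-sum beamforming can be sketched as follows. This is a minimal illustration, not BeamformIt itself: in a real system the per-channel delays come from TDOA estimation (e.g. GCC-PHAT), and the integer-sample alignment with `np.roll` is a simplification that wraps at the signal edges:

```python
import numpy as np

def delay_and_sum(channels, delays, weights=None):
    """Weighted delay-and-sum beamformer sketch: advance each channel by its
    estimated delay (in samples) and form a weighted average.
    channels: (n_channels, n_samples) array; delays: one int per channel."""
    n_ch, n = channels.shape
    if weights is None:
        weights = np.full(n_ch, 1.0 / n_ch)  # equal weights by default
    out = np.zeros(n)
    for sig, d, w in zip(channels, delays, weights):
        out += w * np.roll(sig, -d)  # align channel, then accumulate
    return out
```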

Standalone system (Kaldi recipe)
• HMM-GMM
  – LDA+MLLT
  – LDA+MLLT+SAT (best HMM-GMM WER on eval set for IHM: 48.8%)
  – LDA+MLLT+SAT+MMI
• DNN-HMM hybrid structure
  – DNN-HMM
  – DNN-HMM+sMBR (best WER on eval set: IHM 42.0%, SDM 77.3%, MDM 74.9%)
  – DNN-HMM+fMLLR
  – DNN-HMM+fMLLR+sMBR
• Using in-domain LM

Adaptation system
• DNN-HMM-GMM, using bottleneck features only
  – DNN: fine-tuning
  – HMM-GMM: MAP adaptation with updated bottleneck features
• Using in-domain LM
• Overall WER on eval set: IHM 47.7%, SDM 78.2%, MDM 75.0%

[Diagram: adaptation pipeline. Out-of-domain AMI data trains the DNN and initialises the HMM-GMM; the in-domain SWC corpora then adapt both (DNN fine-tuning, HMM-GMM MAP adaptation). The SWC LM is built from wargame blog data, wargame wiki and conversational meeting data.]

Language Model

• Topic and vocabulary of SWC differ from those of existing data.
• Game-related text data was harvested from four Warhammer blogs and Warhammer Wikipedia pages.
• A 4-gram LM of 30k words is built by interpolating:

LM component              #words   vocabulary   weight
Conversational web data   165.9M   457.8k       0.65
Blog 1 (addict)           21.1k    3.3k         0.05
Blog 2 (atomic)           126.8k   7.9k         0.05
Blog 3 (cadia)            40.4k    3.9k         0.19
Blog 4 (cast)             71.2k    7.0k         0.06
wikipedia (warhammer)     26.2k    4.1k         0.003
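Linear interpolation combines the component LMs as p(w|h) = Σᵢ λᵢ pᵢ(w|h) with the weights in the table. A sketch of the mixture (the component names and the `interpolate` helper are illustrative assumptions; the tabulated weights are rounded, so they do not sum to exactly 1):

```python
# Interpolation weights as tabulated above (rounded values).
weights = {
    "conversational_web": 0.65,
    "blog_addict": 0.05,
    "blog_atomic": 0.05,
    "blog_cadia": 0.19,
    "blog_cast": 0.06,
    "wiki_warhammer": 0.003,
}

def interpolate(p_components, weights):
    """p(w|h) = sum_i lambda_i * p_i(w|h), given each component's
    probability for the same word and history."""
    return sum(weights[name] * p for name, p in p_components.items())
```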

Baseline Results

Standalone system (S/D/I: overall substitution, deletion and insertion rates)

                   dev    eval    S      D      I     overall WER
IHM  LDA+MLLT      50.9   51.8    35.9   8.9    6.4   51.3
     +SAT          48.7   48.8    34.4   8.1    6.3   48.7
     +MMI          48.8   49.1    34.4   8.8    5.7   48.9
     DNN           44.4   44.3    30.5   9.7    4.1   44.4
     +sMBR         42.0   42.0    29.5   7.6    5.0   42.0
     +fMLLR        48.1   48.1    32.9   11.4   3.8   48.1
     +sMBR         44.9   44.8    31.2   9.8    3.8   44.9
SDM  DNN           78.9   80.5    53.9   21.4   4.4   79.7
     +sMBR         76.4   77.3    39.1   35.5   2.2   76.8
MDM  DNN           76.0   77.9    53.3   18.2   5.5   76.9
     +sMBR         73.8   74.9    36.0   36.0   2.4   74.3
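WER in these tables is the word-level edit distance divided by the number of reference words, with S/D/I the substitution/deletion/insertion breakdown. A minimal sketch computing the overall rate (the per-type breakdown would need a backtrace, omitted here for brevity):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance between the word sequences,
    normalised by the reference length, i.e. (S + D + I) / N_ref."""
    ref, hyp = ref.split(), hyp.split()
    # d[i][j]: minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)
```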

Adaptation system

        dev     eval
        SWC1    SWC2    SWC3    S      D      I     overall WER
IHM     24.9    46.4    50.5    33.4   9.3    5.0   47.7
SDM     55.2    75.0    85.2    53.2   19.1   6.0   78.2
MDM     53.5    71.6    82.4    52.4   15.4   7.3   75.0

Conclusions
• New recordings extend SWC1 to a 24.6h annotated database with multi-media and multi-microphone recordings.
• Four datasets are suggested for standalone training and adaptation. A Kaldi recipe is prepared for standalone training.
• An in-domain 4-gram 30k LM is built and released.
• The best overall WERs obtained are 42.0% for IHM, 76.8% for SDM and 74.3% for MDM, indicating that the SWC corpora are highly challenging for ASR. Beamforming reduces WER by 3-4% relative.
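The 3-4% relative figure can be checked against the standalone results, comparing the SDM and MDM overall WERs:

```python
def rel_reduction(sdm_wer, mdm_wer):
    """Relative WER reduction (%) of beamformed MDM over SDM."""
    return 100 * (sdm_wer - mdm_wer) / sdm_wer

# DNN:      79.7 -> 76.9, about 3.5% relative
# DNN+sMBR: 76.8 -> 74.3, about 3.3% relative
```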

Funded by EPSRC Natural Speech Technology Programme Grant EP/I031022/1