Design, compilation and processing of CUCall: a set of Cantonese spoken language corpora collected...

Design, compilation and processing of CUCall: a set of Cantonese spoken language corpora collected over telephone networks, by W.K. Lo, P.C. Ching, Tan Lee and Helen Meng, The Chinese University of Hong Kong, at ROCLING XIV, 16th August 2001


Page 1: Design, compilation and processing of CUCall: a set of Cantonese spoken language corpora collected over telephone networks by W.K. Lo, P.C. Ching, Tan.

Design, compilation and processing of CUCall: a set of Cantonese spoken language corpora collected over telephone networks

by

W.K. Lo, P.C. Ching, Tan Lee and Helen Meng

The Chinese University of Hong Kong

at

ROCLING XIV

16th August 2001

Page 2

Acknowledgment

• The CUCall data collection is conducted with support from the Innovation and Technology Fund (AF/96/99)
• We are also grateful to the industrial sponsors:
  – Group Sense Limited
  – SmarTone Mobile Communication Limited

Page 3

Outline

• Corpus Design and Organization
  – phonetically oriented
  – application oriented
• Data Collection and Processing
• Data Analysis
• Conclusions

Page 4

Part I: Corpus Design and Organization

Page 5

Overview

• extension to the CUCorpora microphone speech database
• collection of telephone speech data over fixed-line and mobile networks
• allows phonetically oriented and domain-specific applications
  – rich phonetic coverage with speaking style variations
  – words, phrases and digit strings for specific uses

Page 6

CUCall Organization

Page 7

Phonetically Oriented

• 5719 sentences
  – selected from the pools of the CUSENT training and testing sets
  – targeted at phonetic coverage in a biphone context
• 90 short paragraphs
  – enrich the phonetic coverage in addition to the sentence materials
  – capture the variations brought about by the lengthy nature of the reading materials
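One way to picture "phonetic coverage in a biphone context" is a greedy pass over a sentence pool that keeps picking the sentence adding the most unseen adjacent-unit pairs. The sketch below is only an illustration of that idea, not the selection procedure the authors used; the function names, the sentence IDs and the toy unit labels are all hypothetical.

```python
def biphones(units: list[str]) -> set[tuple[str, str]]:
    """Return the set of adjacent unit pairs (biphones) in one sentence."""
    return set(zip(units, units[1:]))


def greedy_select(pool: dict[str, list[str]], budget: int) -> list[str]:
    """Pick up to `budget` sentence IDs, each time taking the sentence
    that contributes the most biphones not yet covered."""
    covered: set[tuple[str, str]] = set()
    chosen: list[str] = []
    remaining = dict(pool)
    while remaining and len(chosen) < budget:
        # Sentence with the largest number of still-uncovered biphones.
        best = max(remaining, key=lambda sid: len(biphones(remaining[sid]) - covered))
        gain = biphones(remaining.pop(best)) - covered
        if not gain:
            break  # nothing left in the pool adds new coverage
        covered |= gain
        chosen.append(best)
    return chosen


# Toy pool: each sentence is a sequence of (hypothetical) initial/final labels.
toy_pool = {
    "s1": ["n", "ei", "h", "ou"],
    "s2": ["n", "ei"],
    "s3": ["m", "aa", "h", "ou"],
}
selected = greedy_select(toy_pool, budget=3)
```

Here `s2` is never selected because its only biphone is already covered by `s1`.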

Page 8

Phonetically Oriented

• 6 spontaneous conversation
  – capture speakers’ spontaneous response
  – content is unlimited and unconstrained
  – contains all kinds of non-speech events, e.g. correction, hesitation, skipped word, …
  – questions must be simple and open-ended

Page 9

Phonetically Oriented

• Criteria for question design
  – simple enough for spontaneous response; avoid calculation, memory recall, etc.
  – answers are expected to differ between speakers
  – responses may be either long or short
  – avoid answers that relate to speakers’ privacy

Page 10

Application Oriented

• 1440 words and phrases
  – simple words covering various domains
    • names of places
    • listed companies
    • foreign currencies
    • navigation commands
• Digit strings
  – strings of digits of various lengths
    • all ten single digits
    • randomly generated strings of length 7, 8 and 16
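The digit-string prompts for one speaker could be produced along the following lines. This is a minimal sketch, not the authors' generator: the function name, the uniform sampling of digits and the seeding are assumptions, and the default of five strings per length is taken from the per-speaker figures in the reading-material statistics.

```python
import random


def make_digit_prompts(rng: random.Random,
                       lengths: tuple[int, ...] = (7, 8, 16),
                       per_length: int = 5) -> list[str]:
    """Build one speaker's digit prompts: the ten single digits,
    then `per_length` random strings for each requested length."""
    prompts = [str(d) for d in range(10)]  # all ten single digits
    for n in lengths:
        for _ in range(per_length):
            # Assumption: each position is drawn uniformly and independently.
            prompts.append("".join(rng.choice("0123456789") for _ in range(n)))
    return prompts


# A per-speaker seed (hypothetical) makes a prompt sheet reproducible.
prompts = make_digit_prompts(random.Random(1))
```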

Page 11

Part II: Data Collection and Processing

Page 12

Collection Process

• Preparation of reading materials
  – prepare reading materials as prompt sheets
  – separate male & female, fixed & mobile lines
• Distribution of prompt sheets
  – distributed hierarchically through agents
• Speakers call
  – speakers call automatic recording servers
  – they are identified by unique serial numbers
• Questionnaire return
  – information on age and telephone network type is collected

Page 13

Data Collection System Set-up

Calling End: from any location, using any telephone, by people from all walks of life

Telephone Companies: mobile/fixed-line network

Recording Servers: fixed-line connection to local telephone companies

Recording End: telephone outlet, telephony hardware, recording system, data storage system

Post-processing of data for various targeted domains of applications

Note: CT board is Dialogic® D/41-ESC

Page 14

Post-processing of Data

• Call validation
  – received prompt sheets are verified against the recorded speech data
  – user information is entered into databases
• Phonemic transcription
  – all accepted speech data are 100% phonemically transcribed at the initial-final level
• Partitioning of collected data
  – collected data are partitioned properly
  – speech data and the transcriptions are organized on a per-speaker basis
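With speech data and transcriptions stored per speaker, a quick consistency check is to list recordings that do not yet have a transcription. The sketch below assumes the `data` and `annotate` subfolders and the `.wav`/`.xsc` extensions shown on the data-processing slide, and assumes a recording and its transcription share a file stem; the function name is hypothetical.

```python
from pathlib import Path


def missing_annotations(speaker_dir: Path) -> list[str]:
    """Return IDs of recordings under data/ with no matching file under annotate/."""
    recorded = {p.stem for p in (speaker_dir / "data").glob("*.wav")}
    annotated = {p.stem for p in (speaker_dir / "annotate").glob("*.xsc")}
    return sorted(recorded - annotated)
```

Calling `missing_annotations(Path("speaker01"))` on a speaker folder would return the utterance numbers still waiting for transcription.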

Page 15

Data Processing After Collection

Validation: identify successful recording sessions

Transcription: accurate verbatim transcription for the speech data

Data Storage: collected telephone speech data

Organization: organize data for easy access

Distribution: printing CDROM for distribution

Example transcriptions: /nei5-hou2-maa1/ /ngo5-hou2-hou2/ /nei5-ne1/ /dou1-ng4-co3-laa1/

Example file layout: \speaker01\data\001.wav \002.wav . . \speaker01\annotate\001.xsc \002.xsc . .
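The example transcriptions on this slide are slash-delimited utterances whose syllables are joined by hyphens and end in a tone digit. A small parser for that shape might look like the sketch below. It assumes lowercase romanized syllables with tone numbers 1 to 6, which matches the examples shown but is not stated in the slides; it does not split syllables into initials and finals, and the function names are hypothetical.

```python
import re

# Assumption: a syllable is lowercase letters followed by one tone digit 1-6.
SYLLABLE = re.compile(r"^([a-z]+)([1-6])$")


def split_utterances(text: str) -> list[str]:
    """Split a run such as '/nei5-ne1//dou1-ng4-co3-laa1/' into utterances."""
    return re.findall(r"/([^/]+)/", text)


def parse_utterance(utterance: str) -> list[tuple[str, int]]:
    """Turn 'nei5-hou2-maa1' into [(base syllable, tone), ...]."""
    result = []
    for syl in utterance.strip("/").split("-"):
        match = SYLLABLE.match(syl)
        if match is None:
            raise ValueError(f"unexpected syllable: {syl!r}")
        result.append((match.group(1), int(match.group(2))))
    return result
```

Dropping the tone from each pair gives the base syllable, which is the distinction behind the separate tonal and base syllable counts reported for the reading materials.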

Page 16

Part III: Data Analysis

Page 17

Statistics of Reading Materials

Part       # per speaker        # tonal syl.   # base syl.   syl. count
Phonetically oriented corpora
sent.      50 (out of 5719)     1399           579           4 to 31
para.      3 (out of 90)        768            418           23 to 120
Application-specific corpora
1-digit    10
7-digit    5
8-digit    5
16-digit   5
words      48 (out of 1440)     562            344           2 to 8
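The per-speaker figures allow a quick back-of-the-envelope check: how many items one prompt sheet carries, and how many speakers are needed before every item in a pool could have been read at least once. The arithmetic below assumes prompt sheets draw non-overlapping items from each pool, which the slides do not state; the helper name is hypothetical.

```python
import math


def min_speakers(pool_size: int, per_speaker: int) -> int:
    """Fewest speakers needed to cover a pool if sheets never repeat an item."""
    return math.ceil(pool_size / per_speaker)


sentence_speakers = min_speakers(5719, 50)   # sentences
paragraph_speakers = min_speakers(90, 3)     # paragraphs
word_speakers = min_speakers(1440, 48)       # words and phrases

# Items on one prompt sheet: sentences, paragraphs, 1/7/8/16-digit strings, words.
items_per_speaker = 50 + 3 + 10 + 5 + 5 + 5 + 48
```

Under that assumption the sentence pool is the limiting one, needing 115 speakers for a single full pass, against 30 each for the paragraph and word pools.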

Page 18

Frequency-of-frequency (FOF)

[Figure: frequency-of-frequency plots; panels: Sentence, Paragraph]
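A frequency-of-frequency table records how many distinct units occur exactly once, exactly twice, and so on. The sketch below computes it for a list of tokens; treating tonal syllables as the counted unit is an assumption, since the slide only names the plots, and the toy tokens are made up for illustration.

```python
from collections import Counter


def frequency_of_frequency(tokens: list[str]) -> dict[int, int]:
    """Map each occurrence count to the number of distinct tokens with that count."""
    counts = Counter(tokens)              # token -> how often it occurs
    fof = Counter(counts.values())        # occurrence count -> number of tokens
    return dict(sorted(fof.items()))


toy_tokens = ["nei5", "hou2", "maa1", "ngo5", "hou2", "hou2", "nei5"]
fof = frequency_of_frequency(toy_tokens)
```

In the toy data two syllables occur once, one occurs twice and one occurs three times, so `fof` is `{1: 2, 2: 1, 3: 1}`.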

Page 19

Part IV: Conclusions

Page 20

Current Status

• the collection process is divided into several stages
• expected completion date: March 2002
• until now, over 200 hours of data (from 1000 speakers) have been collected
  – 120 hours for phonetically oriented data
  – 80 hours for application-specific data
• over half of the collected data have been phonemically transcribed

Page 21

Conclusions

• the design and collection process for the Cantonese telephone speech corpora is presented
• the corpora are designed to cover both phonetically oriented and application-specific data
• they also include long reading materials and open questions for spontaneous data
• details of post-processing and data analysis are given

Page 22

Thank You