
An Introduction to Speech Annotation

L.R. PREM KUMAR
Senior Research Assistant
Linguistic Data Consortium for Indian Languages
Central Institute of Indian Languages, Mysore

Copyright 2008 LDC-IL, CIIL


Overview
What is a Corpus?
Speech Corpus and Types
Why do we need speech corpus?
Use of Speech Corpus
Using Speech Corpus in NLP Applications
LDC-IL Speech Corpora
How to Annotate Speech Corpus using Praat?
Guidelines for Annotation
Recording of the Data
Storing LDC-IL Data in NIST Format
The NIST Format
Utility of Annotation


What is a corpus?
'Corpus' means 'body' in Latin, and literally refers to the biological structures that constitute humans and other animals (Wikipedia). A corpus is a collection of spoken language stored on computer and used for language research and writing dictionaries (Macmillan Dictionary 2002). It is a collection of written or spoken texts (Oxford Dictionary 2005). In other words, a corpus is a collection of linguistic data, either compiled as written texts or as a transcription of recorded speech.


Speech Corpus
A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions in a format that can be used to create an acoustic model (which can then be used with a speech recognition engine). There are two types of speech corpora: read speech and spontaneous speech.
1. Read Speech includes:
Book excerpts
Broadcast news
Lists of words
Sequences of numbers
2. Spontaneous Speech includes:
Dialogs - between two or more people (includes meetings)
Narratives - a person telling a story
Map-tasks - one person explains a route on a map to another


Why do we need speech corpus?
To develop tools that facilitate collection of high quality speech data. To collect data that can be used for building speech recognition, speech synthesis, and speech-to-speech translation from one language to another language spoken in India (including Indian English).


Use of Speech Corpus
Speech recognition and speech synthesis
Speech-to-speech translation for a pair of Indian languages
Health care (medical transcription)
Real time voice recognition
Multimodal interfaces to the computer in Indian languages
E-mail readers over the telephone
Readers for the visually disadvantaged
Automatic translation, etc.


Using Speech Corpus in NLP Applications
Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text. Speech synthesis is the artificial production of human speech; a computer system used for this purpose is called a speech synthesizer. It is what helps towards building text-to-speech applications.


LDC-IL Speech Corpora


Speech Dataset Collection
Phonetically balanced vocabulary - 800
Phonetically balanced sentences - 500
Connected text created using phonetically balanced vocabulary - 6
Date format - 2
Command and control words - 250
Proper nouns (400 place and 400 person names) - 824
Most frequent words - 1000
Form and function words - 200
News domain: news, editorial, essay (each text not less than 500 words) - 150


Number of Speakers
Data will be collected from a minimum of 450 speakers (225 male and 225 female) of each language. In addition, natural conversation data from various domains shall also be collected for Indian languages for research into spoken language.


Speech Corpora (Segmented & Warehoused)

S.No  Language    Speakers  Hours
1     Assamese    456       105:51:38
2     Bengali     472       138:18:47
3     Bodo        433       201:10:48
4     Dogri       154       111:32:11
5     Gujarati    450       156:23:04
6     Hindi       450       163:25:47
7     Kannada     492       143:28:54
8     Kashmiri    150       44:59:07
9     Konkani     455       195:14:47
10    Maithili    156       43:33:42
11    Malayalam   314       105:47:05
12    Manipuri    457       107:10:27
13    Marathi     306       168:13:50
14    Nepali      485       145:04:46
15    Oriya       462       165:30:05
16    Punjabi     468       110:48:26
17    Tamil       453       213:37:27
18    Telugu      156       50:51:36
19    Urdu        480       124:19:58

Speech Segmentation
Segmentation of data: Collected speech data is in a continuous form and hence it has to be segmented as per the various content types, i.e., text, sentences, words.
Segmentation tools: Wave Surfer is the tool used for segmentation of speech data.
Warehousing: After segmenting the data according to the various content types, it has to be warehoused properly. The data has to be warehoused for each content type, using the metadata information.


Metadata
Investigator name
Language
Datasheet script
Age group
Sound file format
District
Mother tongue
Place of elementary education
Recording date
Duration/length of recorded item (hh.mm.ss)
Speaker's ID
Dialect (region)
Speaker's gender
Recording environment
State
Place
Educational qualification
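A minimal sketch of how one such metadata record could be held in code. The field names are taken from the list above; the actual LDC-IL datasheet schema is not specified in the slides, and the example values are invented:

```python
from dataclasses import dataclass, asdict

@dataclass
class MetadataRecord:
    """One record per recorded item, following the field list above."""
    investigator_name: str
    language: str
    datasheet_script: str
    age_group: str
    sound_file_format: str          # e.g. "wav, 44100 Hz, 16 bit, stereo"
    district: str
    mother_tongue: str
    place_of_elementary_education: str
    recording_date: str
    duration: str                   # hh.mm.ss
    speaker_id: str
    dialect_region: str
    gender: str
    recording_environment: str      # HOM / PUB / OFF / TEL (see NIST section)
    state: str
    place: str
    educational_qualification: str

# Invented example values, purely for illustration:
rec = MetadataRecord("(investigator)", "Tamil", "Tamil", "21-30",
                     "wav 44100 Hz 16 bit stereo", "Mysore", "Tamil",
                     "Mysore", "2008-11-21", "00.02.35", "STND_fab0a",
                     "Standard", "F", "HOM", "Karnataka", "Mysore", "B.A.")
print(asdict(rec))
```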


Speech Annotation
Annotation of data: Data to be used for speech recognition shall be annotated at phoneme, syllable, word and sentence levels. Data to be used for speech synthesis shall be annotated at phone, phoneme, syllable, word, and phrase levels.
Annotation tools: Tools will be developed for semiautomatic annotation of speech data. These tools will also be useful for annotating speech synthesis databases.


Speech Segmentation
Speech data is in a continuous form and hence it has to be segmented at sentence level by using the Wave Surfer tool. Open the file in Wave Surfer: select waveform and open the file. Each sentence should be segmented, but the duration of a sentence should be no longer than 30 seconds. If a sentence is longer than 30 seconds, it should be segmented at the nearest pause before a full stop. The selection should then be saved in the required folder.
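The 30-second rule lends itself to a simple automatic pre-segmentation pass. A rough sketch, assuming 16-bit mono WAV input and a naive RMS-energy pause detector; the slides prescribe manual segmentation in Wave Surfer, so the detector, thresholds, and file name here are illustrative:

```python
# Detect pauses with an RMS-energy threshold, then cut so that no segment
# exceeds 30 s, preferring the nearest pause before the limit.
import wave
import numpy as np

MAX_SEGMENT_S = 30.0   # maximum utterance length per the guideline
FRAME_S = 0.02         # 20 ms analysis frames
PAUSE_RMS = 300.0      # silence threshold on 16-bit samples (tune per corpus)

def find_pauses(samples, rate):
    """Return start times (s) of frames whose RMS falls below the threshold."""
    frame = int(rate * FRAME_S)
    times = []
    for start in range(0, len(samples) - frame, frame):
        chunk = samples[start:start + frame].astype(np.float64)
        if np.sqrt(np.mean(chunk ** 2)) < PAUSE_RMS:
            times.append(start / rate)
    return times

def cut_points(total_s, pauses):
    """Greedy cuts so no segment exceeds MAX_SEGMENT_S."""
    cuts, last = [], 0.0
    while total_s - last > MAX_SEGMENT_S:
        inside = [t for t in pauses if last < t <= last + MAX_SEGMENT_S]
        # cut at the last pause before the limit, or at the hard limit
        cut = inside[-1] if inside else last + MAX_SEGMENT_S
        cuts.append(cut)
        last = cut
    return cuts

with wave.open("S1_0001_mono.wav") as w:     # hypothetical mono 16-bit file
    rate = w.getframerate()
    samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
print(cut_points(len(samples) / rate, find_pauses(samples, rate)))
```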


How to Annotate Speech Corpus using Praat?


What is speech annotation?
Annotation is about assigning different tags like background noise, background speech, vocal noise, echo etc. to the segmented speech files. While annotating the files we should also keep in mind that the text should correspond to the speech. The term linguistic annotation covers any descriptive or analytic notations applied to raw language data. The added notations may include information of various kinds:
multi-tier transcription of speech in terms of units such as acoustic-phonetic features, syllables, words etc.
syntactic and semantic analysis
paralinguistic information (stress, speaking rate)
non-linguistic information (speaker's gender, age, voice quality, emotions, dialect, room acoustics, additive noise, channel effects)


Formation of LDC-IL Guidelines
There are various tools available for speech segmentation and annotation like CSL, EMU, Transcriber, PRAAT etc. We are using the PRAAT software for the annotation of our speech data. Praat is a product of the Phonetic Sciences department of the University of Amsterdam [4] and hence oriented towards acoustic-phonetic studies by phoneticians. It has multiple functionalities that include speech analysis/synthesis and manipulation, labeling and segmentation, and listening experiments. Guidelines for annotation of our data are adapted from CSLU, OGI, Mississippi University, the SwitchBoard guidelines, and LDC, UPenn.


Guidelines for Annotation
Open the stereo file in Praat and then create a text file. Open both files in Praat and then select the correct text which corresponds to the speech file. The data should be annotated as per the pronunciation: if a word is pronounced wrongly, it should be transcribed as pronounced. The following should be marked while annotating the text:


1. Non-speech sounds should be marked
Human noise: (a) background speech (.bs), (b) vocal noise (.vn).
Non-human: (c) background noise (.bn).
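In Praat these tags end up as labels on intervals of a TextGrid tier. A minimal sketch of writing such a TextGrid programmatically, using Praat's "short" text format; the tier name, interval times, and texts are invented for illustration:

```python
# Write a one-tier Praat TextGrid whose interval labels carry the tags above.
def write_textgrid(path, intervals, xmax):
    """intervals: (start, end, label) tuples covering [0, xmax] without gaps."""
    lines = ['File type = "ooTextFile"', 'Object class = "TextGrid"', "",
             "0", str(xmax), "<exists>", "1",        # one tier follows
             '"IntervalTier"', '"annotation"', "0", str(xmax),
             str(len(intervals))]
    for start, end, label in intervals:
        lines += [str(start), str(end), f'"{label}"']
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

# Background speech, then transcribed speech, then a vocal noise:
write_textgrid("S1_0001.TextGrid",
               [(0.0, 0.6, ".bs"), (0.6, 2.1, "(transcribed text)"),
                (2.1, 2.5, ".vn")], xmax=2.5)
```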


1. Non-speech sounds should be marked [examples: human speech, non-human speech]


2. Three different types of silences need to be marked
Annotation of silences: any silence shorter than 50 ms need not be marked.
a) short silence (possibly intra-word) (sil1): silences of length around 50-150 ms
b) medium silence (possibly inter-word) (sil2): silences of length between 150-300 ms
c) long silence (possibly inter-phrase) (sil3): silences greater than 300 ms
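A direct encoding of this threshold rule; behaviour at exactly the 150 ms and 300 ms boundaries is an assumption, since the slide gives approximate ranges:

```python
def silence_label(duration_s):
    """Return the silence tag, or None for silences under 50 ms (unmarked)."""
    if duration_s < 0.050:
        return None
    if duration_s <= 0.150:
        return "sil1"   # short, possibly intra-word
    if duration_s <= 0.300:
        return "sil2"   # medium, possibly inter-word
    return "sil3"       # long, possibly inter-phrase

assert silence_label(0.04) is None
assert silence_label(0.10) == "sil1"
assert silence_label(0.20) == "sil2"
assert silence_label(0.50) == "sil3"
```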


3. Echo needs to be marked
An echo is a sound that is heard after it has been reflected off a surface such as a wall. Annotation for echo: mark .ec in the beginning of the annotation.


4. Multi-speaker data needs to be annotated
When a new speaker speaks at the foreground level (<text spoken>), a new annotation to mark this is defined (.sc).


5. Cut-off speech and intended speech need to be marked

[mini]*ster means that the speaker intended to speak minister but spoke mini in an unclear fashion and ster clearly. *ster means that the speaker intended to speak minister but spoke only ster.


6. Language Change
Language change like code mixing and code switching needs to be marked as follows: [.lc-english ]


7. Annotation of speech disfluency
Only restarts/false starts need to be marked. For example, the speaker intends to speak bengaluru but speaks be bengaluru. Then mark this as be-bengaluru.


8. Number
Spell out all number sequences except in cases such as 123 or 101 where the numbers have a specific meaning. Transcribe years like 1983 as spoken: nineteen eighty three. Do not use hyphens (twenty eight, not twenty-eight).


9. Mispronunciations
If a speaker mispronounces a word and the mispronunciation is not an actual word, transcribe the word as it is spoken. Utterances should be no longer than 30 seconds, so the annotator should find a long silence (around 500 ms) and split the sentence appropriately. Keep a separate folder for the noisy data; for the time being it was suggested not to annotate it now. An SNR measuring tool will give the percentage of the data which has to be annotated.


9. Mispronunciations [example annotation showing tags such as .ec, sil1, sil2, .vn]


Some more points to be taken into account
Vocal noise followed by a silence: .vn silx (if the silence is more than 50 ms)
Vocal noise, silence, vocal noise and then again a silence of more than 50 ms: .vn silx .vn silx
Please mark the silence in case of background speech too: .bs silx
If background noise is followed by a silence, or if .bn is more than 50 ms: .bn silx
If there is any background noise in any particular position, then mark that within square brackets: [.bn ..]
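These pairing rules can be spot-checked mechanically. A toy validator over whitespace-separated annotation tokens; the token patterns are read off the guidelines above, no such LDC-IL tool is described in the slides, and the checker sees only written tags, not actual silence durations:

```python
import re

NOISE_TAGS = {".vn", ".bs", ".bn"}
SILENCE = re.compile(r"^sil[123]$")

def check_noise_silence_pairs(annotation):
    """Warn when a noise tag is not followed by a silx marker."""
    problems = []
    tokens = annotation.split()
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in NOISE_TAGS and (nxt is None or not SILENCE.match(nxt)):
            problems.append(f"position {i}: {tok} not followed by silx")
    return problems

print(check_noise_silence_pairs(".vn sil2 .vn sil1"))  # -> []
print(check_noise_silence_pairs(".bs some text"))      # -> one warning
```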


Recording of the Data
Data should be recorded in stereo format. The wave files should be preserved in four different forms:
Left channel
Right channel
Converted to mono
Original stereo
NIST files are created for all the above wave files (it is a format in which the files are saved). For example, if a single stereo file is defined as S1_0001.wav, it will be stored as:
left microphone: S1_0001_left.nist
right microphone: S1_0001_right.nist
converted mono: S1_0001_mono.nist
original: S1_0001_stereo.nist
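A sketch of deriving the four forms from one stereo recording, assuming 16-bit WAV input; file naming follows the S1_0001 example, and conversion of each result to .nist is a separate step (see the NIST header sketch in the next sections):

```python
import wave
import numpy as np

def split_stereo(path):
    with wave.open(path) as w:
        assert w.getnchannels() == 2 and w.getsampwidth() == 2
        rate = w.getframerate()
        data = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    frames = data.reshape(-1, 2)                 # column 0 = left, 1 = right
    left, right = frames[:, 0], frames[:, 1]
    # average the channels in a wider type to avoid int16 overflow
    mono = ((left.astype(np.int32) + right) // 2).astype(np.int16)
    stem = path.rsplit(".", 1)[0]
    for tag, nchan, samples in [("left", 1, left), ("right", 1, right),
                                ("mono", 1, mono), ("stereo", 2, data)]:
        with wave.open(f"{stem}_{tag}.wav", "wb") as out:
            out.setnchannels(nchan)
            out.setsampwidth(2)                  # 16-bit samples
            out.setframerate(rate)
            out.writeframes(samples.tobytes())

split_stereo("S1_0001.wav")   # hypothetical input file
```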


Storing LDC-IL Data in NIST Format
LDC-IL has collected read speech consisting of sentences and words. The details are as follows:
1. Databases of 19 Indian languages have been collected.
2. A minimum of 450 speakers was used to collect the database for each language.
3. The environment of the recording is taken into consideration.
4. All recordings are done in stereo.
5. The age group of the speakers is recorded.
6. The sampling rate varies (44100 Hz or 48000 Hz), at 16 bits.
All the above information must go into the header of the NIST file. Items 4 and 6 are generated automatically by the PRAAT software. Labels have to be given for items 1, 2 and 5.


The NIST Format
As some data has been labeled, it was decided earlier that all recordings must be converted to the NIST format. Each line in the header is a triplet; this gives various kinds of information about the waveform, namely database name, speaker information, number of channels, environment. E.g.:

NIST_1A
   1024
database_id -s13 CIIL_PUN_READ
database_version -s3 1.0
recording_environment -s3 HOM
microphone -s5 INBDR
utterance_id -s8 sent_001
speaker_id -s9 STND_fad004
age -i 25
rec_type -s6 stereo
channel_count -i 1
sample_count -i 601083
sample_n_bytes -i 2
sample_byte_format -s2 01
sample_coding -s3 pcm
sample_rate -i 44100
sample_min -i -32768
sample_max -i 32767
end_head
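Since a NIST (SPHERE) header is plain ASCII padded to the size announced on its second line, writing one is mostly string formatting. A sketch that reproduces the example header above and appends raw PCM samples; the -sN length suffixes are computed from the actual string lengths rather than copied, and the input PCM file name is hypothetical:

```python
def nist_header(fields, header_size=1024):
    lines = ["NIST_1A", f"   {header_size}"]
    for key, value in fields:
        if isinstance(value, int):
            lines.append(f"{key} -i {value}")
        else:
            lines.append(f"{key} -s{len(value)} {value}")
    lines.append("end_head")
    header = ("\n".join(lines) + "\n").encode("ascii")
    assert len(header) <= header_size, "header does not fit"
    return header.ljust(header_size, b" ")   # pad to the announced size

fields = [
    ("database_id", "CIIL_PUN_READ"), ("database_version", "1.0"),
    ("recording_environment", "HOM"), ("microphone", "INBDR"),
    ("utterance_id", "sent_001"), ("speaker_id", "STND_fad004"),
    ("age", 25), ("rec_type", "stereo"), ("channel_count", 1),
    ("sample_count", 601083), ("sample_n_bytes", 2),
    ("sample_byte_format", "01"), ("sample_coding", "pcm"),
    ("sample_rate", 44100), ("sample_min", -32768), ("sample_max", 32767),
]

with open("S1_0001_left_raw.pcm", "rb") as pcm, \
     open("S1_0001_left.nist", "wb") as out:
    out.write(nist_header(fields))
    out.write(pcm.read())
```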


1. Database ID
The following was decided for the database ID. The database ID will be a string of length 8 characters. The first 4 letters will correspond to the organization that collects the database. This will be followed by an underscore. The language ID will consist of three characters. For example, the Tamil database collected at CIIL will be given the following name: CIIL_TAM. This will be included in the header using the following tag: database_id -s13 CIIL_TAM_READ (tag for read Tamil speech).
Database version: This will be included in the header using the following tag: database_version -s3 1.0


2. Recording Environment
The recording environment could be one of the following:
1. home (HOM)
2. public places (PUB)
3. office (OFF)
4. telephone (TEL)
E.g.: data recorded in a home should have the following entry in the NIST header: recording_environment -s3 HOM


3. Microphones
For the collection of LDC-IL speech data, we have used the in-built digital recorder microphone (stereo) (INBDR). However, the following are the other types of microphones that can be used:
1. external low quality (LOWQ)
2. external high quality (noise cancelling) (HIGQ)
3. in-built cell phone (INBCP)
4. in-built landline (INBLL)
5. throat microphone (THROT)
6. bone microphone (BONE)
Example: data recorded using a digital recorder with an in-built microphone(s) should have the following entry in the NIST header: microphone -s5 INBDR


4. Utterance ID
Each utterance may be identified by type and number. The entry in the header would be:

utterance_id -s8 <type>_<number>

For example, the entry for the 5th word in the database will be:

utterance_id -s8 word_005


5. Speaker ID
Each speaker may be identified using 9 characters. The entry for each speaker will be:

speaker_id -s9 <dialect>_<gender><id>

For example, a female speaker from South Karnataka with speaker id ab0a (4-character alphanumeric, lower case only):

speaker_id -s9 STND_fab0a
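Putting the utterance-ID and speaker-ID conventions together; the field layout here is inferred from the examples word_005 and STND_fab0a and may not be the full LDC-IL specification:

```python
# Helpers reflecting the ID conventions above: 4-letter dialect code,
# underscore, gender letter, 4-character lower-case alphanumeric id for
# speakers; type plus zero-padded number for utterances.
def speaker_id(dialect: str, gender: str, code: str) -> str:
    assert len(dialect) == 4 and gender in ("f", "m") and len(code) == 4
    return f"{dialect}_{gender}{code}"

def utterance_id(kind: str, number: int) -> str:
    return f"{kind}_{number:03d}"

print(speaker_id("STND", "f", "ab0a"))  # STND_fab0a
print(utterance_id("word", 5))          # word_005
print(utterance_id("sent", 1))          # sent_001
```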


6. Age Group of the Speaker
The speaker should be more than 16 years and not more than 60 years of age.


Utility of Annotation
Annotated speech data is the raw material for development of speech recognition and speech synthesis systems. Acoustic-phonetic study of the speech sounds of a language is essential for determining the parameters of speech synthesis systems following an articulatory or parametric approach.


LDC-IL Tamil Team
Academic Faculty:
S. Thennarasu, Sr. Lecturer
L.R. Prem Kumar, Sr. Research Assistant
R. Amudha, Junior Research Assistant
R. Prabagaran, Junior Resource Person
Technical Faculty:
Mohamed Yoonus, Sr. Lecturer
Vadivel, Lecturer


Speech Annotation Demo


Tamil Academy, SRM University
All the Professors, Teachers, Staff and the Participants & LDC-IL Team