1 Problems and Prospects in Collecting Spoken Language Data Kishore Prahallad Suryakanth V...

1

Problems and Prospects in Collecting Spoken Language Data

Kishore Prahallad
Suryakanth V Gangashetty
B. Yegnanarayana
Raj Reddy

IIIT Hyderabad, India
Carnegie Mellon University, USA

2

Outline

Need for digital library of audio and video data
Characteristics of spoken language data
Prototype data collection
– IIIT Hyderabad
– IIT Madras
– Lessons Learnt
Proposal to collect Indian language (IL) data
– as a part of Jim Baker's global project

3

Need for Digital Library of Audio & Video Data

Current and future data will be in audio and video formats
Current technology makes it possible to digitize and store such large amounts of data
Collection, storage and indexing of such data makes it possible to provide information to current and future generations
Acts as a test bed for several research challenges that exist in organizing, indexing and retrieving such large data collections
– Algorithms for quicker and easier access to the information present in AV format, by providing a query using text / audio / video modes
– Algorithms using multi-modal data for biometric authentication
– Development of multi-lingual speech synthesis and speech recognition systems

4

Characteristics of Spoken Language Data

Message – Information to be conveyed
Speaker – Who is the speaker?
    His/her background – Age, gender, literacy levels, knowledge levels, mannerisms etc.
    Emotions – Anger, sadness, happiness etc.
    Idiolect – An individual's distinctive style of speaking
Medium of transmission – Microphone, telephone, satellite etc.
Environment – Party environment, airport/station
Language
    Dialect – Grammar and the vocabulary associated with a regional or social use of a language
    Culture and civilization – The richness of usage of vocabulary, grammar etc. indicates the times of the language and the society

5

Characteristics of Spoken Language Data

How was a language spoken 25 years ago, 50 years ago, 100 years ago and beyond?
How was a famous poem recited or sung by its author?
How was a particular language spoken in different geographical locations of a state/country?
How has a particular language/dialect evolved over a period of time?
What were the rare languages/dialects (which are no longer in existence)? How were they spoken?

6

Phase 0: Prototype data collection at IIIT Hyd

High quality studio recordings
– 2 hrs of single speaker recordings for speech synthesis
– Telugu, Hindi, Tamil and Indian-English
– Developed text to speech systems in these 4 languages
Telephone and Cell-phone corpus
– 150 hrs (540 speakers)
– Telugu, Tamil and Marathi
– Developed speech recognition systems in these 3 languages

7

Phase 0: Prototype data collection at IIT Madras

15 hours (72 speakers)
TV news in Tamil, Telugu and Hindi Languages
– Text to speech systems (TTS)
– Language Identification
– Duration modeling for TTS systems

8

Tools Aiding Acquisition/Correction of Speech Data

Transcription correction tool (TCT)
– Spoken errors at phone, syllable, word level
– Background noise, abrupt begin or end, low SNR
– TCT corrects the above errors in three levels
Audio & Video Transcription Tool
– Used to annotate movie databases
Correction of Segment labels
– Emulabel

9

Lessons Learnt

Speech correction needs 3-6 times more than collection
– Better to collect more data than to correct it
Needs a unified framework
– Standardize processes, procedures and tools
Need larger collection of spoken and text corpora
– For building practical speech systems in Indian languages

10

Proposal for collection of larger Spoken Language Data for IL

Focus on information present in speech mode
Collect spoken language data from all Indian languages and also from neighboring countries
Collect about 200,000 (0.2 M) hours of speech
– As a part of Jim Baker's global project of collecting 1 Million hours of speech

11

New in our approach

Collection of large speech data, up to 200,000 (0.2 M) hours
– All Indian languages and dialects
    23 official Indian languages
    Approx. 10,000 hours per language
– All types: Traditional, read, spoken, conversational, dialog, movies, broadcast etc.
– All modes: microphone, clean, telephone, cellphone, satellite etc.
Standard procedure for organizing, annotating and indexing
More focus on larger collection (and on elimination rather than correction)
Make this data available for general public use

12

Key Make-A-Difference Capability

Availability of information (stories, lectures, poems, books, articles) in spoken language
    For illiterate
    Vision Impaired
Collection and Storage of spoken language data of popular as well as rare languages & dialects
Promotes research and development in
– Speech Technology
    Speech-to-speech translation in Indian languages
    Phonetic engine (Language Independent)
    Speech synthesis (Text-to-speech for Indian languages)
    Speaker recognition (Text independent and dependent)
    Language Identification
    Speech enhancement
    Speech signal processing
– Biometrics
    Multimodal: Audio-Video modes
– Information Access, Storage and Retrieval
    Audio-video data (indexing)
    Data Mining (searching)
    Speech Coding (Ultra-low bit coding)

13

Implementation Plan

Phase 1: (3.5 months)
– 10 languages
– 33,300 hours
Phase 2: (8 months)
– 10 (of phase 1) languages
– 66,000 hours
Phase 3: (10 months)
– 13 remaining languages
– 80,000 hours
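The three phases can be totalled with a short sketch. It assumes the phases run back to back, which the slide does not state, and the names `PHASES` and `plan_totals` are illustrative only. Under that assumption the plan adds up to 21.5 months and 179,300 hours, somewhat below the "about 200,000 hours" target quoted in the proposal.

```python
# Phase figures copied from the implementation plan: (months, languages, hours).
PHASES = {
    "Phase 1": (3.5, 10, 33_300),
    "Phase 2": (8.0, 10, 66_000),
    "Phase 3": (10.0, 13, 80_000),
}


def plan_totals(phases):
    """Return (total months, total hours), assuming the phases run sequentially."""
    months = sum(m for m, _, _ in phases.values())
    hours = sum(h for _, _, h in phases.values())
    return months, hours


total_months, total_hours = plan_totals(PHASES)
print(f"{total_months} months, {total_hours:,} hours")
```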

14

Mid-Term and Final Terms

Mid-Term
– Phase 1, collection of 33,300 hours of speech
– Collection, Storage and Indexing of speech data for public information access
– Visible research output using the speech data
– Demonstrations of speech technology products
    Speech recognition in 10 languages
Final Term
– Phase 1 + Phase 2

15

Q & A

16

Misc….

17

Impact of Audio Digital Library

Availability of information in spoken language form for illiterate and others
Promotes research in speech technology for Indian languages
Enables development of speech technology products useful for the common man
Examples:
– Speech-to-speech translation systems
    For information exchange
– Screen readers
    For illiterate and physically challenged
– Naturally speaking dialog systems
    For information access over voice mode

18

Phase 1: Time Estimate

Phase 1:
– 10 official Indian languages
– Parallel collection of data
– ~ 3000 hours per language
    5,000 - 10,000 speakers
    > 10 min of speech per speaker
– Total: 33,300 hours
Time Estimates: (~ 3.5 months for all 10 languages)
– 10-person team per language
– Each person works
    8 hours a day
    30 mins of speech recording per hour
    – 1-3 speakers per hour
    240 mins of speech per day
    – 1-24 speakers per day
– 240 speakers per day
– 20,000 speakers per language in 84 working days
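The arithmetic behind this estimate can be sketched as follows. The constants are the slide's figures; the 10 minutes per speaker is taken as the lower bound quoted on the slide, and the function and variable names are illustrative. The sketch gives 20,160 speakers and 3,360 hours per language (33,600 hours over 10 languages), which the slide appears to round to 20,000 speakers and 33,300 hours.

```python
# Figures from the slide; names are illustrative.
TEAM_SIZE = 10            # persons per language
HOURS_PER_DAY = 8         # working hours per person
SPEECH_MIN_PER_HOUR = 30  # minutes of recorded speech per working hour
MIN_PER_SPEAKER = 10      # assumed: each speaker contributes about 10 min
WORKING_DAYS = 84         # roughly 3.5 months
LANGUAGES = 10            # collected in parallel


def phase1_throughput():
    """Reproduce the slide's throughput chain for one language team."""
    speech_min_per_person_day = HOURS_PER_DAY * SPEECH_MIN_PER_HOUR
    speakers_per_person_day = speech_min_per_person_day // MIN_PER_SPEAKER
    speakers_per_team_day = speakers_per_person_day * TEAM_SIZE
    speakers_per_language = speakers_per_team_day * WORKING_DAYS
    hours_per_language = (
        speech_min_per_person_day * TEAM_SIZE * WORKING_DAYS / 60
    )
    return {
        "speech_min_per_person_day": speech_min_per_person_day,
        "speakers_per_team_day": speakers_per_team_day,
        "speakers_per_language": speakers_per_language,
        "hours_per_language": hours_per_language,
        "total_hours": hours_per_language * LANGUAGES,
    }


for name, value in phase1_throughput().items():
    print(f"{name}: {value}")
```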

19

Phase 1: Cost Estimate

Man power cost: Rs 140 Lakhs
Equipment cost: Rs 55 Lakhs
Communication cost: Rs 40 Lakhs
Contingency (10%): Rs 25 Lakhs

Total Cost: Rs 2.6 Crores (~ $ 565,000)
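A minimal sketch of how the four heads add up to the stated total, using 1 lakh = 100,000 and 1 crore = 100 lakhs. The names are illustrative. The exchange rate is not given on the slide, so the sketch only derives the rate implied by the two totals (about Rs 46 per dollar). A strict 10% of the first three heads would be Rs 23.5 lakhs, so the Rs 25 lakhs on the slide is presumably rounded up.

```python
LAKH = 100_000
CRORE = 100 * LAKH

# Cost heads from the slide, in lakhs of rupees.
COSTS_LAKHS = {
    "man power": 140,
    "equipment": 55,
    "communication": 40,
    "contingency": 25,
}


def total_cost_rupees(costs_lakhs):
    """Sum the cost heads and convert lakhs to rupees."""
    return sum(costs_lakhs.values()) * LAKH


total_rs = total_cost_rupees(COSTS_LAKHS)
total_crores = total_rs / CRORE

# Exchange rate implied by the slide's dollar figure (not stated in the source).
implied_rs_per_usd = total_rs / 565_000

# Strict 10% contingency on the three non-contingency heads, in lakhs.
base_lakhs = sum(v for k, v in COSTS_LAKHS.items() if k != "contingency")
strict_contingency_lakhs = base_lakhs / 10

print(total_crores, round(implied_rs_per_usd, 2), strict_contingency_lakhs)
```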

20

Man-Power Cost

Data collection Team: Rs 86 lakhs
    10 (for data collection) x Rs 10 K PM
    10 (for data correction) x Rs 10 K PM
    1 data manager (Rs 15 K PM)
    4 months cost: Rs 8,60,000 per language
5 engineers: Rs 4 Lakhs
– B.Tech Level (Rs 20,000 PM)
Gifts for speakers: Rs 50 Lakhs
– Rs 25 per speaker
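The man-power figures can be reproduced with a short sketch. Two inputs are assumptions rather than statements on this slide: the engineers are costed for the same 4 months as the collection teams, and the gift total uses the 20,000 speakers per language from the time estimate. With those assumptions the three heads come to Rs 86, 4 and 50 lakhs, or Rs 140 lakhs in all. All names are illustrative.

```python
LAKH = 100_000
MONTHS = 4      # costing period used on the slide
LANGUAGES = 10  # Phase 1 languages


def manpower_cost_rupees(speakers_per_language=20_000):
    """Break the Phase 1 man-power cost into its three heads (in rupees)."""
    # 10 collectors + 10 correctors at Rs 10,000 PM, 1 data manager at Rs 15,000 PM.
    team_per_month = 10 * 10_000 + 10 * 10_000 + 1 * 15_000
    team_per_language = team_per_month * MONTHS
    team_total = team_per_language * LANGUAGES
    # Assumed: the 5 engineers are paid for the same 4 months.
    engineers = 5 * 20_000 * MONTHS
    # Rs 25 gift per speaker, speaker count taken from the time estimate.
    gifts = 25 * speakers_per_language * LANGUAGES
    return {"team": team_total, "engineers": engineers, "gifts": gifts}


costs = manpower_cost_rupees()
for head, rupees in costs.items():
    print(f"{head}: Rs {rupees / LAKH:g} lakhs")
print(f"total: Rs {sum(costs.values()) / LAKH:g} lakhs")
```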

21

Machines Cost

Machines:
– 30 servers: Rs 30 Lakhs
    3 servers per language
    Each server has 4 ports for data collection
– 30 CTI cards: Rs 20 Lakhs
Storage: 20 TB: Rs 5 Lakhs
– Two copies of 20 TB

22

Communications Cost

Telephonic charges: Rs 20 Lakhs
– Rs 1 per min (local telephonic charges)
Transportation: Rs 20 Lakhs
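The telephonic charge is consistent with billing every recorded minute at the local rate. The sketch below assumes billed minutes equal recorded minutes for the 33,300 hours of Phase 1, which the slide does not spell out, and the function name is illustrative. It gives Rs 19.98 lakhs, close to the Rs 20 lakhs quoted.

```python
LAKH = 100_000


def telephone_cost_rupees(hours_of_speech=33_300, rate_per_min=1):
    """Cost of recording over the phone, assuming every recorded minute is billed."""
    return hours_of_speech * 60 * rate_per_min


cost = telephone_cost_rupees()
print(f"Rs {cost:,} (about {cost / LAKH:.2f} lakhs)")
```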
