1 Problems and Prospects in Collecting Spoken Language Data Kishore Prahallad Suryakanth V...
1
Problems and Prospects in Collecting Spoken Language Data
Kishore Prahallad
Suryakanth V Gangashetty
B. Yegnanarayana
Raj Reddy
IIIT Hyderabad, India
Carnegie Mellon University, USA.
2
Outline
Need for digital library of audio and video data
Characteristics of spoken language data
Prototype data collection
– IIIT Hyderabad
– IIT Madras
– Lessons Learnt
Proposal to collect IL data
– as a part of Jim Baker's global project.
3
Need for Digital Library of Audio & Video Data
Current and future data will be in audio and video formats
Current technology makes it possible to digitize and store such large amounts of data
Collection, storage and indexing of such data makes it possible to provide information to current and future generations
Acts as a test bed for several research challenges that exist in organizing, indexing and retrieving such large data collections
– Algorithms for quick and easier access to the information present in AV format by providing a query using text / audio / video modes
– Algorithms using multi-modal data for biometric authentication
– Development of multi-lingual speech synthesis and speech recognition systems
4
Characteristics of Spoken Language Data
Message - Information to be conveyed
Speaker – Who is the speaker?
His/her background – Age, gender, literacy levels, knowledge levels, mannerisms etc.
Emotions – Anger, sadness, happiness etc.
Idiolect – An individual's distinctive style of speaking
Medium of transmission – Microphone, telephone, satellite etc.
Environment - party-environment, airport/station
Language
Dialect – grammar and the vocabulary associated with a regional or social use of a language.
Culture and civilization – The richness of usage of vocabulary, grammar etc. indicates the times of the language and the society.
5
Characteristics of Spoken Language Data
How was a language spoken 25 years ago, 50 years ago, 100 years ago and beyond?
How was a famous poem recited or sung by the author?
How was a particular language spoken in different geographical locations of a state/country?
How has a particular language/dialect evolved over a period of time?
What were the rare languages/dialects (which are no longer in existence)? How were they spoken?
6
Phase 0: Prototype data collection at IIIT Hyd
High quality studio recordings
– 2 hrs of single speaker recordings for speech synthesis
– Telugu, Hindi, Tamil and Indian-English
– Developed text to speech systems in these 4 languages
Telephone and Cell-phone corpus
– 150 hrs (540 speakers)
– Telugu, Tamil and Marathi
– Developed speech recognition systems in these 3 languages
7
Phase 0: Prototype data collection at IIT Madras
15 hours (72 speakers)
TV news in Tamil, Telugu and Hindi Languages
– Text to speech systems (TTS)
– Language Identification
– Duration modeling for TTS systems
8
Tools Aiding Acquisition/Correction of Speech Data
Transcription correction tool (TCT)
– Spoken errors at phone, syllable, word level
– Background noise, abrupt begin or end, low SNR
– TCT corrects the above errors in three levels
Audio & Video Transcription Tool
– Used to annotate movie databases
Correction of Segment labels
– Emulabel
9
Lessons Learnt
Speech correction needs 3-6 times more effort than collection
– Better to collect more data than to correct it
Needs a unified framework
– Standardized processes, procedures and tools
Need larger collection of spoken and text corpora
– For building practical speech systems in Indian languages
10
Proposal for collection of larger Spoken Language Data for IL
Focus on information present in speech mode
Collect spoken language data from all Indian languages and also from neighboring countries
Collect about 200,000 (0.2 M) hours of speech
– As a part of Jim Baker's global project of collecting 1 Million hours of speech
11
New in our approach
Collection of large speech data up to 200,000 (0.2 M) hours
– All Indian languages and dialects
23 official Indian languages
Approx. 10,000 hours per language
– All types: Traditional, read, spoken, conversational, dialog, movies, broadcast etc.
– All modes: microphone, clean, telephone, cellphone, satellite etc.
Standard procedure for organizing, annotating and indexing
More focus on larger collection (and on elimination rather than correction)
Make this data available for general public use
12
Key Make-A-Difference Capability
Availability of information (stories, lectures, poems, books, articles) in spoken language
For illiterate
Vision Impaired
Collection and Storage of spoken language data of popular as well as rare languages & dialects
Promotes research and development in
– Speech Technology
Speech-to-speech translation in Indian languages
Phonetic engine (Language Independent)
Speech synthesis (Text-to-speech for Indian languages)
Speaker recognition (Text independent and dependent)
Language Identification
Speech enhancement
Speech signal processing
– Biometrics:
Multimodal: Audio-Video modes
– Information Access, Storage and Retrieval
Audio-video data (indexing)
Data Mining (searching)
Speech Coding (Ultra-low bit coding)
13
Implementation Plan
Phase 1: (3.5 months)
– 10 languages
– 33,300 hours
Phase 2: (8 months)
– 10 (of phase 1) languages
– 66,000 hours
Phase 3: (10 months)
– 13 - remaining languages
– 80,000 hours
14
Mid-Term and Final Terms
Mid-Term
– Phase 1, collection of 33,300 hours of speech
– Collection, Storage and Indexing of speech data for public information access
– Visible research output using the speech data
– Demonstrations of speech technology products
Speech recognition in 10 languages
Final Term
– Phase 1 + Phase 2
17
Impact of Audio Digital Library
Availability of information in spoken language form for illiterate and others
Promotes research in speech technology for Indian languages
Enables development of speech technology products useful for the common man
Examples:
– Speech-speech translation systems
For information exchange
– Screen readers
For illiterate and physically challenged
– Naturally speaking dialog systems
For information access over voice mode
18
Phase 1: Time Estimate
Phase 1:
– 10 official Indian languages
– Parallel collection of data
– ~ 3000 hours per language
5,000 - 10,000 speakers
> 10 min of speech per speaker
– Total: 33,300 hours
Time Estimates: (~ 3.5 months for all 10 languages)
– 10-person team per language
– Each person works
8 hours a day
30 mins of speech recording per hour
– 1-3 speakers per hour
240 mins of speech per day
– 1-24 speakers per day
– 240 speakers per day
– 20,000 speakers per language in 84 working days
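The time estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the variable and function names are hypothetical, and it assumes the upper bound of 3 speakers per hour and that the 240 speakers per day refers to the whole 10-person team. Under those assumptions the results land close to the slide's rounded figures (about 20,000 speakers and roughly 3,300 hours per language).

```python
# Figures taken from the slide; names are illustrative, not from the source.
PERSONS_PER_TEAM = 10
HOURS_PER_DAY = 8
RECORDED_MIN_PER_HOUR = 30
MAX_SPEAKERS_PER_HOUR = 3   # assumption: upper end of the 1-3 range
WORKING_DAYS = 84
LANGUAGES = 10


def phase1_estimate():
    """Return the per-language and total collection estimates."""
    min_per_person_day = HOURS_PER_DAY * RECORDED_MIN_PER_HOUR
    speakers_per_person_day = HOURS_PER_DAY * MAX_SPEAKERS_PER_HOUR
    speakers_per_team_day = PERSONS_PER_TEAM * speakers_per_person_day
    speakers_per_language = speakers_per_team_day * WORKING_DAYS
    hours_per_language = (
        PERSONS_PER_TEAM * min_per_person_day * WORKING_DAYS / 60
    )
    return {
        "min_per_person_day": min_per_person_day,
        "speakers_per_team_day": speakers_per_team_day,
        "speakers_per_language": speakers_per_language,
        "hours_per_language": hours_per_language,
        "total_hours": hours_per_language * LANGUAGES,
    }


if __name__ == "__main__":
    for name, value in phase1_estimate().items():
        print(f"{name}: {value}")
```

The computed totals are slightly above the slide's 33,300 hours, which suggests the slide rounds the per-language figure down.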
19
Phase 1: Cost Estimate
Man power cost: Rs 140 Lakhs
Equipment cost: Rs 55 Lakhs
Communication cost: Rs 40 Lakhs
Contingency (10%): Rs 25 Lakhs
Total Cost: Rs 2.6 Crores (~ $ 565,000)
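As a quick check of how the total is built up, the following sketch sums the line items in lakhs and converts to crores (1 crore = 100 lakhs). The names are hypothetical. Note that 10% of the Rs 235 lakh subtotal is Rs 23.5 lakhs, so the slide's Rs 25 lakhs contingency appears to be rounded up; the exchange rate printed at the end is only what the slide's own figures imply, not a quoted rate.

```python
LAKH = 100_000
CRORE = 100 * LAKH

# Line items from the slide, in lakhs of rupees.
COSTS_LAKHS = {"manpower": 140, "equipment": 55, "communication": 40}
CONTINGENCY_LAKHS = 25   # stated on the slide as "10%"
USD_TOTAL = 565_000      # dollar figure quoted on the slide


def phase1_total_lakhs(costs, contingency):
    """Sum the line items and the contingency, all in lakhs."""
    return sum(costs.values()) + contingency


subtotal_lakhs = sum(COSTS_LAKHS.values())
total_lakhs = phase1_total_lakhs(COSTS_LAKHS, CONTINGENCY_LAKHS)
total_crores = total_lakhs * LAKH / CRORE
implied_rs_per_usd = total_lakhs * LAKH / USD_TOTAL

if __name__ == "__main__":
    print(f"subtotal: Rs {subtotal_lakhs} lakhs")
    print(f"exact 10% contingency: Rs {subtotal_lakhs * 0.10:.1f} lakhs")
    print(f"total: Rs {total_lakhs} lakhs = Rs {total_crores} crores")
    print(f"implied exchange rate: Rs {implied_rs_per_usd:.1f} per USD")
```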
20
Man-Power Cost
Data collection Team: Rs 86 lakhs
10 (for data collection) x Rs 10 K PM
10 (for data correction) x Rs 10 K PM
1 data manager (Rs 15 K PM)
4 months cost: 8,60,000 per language
5 engineers: Rs 4 Lakhs
– B.Tech Level (Rs 20,000 PM)
Gifts per speaker: Rs 50 Lakhs
– Rs 25 per speaker
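The manpower figure can be broken down the same way. This is a sketch under stated assumptions, not the authors' worksheet: it assumes the per-language team cost is multiplied across the 10 Phase 1 languages, that the 5 engineers are shared across languages and paid for the same 4 months, and that gifts go to the 20,000 speakers per language given in the time estimate. With those assumptions the three items add up to the Rs 140 lakhs quoted in the cost estimate.

```python
LAKH = 100_000
LANGUAGES = 10
MONTHS = 4
SPEAKERS_PER_LANGUAGE = 20_000  # from the Phase 1 time estimate


def team_cost_per_language():
    """Four-month cost of one language team, in rupees."""
    collectors = 10 * 10_000   # 10 staff for data collection
    correctors = 10 * 10_000   # 10 staff for data correction
    manager = 1 * 15_000       # 1 data manager
    return (collectors + correctors + manager) * MONTHS


def manpower_breakdown():
    """Return each manpower item and the total, in rupees."""
    teams = team_cost_per_language() * LANGUAGES
    engineers = 5 * 20_000 * MONTHS          # assumption: shared, 4 months
    gifts = 25 * SPEAKERS_PER_LANGUAGE * LANGUAGES
    return {
        "teams": teams,
        "engineers": engineers,
        "gifts": gifts,
        "total": teams + engineers + gifts,
    }


if __name__ == "__main__":
    for item, rupees in manpower_breakdown().items():
        print(f"{item}: Rs {rupees / LAKH:.0f} lakhs")
```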
21
Machines Cost
Machines:
– 30 servers: Rs 30 Lakhs
3 servers per language
Each server has 4 ports for data collection
– 30 CTI cards: Rs 20 Lakhs
Storage: 20 TB: Rs 5 Lakhs
– Two copies of 20 TB