Microsoft Research Academic Services GTM Draft...Crowdsourcing Real-Time Transcription •Benefits:...
Walter S. Assistant Professor, University of Michigan (CSE)
Audio Accessibility
Speech/Audio Access
Speech Events Language
Real-Time Speech Transcription (Captioning)
• Goal: convert spoken language into written text quickly
  • Aligning visual / referenced content with speech requires doing this in under 5-10 s
• Challenges:
  • Speakers vary (accent, intonation, speech patterns, prosody, speed, cold/flu/etc., …)
  • Environments vary (echo, distance, acoustic properties, background noise, …)
  • Settings vary (speaker direction, mic type, mic location/movement, content topic, …)
  • Natural language is weird (highly variable/flexible, use of anaphora, vague references, …)
  • Spoken language is weirder (stops / restarts / repeats, less grammatical, less formal, …)
Crowdsourcing Real-Time Transcription
• Benefits:
  • People understand speech well, in many different domains
  • People can more robustly adapt to changes on the fly
  • People can keep up with an evolving conversation over long periods of time
• Challenges:
  • People cannot type fast enough to capture a useful piece of continuous speech!
    • (usually people get 10-20% of what is said, depending on typing proficiency)
Scribe: Real-Time Transcription by Non-Experts
[Figure: system diagram. Labels: Speaker, Media Server, Crowd, Merging Server, Students; latency annotation "3-4s?"]
[ Lasecki et al., UIST 2012 ]
Scribe: Real-Time Transcription by Non-Experts
[Figure: four labels, each reading "1-3 sec"]
[ Lasecki et al., UIST 2012 ]
Scribe: Real-Time Transcription by Non-Experts
Partial captions:
• the brown fox jumped
• quick fox lazy dog
• Fox jumped over the lazy
Combiner → Final Caption:
• the quick brown fox jumped over the lazy dog
[ Lasecki et al., UIST 2012 ]
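The combining step on this slide can be illustrated with a short sketch. This is not the Scribe algorithm from Lasecki et al.; it is a simplified stand-in that assumes every typed word arrives with a timestamp. The names `TimedWord`, `merge_partial_captions`, and `window`, and all timestamps in the example, are hypothetical. Words are ordered by time, and a repeat of the same word within a short window is treated as a duplicate from another worker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedWord:
    word: str
    time: float  # seconds from start of audio (assumed to be available)


def merge_partial_captions(partials, window=0.3):
    """Merge several partial captions into one word list.

    Assumption: two reports of the same word at most `window` seconds
    apart refer to the same spoken word.
    """
    # Pool every worker's words and order them by timestamp.
    pooled = sorted((w for p in partials for w in p), key=lambda w: w.time)
    merged = []
    for w in pooled:
        token = w.word.lower()
        # Skip the word if an identical one was already kept nearby in time.
        duplicate = any(
            m.word == token and w.time - m.time <= window for m in merged
        )
        if not duplicate:
            merged.append(TimedWord(token, w.time))
    return [m.word for m in merged]


# The three partial captions from the slide, with made-up timestamps.
workers = [
    [TimedWord("the", 0.0), TimedWord("brown", 0.8),
     TimedWord("fox", 1.2), TimedWord("jumped", 1.6)],
    [TimedWord("quick", 0.4), TimedWord("fox", 1.3),
     TimedWord("lazy", 2.8), TimedWord("dog", 3.2)],
    [TimedWord("Fox", 1.1), TimedWord("jumped", 1.7), TimedWord("over", 2.0),
     TimedWord("the", 2.4), TimedWord("lazy", 2.9)],
]
final_caption = " ".join(merge_partial_captions(workers))
print(final_caption)
```

The time window is what lets the second "the" survive: it is the same word as the first, but far enough away in time to count as a separate utterance.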
Scribe: Real-Time Transcription by Non-Experts
• Result: 95% recall, 87% precision, in 2.9s
[ Lasecki et al., UIST 2012 ]
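Recall and precision for a caption can be computed at the word level. The sketch below uses a common bag-of-words definition, which may differ from how the UIST 2012 paper measured its figures; `word_precision_recall` is a hypothetical helper, and the example caption is made up.

```python
from collections import Counter


def word_precision_recall(reference: str, caption: str) -> tuple[float, float]:
    """Return (precision, recall) over lowercase word counts.

    precision = matched words / words in the caption
    recall    = matched words / words in the reference
    """
    ref = Counter(reference.lower().split())
    hyp = Counter(caption.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    matched = sum((ref & hyp).values())
    precision = matched / sum(hyp.values()) if hyp else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall


reference = "the quick brown fox jumped over the lazy dog"
caption = "the quick fox jumped over lazy dog"  # two words missing
precision, recall = word_precision_recall(reference, caption)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example every captioned word is correct, so precision is 1.0, while recall is 7/9 because two of the nine spoken words were missed.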
Hybrid Intelligence for Real-Time Transcription
Partial captions:
• the brown fox jumped
• quick fox lazy dog
• Fox jumped over the lazy
Combiner → Final Caption:
• the quick brown fox jumped over the lazy dog
Takeaways: Speech Access Powered by the Crowd
• We can create systems that let people collectively outperform the individual on streaming tasks
• Domain experts (volunteers, students, etc.) can help
  • More generally: democratize access technology by lowering barriers
• Provides a framework for training ASR on the fly
  • We can train ASR while testing it, and combine human and machine input as we go
  • This intermix ratio will change over time as automation (ASR) becomes more robust