Understandable Production of Massive Synthesis

Understandable Production of Massive Synthesis Brian Langner, Alan W Black Language Technologies Institute Carnegie Mellon University


Page 1: Understandable Production of Massive Synthesis

Understandable Production of Massive Synthesis

Brian Langner, Alan W Black

Language Technologies Institute

Carnegie Mellon University

Page 2: Understandable Production of Massive Synthesis

Background

• Applications pushing the limits of speech synthesis becoming more common
  – …

• Issues with perceived quality and understandability arise more frequently with more challenging uses

• Successful speech applications require understandable output

• What can we do to make speech synthesis more understandable, even in challenging applications?

Page 3: Understandable Production of Massive Synthesis

Massive Synthesis

• Synthesis of such a large amount of content that typical evaluation methods are impractical
• Frequently characterized by continuously generating new content
• Examples:
  – Error reports
  – Business case summaries
  – News readers
  – Weblogs

• Often able to simplify task, though not always

Page 4: Understandable Production of Massive Synthesis

Example Content

• Weblog content is ideal
  – Copious amounts available, continually generated
• Easy to collect from many weblogs
  – ... but is it representative?
• Use an existing collection: TREC Blog06 corpus
  – 11 weeks of over 100,000 RSS/Atom feeds from 2005-06
  – Consists of homepage + all new permalink pages weekly
  – Over 750,000 total collected feeds
  – Includes spam content for realism
  – Multilingual, though only concerned with English for now

Page 5: Understandable Production of Massive Synthesis

Analyzing Massive Synthesis Content

• Preprocess to remove tags and meta content
  – Resulting corpus is 14GB of “blog” text
• Identify word frequency differences from typical English
  – Try to find “blog-frequent” words unlikely to synthesize well
  – Flag and target for improvement strategies
• Most words fairly normal for English
  – Frequencies differ, but words are not unusual
  – Most frequent atypical words: “html”, “blog” (27th/28th)
• High occurrence of acronyms
  – “FAQ”, “mp3s”
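The frequency comparison described on this slide could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the regex-based tag stripping, the function names, and the `min_ratio` and `floor` parameters are all assumptions made for the example.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")          # crude; a real pipeline would use an HTML parser
WORD_RE = re.compile(r"[a-z0-9']+")


def strip_tags(markup: str) -> str:
    """Replace anything that looks like a tag with a space."""
    return TAG_RE.sub(" ", markup)


def relative_freqs(text: str) -> dict:
    """Map each lowercased word to its share of all word tokens."""
    counts = Counter(WORD_RE.findall(text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def blog_frequent(blog_markup: str, reference_text: str,
                  min_ratio: float = 5.0, floor: float = 1e-6) -> list:
    """Return (word, ratio) pairs for words far more frequent in the blog text.

    `floor` (an assumed smoothing value) stands in for the reference
    frequency of words that never occur in the reference text.
    """
    blog = relative_freqs(strip_tags(blog_markup))
    ref = relative_freqs(reference_text)
    flagged = []
    for word, freq in blog.items():
        ratio = freq / max(ref.get(word, 0.0), floor)
        if ratio >= min_ratio:
            flagged.append((word, ratio))
    # Highest ratio first; alphabetical order breaks ties.
    return sorted(flagged, key=lambda pair: (-pair[1], pair[0]))
```

Words flagged this way would still need a second check for whether the synthesizer actually mispronounces them; frequency alone only says where to look.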

Page 6: Understandable Production of Massive Synthesis

Common Problems

• Prevalence and variety of non-standard words
  – Technical jargon
  – Typos / Spelling errors
    • “the-teh”, “lose-loose”, “voila-viola”, etc.
  – l33t5p33k
  – Expressive spelling
    • “soooooo…..”
  – Usernames/handles
• Must be rendered understandably to be useful
  – “leet” rather than “el-three-three-tee”
• Can group NSW into classes to deal with them
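One way to picture grouping non-standard words (NSW) into classes is a small rule-based tagger. The sketch below is only an illustration: the class labels, the regular expressions, and the tiny `KNOWN_TYPOS` table are assumptions, not the classes used in the talk.

```python
import re

# Ordered (label, pattern) rules; the first match wins. All are illustrative.
CLASS_RULES = [
    ("expressive", re.compile(r"([a-z])\1{2,}", re.IGNORECASE)),   # soooooo
    ("leet", re.compile(r"^(?=.*[a-z])(?=.*\d)[a-z\d]+$", re.IGNORECASE)),  # l33t5p33k
    ("acronym", re.compile(r"^[A-Z]{2,5}$")),                      # FAQ
]

# Hypothetical typo table; real coverage would need a much larger list.
KNOWN_TYPOS = {"teh": "the"}


def classify_token(token: str) -> str:
    """Assign a token to one coarse NSW class, or "standard"."""
    core = token.strip(".,!?…\"'")
    if core.lower() in KNOWN_TYPOS:
        return "typo"
    for label, pattern in CLASS_RULES:
        if pattern.search(core):
            return label
    return "standard"
```

Two limits of such token-level rules are worth noting. Real-word confusions like “lose-loose” cannot be caught without context, and a token such as “mp3s” would be tagged `leet` here even though it should be read as an acronym, so a lexicon of exceptions would be needed in practice.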

Page 7: Understandable Production of Massive Synthesis

General Improvements

• Use formatting and structure to guide synthesis
  – Title, articles, comments, ads, links, …
  – Emphasized in text → emphasized in spoken output
  – Expressive spelling
• Handle/ignore formatting problems
  – Missing HTML tags common
  – Improperly rendered HTML entities
    • Don’t say “ampersand hash eight two one two semicolon”
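The entity problem above (the spoken string corresponds to `&#8212;`, the character reference for U+2014) can be handled before text reaches the synthesizer. The following is a minimal sketch using Python's standard `html.unescape`; the repeated-decoding loop and the `SPOKEN_PUNCT` mapping are assumptions chosen for illustration.

```python
import html

# Assumed mapping from decoded punctuation to something a synthesizer
# can treat as a pause; the choices here are illustrative only.
SPOKEN_PUNCT = {
    "\u2014": ", ",   # the dash that &#8212; decodes to
    "\u2013": ", ",   # shorter dash
    "\u00a0": " ",    # non-breaking space
}


def clean_entities(text: str, max_passes: int = 3) -> str:
    """Decode character references, repeating to undo double escaping
    such as "&amp;#8212;". Stops once the text no longer changes."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


def for_speech(text: str) -> str:
    """Decode entities, then swap unspeakable punctuation for pauses."""
    text = clean_entities(text)
    for char, replacement in SPOKEN_PUNCT.items():
        text = text.replace(char, replacement)
    return text
```

Decoding more than once assumes the source never intends a literal entity string; a page that discusses HTML itself would need a more careful policy.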

Page 8: Understandable Production of Massive Synthesis

General Improvements

• Content summarization
  – How to present very long content?
  – Several ways to summarize
    • Summarize articles and note existence of comments
    • Summarize articles and comments
    • Identify number of new articles and comments
    • More abstract ideas
  – Subsetting
    • Speak enough of the content to allow the user to decide to hear more or continue to the next item
  – Appropriate choice likely depends on user preferences
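The subsetting idea can be illustrated with a simple preview function that speaks whole sentences up to a word budget and reports whether more content remains. This is a sketch under assumptions: the sentence splitter is a naive regex and the `max_words` budget is an invented parameter, not something specified in the talk.

```python
import re


def preview(text: str, max_words: int = 30):
    """Return (preview_text, has_more).

    Whole sentences are added until the next one would exceed
    `max_words`; the first sentence is always included.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chosen = []
    used = 0
    for sentence in sentences:
        length = len(sentence.split())
        if chosen and used + length > max_words:
            break
        chosen.append(sentence)
        used += length
    has_more = len(chosen) < len(sentences)
    return " ".join(chosen), has_more
```

An application could use `has_more` to decide whether to prompt the listener to hear the rest or move on to the next item.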

Page 9: Understandable Production of Massive Synthesis

General Improvements

• Phrase boundaries and prosody
  – Improved phrase breaks → more understandable synthesis
  – Effect amplified with informal writing
    • “word soup”
• Multiple voices, non-speech output
  – Use different voices to segment content
    • “narrator”, “male commenter”, “female commenter”, etc.
  – Single voice with multiple styles may work as well
  – Use non-speech sounds to render some tokens
    • Laughing for “LOL” rather than trying to pronounce it
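Voice switching and non-speech rendering can be thought of as turning segmented content into a list of playback events. The sketch below assumes hypothetical role names (`article`, `comment_m`, `comment_f`), an invented `Event` type, and a one-entry sound table; none of these come from the talk or from any particular synthesizer's API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str      # "speech" or "sound"
    voice: str     # voice name for speech; empty for sounds
    payload: str   # text to speak, or the name of a sound effect


# Hypothetical role-to-voice and token-to-sound tables.
ROLE_VOICES = {
    "article": "narrator",
    "comment_m": "male commenter",
    "comment_f": "female commenter",
}
SOUND_TOKENS = {"lol": "laugh"}


def plan(segments):
    """Turn (role, text) segments into an ordered list of Events."""
    events = []
    for role, text in segments:
        voice = ROLE_VOICES.get(role, "narrator")
        buffer = []
        for token in text.split():
            key = token.strip(".,!?").lower()
            if key in SOUND_TOKENS:
                # Flush pending speech so the sound lands in the right place.
                if buffer:
                    events.append(Event("speech", voice, " ".join(buffer)))
                    buffer = []
                events.append(Event("sound", "", SOUND_TOKENS[key]))
            else:
                buffer.append(token)
        if buffer:
            events.append(Event("speech", voice, " ".join(buffer)))
    return events
```

For example, `plan([("comment_m", "LOL nice one")])` yields a `laugh` sound event followed by a speech event in the “male commenter” voice.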

Page 10: Understandable Production of Massive Synthesis

Evaluation

• Synthesis evaluation is challenging
• Typically evaluate independent of domain
• Requires human listeners
  – Slow, expensive
• Massive synthesis even harder
  – Too much content for listeners to evaluate
• Evaluating some content likely to help
  – Especially if it’s chosen based on likelihood to have errors
• Key to find as many errors as possible without listeners
• Prioritize error correction
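Choosing content “based on likelihood to have errors” could be approximated by ranking items on how many of their tokens fall outside the synthesizer's lexicon. The out-of-vocabulary rate used below is an assumed proxy for error likelihood, and the function names and `budget` parameter are invented for this sketch.

```python
import re


def oov_rate(text: str, lexicon: set) -> float:
    """Fraction of whitespace-separated tokens not found in `lexicon`."""
    tokens = re.findall(r"\S+", text)
    if not tokens:
        return 0.0
    unknown = sum(
        1 for token in tokens
        if token.strip(".,!?\"'()[]").lower() not in lexicon
    )
    return unknown / len(tokens)


def pick_for_listening(items, lexicon: set, budget: int):
    """Return the `budget` items most likely to contain synthesis errors,
    using the out-of-vocabulary rate as the ranking signal."""
    ranked = sorted(items, key=lambda item: oov_rate(item, lexicon), reverse=True)
    return ranked[:budget]
```

Only the selected items would then go to human listeners, which keeps the listening effort fixed regardless of how much content is generated.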

Page 11: Understandable Production of Massive Synthesis

Simple Study

• Implement several modifications for “weblog text”
  – “number-to-letter” rules
  – Syllable boundaries marked by case – “iTunes”
  – Lexical entries for common neologisms – “pwn”
• Synthesize typical massive synthesis content
  – Entries from blogs, random Wikipedia article, Blog06 data
• Subjects listen to 6 examples, one from each source
• Asked to identify which version they prefer, and by how much on a scale of 1-5
• Subjects all speech synthesis experts
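The three modifications listed for the study might look something like the following in code. This is a guess at their general shape, not the study's implementation: the digit-to-letter table, the case-splitting regex, and the respelling of “pwn” as “pone” are all assumptions.

```python
import re

# Assumed digit-to-letter substitutions for leetspeak-style tokens.
LEET_MAP = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"})

# Hypothetical lexical entries: token -> respelling a synthesizer can read.
LEXICON = {"pwn": "pone"}


def number_to_letter(token: str) -> str:
    """Apply digit substitutions only to tokens mixing letters and digits."""
    has_letter = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    return token.translate(LEET_MAP) if has_letter and has_digit else token


def split_on_case(token: str) -> list:
    """Split at lower-to-upper case changes: "iTunes" -> ["i", "Tunes"]."""
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)


def normalize(token: str) -> list:
    """Lexicon lookup first, then digit substitution and case splitting."""
    if token.lower() in LEXICON:
        return [LEXICON[token.lower()]]
    return split_on_case(number_to_letter(token)) or [token]
```

Blind digit substitution has an obvious cost: a token like “mp3s” would be rewritten incorrectly, so rules of this kind need an exception list or a classifier in front of them.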

Page 12: Understandable Production of Massive Synthesis

Study Examples

STFU Newb

March 14, 2006 8:02 AM

Cyberbullying Report. It's a Microsoft sponsored report talking about intimidation and bullying online. There's a digested version of the survey [PDF]. And don't forget your dose of Cyber Wellness, too.

posted by gsb (13 comments total)

Does anyone else belive this just isn't happening? I mean back when I was a kid we tried to pwn eachothers IRC channels, but that was about it.

posted by delmoi at 8:13 AM on March 14

I absolutely believe this happens. Kids are f%*#ing mean. Girls are viscious to each other. I would be more surprised if kids weren't using technology to expand those behaviours than if they are.

posted by raedyn at 8:24 AM on March 14

[Callouts on slide: “[PDF]”; “belive”; “eachothers”; “pwn”; “f%*#ing”; “posted by delmoi at 8:13 AM on March 14”; “posted by raedyn at 8:24 AM on March 14”]

Page 13: Understandable Production of Massive Synthesis

Results

• All subjects always preferred the modified examples
• Less consistent agreement in degree of preference
• Generally low preference scores
  – Implies only small improvements over baseline
  – Average preference score around 2 or 3
  – Strong preferences rare
  – Sample size too small

Example  Min  Avg  Max
1        1    2.2  3
2        1    3    4
3        1    1.8  2
4        1    2    3
5        2    3    5
6        2    3    4
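As a quick check on “around 2 or 3”, the per-example averages in the table can be averaged directly. Note the assumption: an unweighted mean of the six averages equals the overall mean score only if every example was rated by the same number of listeners, which the slides do not state.

```python
from statistics import mean

# Example number -> (min, avg, max) preference score, copied from the table.
RESULTS = {
    1: (1, 2.2, 3),
    2: (1, 3, 4),
    3: (1, 1.8, 2),
    4: (1, 2, 3),
    5: (2, 3, 5),
    6: (2, 3, 4),
}

averages = [avg for (_low, avg, _high) in RESULTS.values()]
overall = mean(averages)   # unweighted mean of the six per-example averages

# Examples where at least one listener gave the strongest preference (5).
top_scored = [ex for ex, (_low, _avg, high) in RESULTS.items() if high == 5]
```

Under that assumption the overall mean comes to about 2.5, and only example 5 ever received the maximum score, consistent with strong preferences being rare.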

Page 14: Understandable Production of Massive Synthesis

Discussion

• Some fairly simple modifications result in speech perceived at least slightly better
• More changes might show more obvious preferences
• Need more detailed information about how the speech was perceived
• Anecdotal feedback suggests improved prosody will help significantly
• Humans give hour-long lectures that people can understand; how can synthesizers do that?

Page 15: Understandable Production of Massive Synthesis

Future Directions

• Implement more understandability improvements
  – Time constraints, content structure, etc.
• Perform a more complete evaluation
  – Not enough examples/listeners, but encouraging results
• Need a more formalized evaluation metric
  – User feedback within an application with interested users
  – Hard to find sufficient users who would participate
• Design an application to get users?
  – Web browser that renders content as speech: automatic podcast generator

Page 16: Understandable Production of Massive Synthesis

Questions?
