Understandable Production of Massive Synthesis

Brian Langner, Alan W Black

Language Technologies Institute

Carnegie Mellon University

Background
• Applications pushing the limits of speech synthesis are becoming more common
  – …

• Issues with perceived quality and understandability arise more frequently with more challenging uses

• Successful speech applications require understandable output

• What can we do to make speech synthesis more understandable, even in challenging applications?

Massive Synthesis
• Synthesis of such a large amount of content that typical evaluation methods are impractical
• Frequently characterized by continuously generating new content
• Examples:
  – Error reports
  – Business case summaries
  – News readers
  – Weblogs

• Often able to simplify task, though not always

Example Content
• Weblog content is ideal
  – Copious amounts available, continually generated
• Easy to collect from many weblogs
  – ... but is it representative?
• Use an existing collection: TREC Blog06 corpus
  – 11 weeks of over 100,000 RSS/Atom feeds from 2005-06
  – Consists of homepage + all new permalink pages weekly
  – Over 750,000 total collected feeds
  – Includes spam content for realism
  – Multilingual, though only concerned with English for now

Analyzing Massive Synthesis Content

• Preprocess to remove tags and meta content
  – Resulting corpus is 14 GB of “blog” text
• Identify word frequency differences from typical English
  – Try to find “blog-frequent” words unlikely to synthesize well
  – Flag and target for improvement strategies
• Most words fairly normal for English
  – Frequency differs, but words are not unusual
  – Most frequent atypical words: “html”, “blog” (27th/28th)
• High occurrence of acronyms
  – “FAQ”, “mp3s”
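A minimal sketch of the frequency comparison described on this slide, assuming both the blog text and a reference sample of typical English are available as plain strings. The function names, the tokenizer, and the `top_n` cutoff are illustrative assumptions, not the analysis actually run on the Blog06 data.

```python
from collections import Counter
import re

# Assumption: a crude tokenizer; a real pipeline would use the synthesizer's own.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def rank_table(text):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(TOKEN_RE.findall(text.lower()))
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def blog_frequent_words(blog_text, reference_text, top_n=50):
    """Return (word, blog_rank) pairs that are in the blog top_n
    but not in the reference top_n."""
    blog_ranks = rank_table(blog_text)
    ref_ranks = rank_table(reference_text)
    flagged = []
    for word, rank in sorted(blog_ranks.items(), key=lambda kv: kv[1]):
        if rank > top_n:
            break
        # Words missing from the reference count as ranked below the cutoff.
        if ref_ranks.get(word, top_n + 1) > top_n:
            flagged.append((word, rank))
    return flagged
```

Words flagged this way (such as “html” or “blog”) are candidates for checking how well they synthesize.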

Common Problems
• Prevalence and variety of non-standard words
  – Technical jargon
  – Typos / Spelling errors
    • “the-teh”, “lose-loose”, “voila-viola”, etc.
  – l33t5p33k
  – Expressive spelling
    • “soooooo…..”
  – Usernames/handles
• Must be rendered understandably to be useful
  – “leet” rather than “el-three-three-tee”

• Can group NSW into classes to deal with them
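One way to picture grouping non-standard words (NSW) into classes is a small pattern-based classifier. The class labels and regular expressions below are assumptions chosen for illustration; typos and usernames would need a lexicon or page structure and are not covered.

```python
import re

# Assumed classes and patterns, checked in order; first match wins.
NSW_PATTERNS = [
    # Same letter three or more times in a row, e.g. "soooooo".
    ("expressive", re.compile(r"([a-z])\1{2,}", re.IGNORECASE)),
    # Letters and digits mixed inside one token, e.g. "l33t5p33k".
    ("leet", re.compile(r"^(?=.*[a-z])(?=.*\d)[a-z\d]+$", re.IGNORECASE)),
    # All capitals, optionally pluralized, e.g. "FAQ" or "FAQs".
    ("acronym", re.compile(r"^[A-Z]{2,}s?$")),
]


def classify_token(token):
    """Return the first matching NSW class label, or "standard"."""
    for label, pattern in NSW_PATTERNS:
        if pattern.search(token):
            return label
    return "standard"
```

The heuristics are deliberately crude: `classify_token("mp3s")` returns `"leet"` even though “mp3s” is really an acronym, which shows why each class needs its own exceptions.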

General Improvements
• Use formatting and structure to guide synthesis
  – Title, articles, comments, ads, links, …
  – Emphasized in text → emphasized in spoken output
  – Expressive spelling
• Handle/ignore formatting problems
  – Missing HTML tags common
  – Improperly rendered HTML entities
    • Don’t say “ampersand hash eight two one two semicolon”
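A minimal sketch of cleaning markup before synthesis, so that an entity such as `&#8212;` is never read out character by character. It uses only the Python standard library; the function name and the decision to drop unresolved entities are assumptions.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")          # any tag, balanced or not
RAW_ENTITY_RE = re.compile(r"&#?\w+;")   # entity-like text that survived decoding


def clean_for_synthesis(markup):
    """Strip tags and decode entities so only speakable text remains."""
    text = TAG_RE.sub(" ", markup)       # missing close tags do not matter here
    text = html.unescape(text)           # "&#8212;" becomes U+2014, "&amp;" becomes "&"
    text = html.unescape(text)           # second pass for double-escaped input
    text = RAW_ENTITY_RE.sub(" ", text)  # assumption: silently drop what is left
    return " ".join(text.split())        # collapse whitespace
```

Decoding twice handles feeds where an entity was escaped a second time (for example `&amp;#8212;`), which is one common form of “improperly rendered” entity.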

General Improvements
• Content summarization
  – How to present very long content?
  – Several ways to summarize
    • Summarize articles and note existence of comments
    • Summarize articles and comments
    • Identify number of new articles and comments
    • More abstract ideas
  – Subsetting
    • Speak enough of the content to allow the user to decide to hear more or continue to the next item
  – Appropriate choice likely depends on user preferences
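The subsetting idea can be sketched as speaking only the first few sentences of an item and reporting how much remains. The sentence splitter and the default of two sentences are assumptions; the right amount likely depends on user preferences, as noted above.

```python
import re


def preview(text, max_sentences=2):
    """Return (text_to_speak, number_of_sentences_left_unspoken)."""
    # Assumption: sentences end in ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    head = sentences[:max_sentences]
    remaining = len(sentences) - len(head)
    return " ".join(head), remaining
```

An application could speak the preview, then ask whether to continue whenever the remaining count is above zero.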

General Improvements
• Phrase boundaries and prosody
  – Improved phrase breaks → more understandable synthesis
  – Effect amplified with informal writing
    • “word soup”
• Multiple voices, non-speech output
  – Use different voices to segment content
    • “narrator”, “male commenter”, “female commenter”, etc.
  – Single voice with multiple styles may work as well
  – Use non-speech sounds to render some tokens
    • Laughing for “LOL” rather than trying to pronounce it
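As a rough illustration of combining voice roles with non-speech sounds, the sketch below turns (role, text) segments into an output plan. The plan format, the `SOUND_TOKENS` table, and the role names are hypothetical; no particular synthesizer API is implied.

```python
import re

# Hypothetical table of tokens rendered as sounds instead of speech.
SOUND_TOKENS = {"lol": "laugh"}


def plan_output(segments):
    """segments: list of (role, text) pairs.
    Returns ("speak", role, text) and ("sound", role, name) steps in order."""
    plan = []
    for role, text in segments:
        buffer = []
        for word in text.split():
            key = re.sub(r"\W", "", word).lower()  # ignore punctuation and case
            if key in SOUND_TOKENS:
                if buffer:  # flush speech collected so far
                    plan.append(("speak", role, " ".join(buffer)))
                    buffer = []
                plan.append(("sound", role, SOUND_TOKENS[key]))
            else:
                buffer.append(word)
        if buffer:
            plan.append(("speak", role, " ".join(buffer)))
    return plan
```

Each step keeps its role, so a downstream renderer could map “narrator” and the commenter roles to different voices or styles.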

Evaluation
• Synthesis evaluation is challenging
• Typically evaluate independent of domain
• Requires human listeners
  – Slow, expensive
• Massive synthesis even harder
  – Too much content for listeners to evaluate
• Evaluating some content likely to help
  – Especially if it’s chosen based on likelihood to have errors
• Key to find as many errors as possible without listeners
• Prioritize error correction
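One simple way to choose content by likelihood of errors is to rank items by the share of tokens missing from the synthesizer's lexicon and send only the top few to listeners. The scoring function and the `budget` parameter are assumptions for illustration, not a metric proposed in the talk.

```python
import re

WORD_RE = re.compile(r"[A-Za-z0-9']+")


def error_likelihood(text, lexicon):
    """Fraction of tokens not found in the lexicon (0.0 for empty text)."""
    tokens = WORD_RE.findall(text)
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token.lower() not in lexicon)
    return unknown / len(tokens)


def pick_for_listening(items, lexicon, budget):
    """Return the `budget` items most likely to contain synthesis errors."""
    ranked = sorted(items,
                    key=lambda text: error_likelihood(text, lexicon),
                    reverse=True)
    return ranked[:budget]
```

The same score could also order a correction queue, so the most error-prone tokens are fixed first.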

Simple Study
• Implement several modifications for “weblog text”
  – “number-to-letter” rules
  – Syllable boundaries marked by case – “iTunes”
  – Lexical entries for common neologisms – “pwn”
• Synthesize typical massive synthesis content
  – Entries from blogs, random Wikipedia article, Blog06 data
• Subjects listen to 6 examples, one from each source
• Asked to identify which version they prefer, and by how much on a scale of 1-5
• Subjects all speech synthesis experts
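The three text modifications listed on this slide might look roughly like the sketch below. The digit-to-letter table, the respelling of “pwn” as “pone”, and the order in which rules apply are all assumptions; the actual rules used in the study are not given.

```python
import re

# Assumed digit-to-letter substitutions for "number-to-letter" rules.
LEET_MAP = str.maketrans({"0": "o", "1": "l", "3": "e",
                          "4": "a", "5": "s", "7": "t"})

# Assumed lexical entry: respell the neologism so letter-to-sound rules cope.
LEXICON = {"pwn": "pone"}


def normalize_token(token):
    """Rewrite one token into a form easier to pronounce."""
    lower = token.lower()
    if lower in LEXICON:
        return LEXICON[lower]
    # Number-to-letter: only for tokens mixing letters and digits.
    if re.search(r"[A-Za-z]", token) and re.search(r"\d", token):
        return token.translate(LEET_MAP)
    # Case change inside a word marks a boundary: "iTunes" -> "i Tunes".
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)
```

For example, `normalize_token("l33t")` gives `"leet"`. A token like “mp3s” would be mangled by the digit rule, so a practical system would need an exception list for such acronyms.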

Study Examples

STFU Newb
March 14, 2006 8:02 AM
Cyberbullying Report. It's a Microsoft sponsored report talking about intimidation and bullying online. There's a digested version of the survey [PDF]. And don't forget your dose of Cyber Wellness, too.

posted by gsb (13 comments total)

Does anyone else belive this just isn't happening? I mean back when I was a kid we tried to pwn eachothers IRC channels, but that was about it.

posted by delmoi at 8:13 AM on March 14

I absolutely believe this happens. Kids are f%*#ing mean. Girls are viscious to each other. I would be more surprised if kids weren't using technology to expand those behaviours than if they are.

posted by raedyn at 8:24 AM on March 14

[PDF]

belive

eachothers

pwn

f%*#ing

posted by delmoi at 8:13 AM on March 14

posted by raedyn at 8:24 AM on March 14

Results
• All subjects always preferred the modified examples
• Less consistent agreement in degree of preference
• Generally low preference scores
  – Implies only small improvements over baseline
  – Average preference score around 2 or 3
  – Strong preferences rare
  – Sample size too small

Example  Min  Avg  Max
1        1    2.2  3
2        1    3    4
3        1    1.8  2
4        1    2    3
5        2    3    5
6        2    3    4

Discussion
• Some fairly simple modifications result in speech perceived at least slightly better
• More changes might show more obvious preferences
• Need more detailed information about how the speech was perceived
• Anecdotal feedback suggests improved prosody will help significantly
• Humans give hour-long lectures that people can understand; how can synthesizers do that?

Future Directions
• Implement more understandability improvements
  – Time constraints, content structure, etc.
• Perform a more complete evaluation
  – Not enough examples/listeners, but encouraging results
• Need a more formalized evaluation metric
  – User feedback within an application with interested users
  – Hard to find sufficient users who would participate
• Design an application to get users?
  – Web browser that renders content as speech: automatic podcast generator

Questions?
