Post on 25-Feb-2016
Listening to the Gamer: Getting Speech Recognition in Games Right
Speaker Information
• Jason Hewitt, Advanced Technology Group, Microsoft
• Dr. Mike Froggatt, Developer Lead, Microsoft Game Studios
Audience
• Are you thinking about adding speech to your game?
• Are you targeting a console or the PC?
• Portable platforms are also good!
Takeaway: Speech is on our consoles, and it’s easy to add.
Speech in games is not new!
• It was in Unreal Tournament 2004!
• It was on the PS2!
• It’s there, ready to use.
Two ways to listen
• General Dictation
• Command and Control
Multiple Solutions
Fonix http://www.fonixspeech.com
• Platforms: Win32, Xbox 360, PS3, Wii
• Languages: US English, UK English, French, German, Italian, Spanish, Japanese, Korean
Voxler http://www.voxler.eu
• Platforms: Win32, Xbox 360, PS3, Wii, Nintendo DS, iPhone
• Languages: “All major English dialects and European languages”
NuiSpeech
• Kinect only
• Languages: English (US & UK), French, Japanese, Spanish (Mexico)
• Preview models of French (Canada and France), German, Italian, Spanish (Spain) and English (Australia)
• Designed specifically for a 10 ft experience
Others are out there
Microphones Overview
Platform       Headset Mic   Handheld Mic   Mic Array/Room Mic   Integrated
PC             ?             ?
Xbox 360       ?             ?              Kinect only
PS3            ?             ?              Eye only
Wii            ?                            Wii Speak only
Sony NGP                                                         Yes
3DS/DS/DSi                                                       Yes
Smart phones                                                     Yes
Each platform has its own microphones and platform capabilities, so you can either take the lowest common denominator or you can customize to each platform’s strengths
Speech has two inputs
Grammar + Speech → Recognition Engine → Results
Deciding Speech’s Role in the Game
Apply Good Design Principles
• Set your goals at the beginning of the project
  • Don’t add speech recognition with a month to spare
• Evaluate the tech
• Prototype
• Rethink the goals
• Be consistent
  • Users expect that what works once works always
• Decide early if you want to count on a mic
  • Remember: requiring Kinect or Move = free mic
To require a mic?
• It’s natural!
• New gameplay mechanics
• Expands user control
• Controller fallbacks not necessary
  • But are still a plus!
Or to not require a mic?
• Not everyone will have a mic
• Accessibility
  • Some gamers won’t or can’t talk
Menus and Pausing
• Don’t add at the last second!
  • Think of your menu names and your grammar design at the same time
• If you do implement it, let users skip menu pages
• Beware of a “Pause”
  • False positives can break flow
  • It may be best to gather intent first
  • Consider allowing users to disable it
Key Scenario Integration
• Focus on the scenarios in the game that can provide the biggest impact
  • Dialog tree navigation
  • Merchant/shop interaction
• Most ideas here can again be optional; allow the controller to be a backup
Full Title Integration
• Doesn’t mean voice only (but could)
• When approaching the game’s control scheme, consider whether voice makes sense, for example:
  • Squad commands
  • Activation of controls
  • Help mechanisms
  • Volume of the player’s speech in a horror or stealth game
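As one way to picture the last idea, a game could measure how loudly the player is speaking and treat loud speech as "noise" that enemies can hear. The sketch below is only an illustration: the function names, the float sample format and the -20 dB threshold are assumptions, not anything taken from a shipped title.

```python
import math

def rms_level_db(samples):
    """Return the RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")

def is_too_loud(samples, threshold_db=-20.0):
    """True when the player's voice exceeds the (assumed) stealth threshold."""
    return rms_level_db(samples) > threshold_db
```

A real title would tune the threshold per microphone type, since headset and room microphones report very different levels.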
What can I say?
• Teaching styles
  • “See it! Say it!” then “Know it! Say it!”
  • Repeat after me
  • Explore on your own
• Screen awareness
• Off-screen awareness
Expandable Menu (diagram): Soldier! → Joe, Frank → Attack, Defend, Retreat; How’s the weather?
The Grammar
The Basics
• XML based
  • All use the W3C format or a subset of it: http://www.w3.org/TR/speech-grammar/
• Multiple rules can be activated or deactivated at once
• Custom pronunciations are available
  • This helps with in-game items
  • This can also help with words that are difficult to pronounce or understand
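To make the format concrete, here is a minimal sketch of a W3C-style grammar with two rules, loaded with Python's standard XML parser to list the phrases under each rule. The rule names and phrases are made up for illustration, and a real engine would load the grammar through its own API rather than through this helper.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical grammar: two public rules that could be switched on and off separately.
GRAMMAR_XML = """<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" root="squad">
  <rule id="squad" scope="public">
    <one-of>
      <item>attack</item>
      <item>defend</item>
      <item>retreat</item>
    </one-of>
  </rule>
  <rule id="shop" scope="public">
    <one-of>
      <item>buy</item>
      <item>sell</item>
    </one-of>
  </rule>
</grammar>"""

def phrases_by_rule(xml_text):
    """Map each rule id to the list of phrases in its <item> elements."""
    root = ET.fromstring(xml_text)
    rules = {}
    for rule in root.findall(f"{{{SRGS_NS}}}rule"):
        items = rule.findall(f".//{{{SRGS_NS}}}item")
        rules[rule.get("id")] = [" ".join(item.text.split()) for item in items]
    return rules

RULES = phrases_by_rule(GRAMMAR_XML)
```

Listing phrases per rule like this is also a cheap way to review the grammar alongside the menu names during design.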
Grammar Size
• Check with your middleware provider on how many phrases it recommends
  • The key point is that going beyond the recommended number of phrases means more chances for phrases to be similar and confusable
• Manage active phrases with rules
  • Remember you don’t need the shopkeeper recognition when fighting the dragon
  • Pause menu interaction should reduce the set of active rules
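One simple way to keep the active phrase set small is to tie rule activation to game state. The following is a sketch under assumed names: the states, rule ids and phrases are hypothetical, and the call that actually activates a rule depends on your speech middleware.

```python
# Hypothetical mapping from game state to the grammar rules that should be active.
RULES_BY_STATE = {
    "combat": {"squad_commands"},
    "shop": {"shop_commands"},
    "paused": {"pause_menu"},
}

# Hypothetical phrases per rule.
PHRASES = {
    "squad_commands": ["attack", "defend", "retreat"],
    "shop_commands": ["buy", "sell", "how much"],
    "pause_menu": ["resume", "options", "quit"],
}

def active_phrases(state):
    """Return the sorted phrases the recognizer should listen for in this state."""
    active_rules = RULES_BY_STATE.get(state, set())
    return sorted(p for rule in active_rules for p in PHRASES[rule])
```

With this layout the shopkeeper phrases are simply not in play while fighting the dragon, and an unknown state listens for nothing.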
Evolving the Grammar
• Start with a small initial word set
  • Do not proactively add too many recognition phrases
  • See through play testing where gamers go
  • Handle the common cases
• Synonyms are a slippery slope
  • Especially in a “See it! Say it!” scenario
• Multiple iterations provide better tuning
A Work in Progress
Design → Implement → Test (repeat)
Test Each Iteration!
• Record your users saying phrases both in and out of grammar
• Consider automated nightly tests of each grammar iteration
• Measure false negatives, false positives, and success rates
• Test in game scenarios
  • If two grammars are active at the same time, you must test them together
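A nightly test could replay recorded utterances through the recognizer and tally the outcomes. The sketch below assumes each result is a pair of (expected command, recognized command), with None meaning "out of grammar" or "nothing recognized". The outcome names, and the separate "wrong_command" bucket, are choices made for this illustration.

```python
from collections import Counter

def score(results):
    """Tally recognition outcomes for (expected, recognized) pairs."""
    counts = Counter()
    for expected, recognized in results:
        if expected is None:
            # Out-of-grammar speech: anything recognized is a false positive.
            counts["correct_reject" if recognized is None else "false_positive"] += 1
        elif recognized is None:
            counts["false_negative"] += 1
        elif recognized == expected:
            counts["success"] += 1
        else:
            counts["wrong_command"] += 1
    return counts

def rate(counts, outcome):
    """Fraction of all utterances that ended in the given outcome."""
    total = sum(counts.values())
    return counts[outcome] / total if total else 0.0
```

Tracking these numbers per grammar iteration makes it easier to see whether adding a phrase helped or just made existing phrases more confusable.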
Working with Limitations
• Speech is not perfect
• Generally speech works best when
  • Background noise is at a minimum
  • The speaker enunciates
  • The grammar size is within the recommendation
Working with Side Talk
• There may be other noises in the room that the mic picks up
• Remember you can still respond to side talk!
  • “Hey, you talking to me?”
  • “Sorry, my (language) is very limited.”
• Test with a garbage rule
Working with Failure
• Even a speech recognition failure should be a success
• Handle misfires and repeats as part of the game
  • NPCs can have headaches, migraines, or explain their misunderstanding
  • “Sorry, what was that? I was thinking about sheep.”
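A failed or low-confidence recognition can be routed to an in-character response instead of silence. This is a minimal sketch: the 0.5 threshold, the function name and the tuple return shape are assumptions, while the two lines of dialogue come from the slides.

```python
import random

CONFUSED_LINES = [
    "Sorry, what was that? I was thinking about sheep.",
    "Hey, you talking to me?",
]

def respond(command, confidence, threshold=0.5, rng=random):
    """Return ("do", command) when confident, else ("say", an in-character line)."""
    if command is None or confidence < threshold:
        return ("say", rng.choice(CONFUSED_LINES))
    return ("do", command)
```

For example, `respond("attack", 0.9)` returns `("do", "attack")`, while a misfire makes the NPC speak one of the confused lines.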
Localization
• Begin localization after most design decisions are locked down
  • Iterate and design in your native language
  • Begin before it’s too late to work with translators, manual, etc.
• Be wary of text/UI translators
  • Spoken language can vary from the written language
  • Recommend audio translators
• Leverage your existing in-game dialogue translation team
  • They know the right voice to use for communication
  • “See it! Say it!” implementations will need to be translated by this same team
• Have native speakers testing
  • More than one native speaker is always better
Localization
• Provide plenty of background on the situation to the translator. The more info, the better.
  • You should be doing this for in-game dialogue already; your team’s localization expert will be able to provide guidance here.
• Different languages map one word to three words and three to one, so provide context for each situation
• Remember to coordinate changes across languages
Kinectimals Speech Post-Mortem
“If I could talk with the animals…”
• Kinectimals was the standard-bearer for speech recognition at Kinect launch
• Lofty goals:
  • Natural interaction with the animal through speech
  • Praise, issue commands, call the animal by name
• Ultimately delivered robust recognition for a reasonable command set
• Animal naming was the most challenging component to implement
Goal: Perfect Feline Behaviour
Design
<grammar xml:lang="en-us" version="1.0" root="dash_commands">
  <rule id="dash_commands" scope="public">
    <one-of>
      <item>
        Hey, is this thing on? Xbox, can you hear me? Hey Jimmy! Come look at this! The Xbox understands me!
        <tag> exec "dash.xex /upgrade_to_gold_account /quiet" </tag>
      </item>
      <item>
        Oh Xbox, you’re my only friend - my girlfriend’s left me and no one understands me like you do.
        <tag> exec "halo_reach.xex" </tag>
      </item>
    </one-of>
  </rule>
</grammar>
Design Giveth…
• Game design is our friend
  • No expectation of animals understanding speech perfectly
  • Player more forgiving of incorrect or failed recognition
  • Children interpreted failed recognition as the animal “being naughty”
…Design Taketh Away
• Design is our enemy
  • Familiar situation produces habitual response
  • Expectation that what a real animal responds to, the game will respond to
  • Commands framed with non-essential vocal noise
    • “Hey Skittles, sit down, please”
  • Speech commands often mode-less
    • Where to allow / disallow them?
Don’t Both Talk at Once
• Narrator character introduced late in design
  • Gave instruction on gestures and speech commands to use
  • Narrator saying “Sit down” often made the animal sit down
• Specific hardware can help with this
  • Kinect has an array microphone with Multichannel Echo Cancellation (MEC)
  • Effectiveness dependent on microphone calibration
• Better to avoid the issue altogether if possible
  • Disable speech recognition while the narrator is speaking
  • Watch out for NPC speech triggering commands during gameplay
    • Example: team-mates shouting “Take cover!”
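Disabling recognition while the game itself is talking can be as simple as a time-based gate. The sketch below is an assumption-heavy illustration: the class name, the caller-supplied clock and the short "tail" after each line (to cover room echo) are not from the talk.

```python
class SpeechGate:
    """Ignore recognition results while narrator or NPC audio is playing."""

    def __init__(self, tail_seconds=0.3):
        # Extra mute time after each line ends; an assumed tuning value.
        self.tail_seconds = tail_seconds
        self._muted_until = 0.0

    def narrator_line_started(self, now, duration):
        """Extend the muted window to cover a line of the given duration."""
        end = now + duration + self.tail_seconds
        self._muted_until = max(self._muted_until, end)

    def accepts(self, now):
        """True when recognition results should be acted on."""
        return now >= self._muted_until
```

The same gate could be fed by NPC barks such as “Take cover!” so that team-mate dialogue does not trigger player commands.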
Implementation
Final Grammar
• Most complex command grammar:
  • Concurrent detection of 16 different phrases
  • Mapped to 9 distinct commands (“Sit” equivalent to “Sit down”)
  • Name recognition also running
• Some state-based selection of different grammars
  • However this was the worst-case scenario (most rules active)
• Manually specifying phonemes for a given rule can help increase recognition accuracy
  • May be needed for proprietary or game-specific terms like character names
  • Built-in text-to-phoneme rules may not work well in these cases
<rule id="reserved" scope="public">
  <one-of>
    <item>
      <token sapi:display="Kinect" sapi:pron="K IH N EH K T"> kinect </token>
    </item>
  </one-of>
</rule>
Playing <tag>
• The <tag> element allows a single semantic to be associated with multiple utterances
  • Also provides language invariance
• Great way to encode per-rule data
  • Accept confidence threshold, for example
• Parsed at run-time, so don’t go overboard
<item>
  <one-of>
    <item> sit </item>
    <item> sit down </item>
  </one-of>
  <tag>Sit</tag>
</item>
<item>
  <one-of>
    <item> go play <tag>conf=0.45</tag> </item>
  </one-of>
  <tag>Dismiss</tag>
</item>
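On the game side, the tags in the two <item> fragments above can be turned into a lookup table of phrase → (semantic, confidence threshold). This is a sketch only: the wrapping <one-of> element, the default threshold of 0.6 and the helper names are assumptions, and a real engine would normally hand you the tag text with each result instead of requiring you to re-parse the grammar.

```python
import xml.etree.ElementTree as ET

# The two <item> fragments from the slide, wrapped in one <one-of> for parsing.
ITEMS_XML = """<one-of>
  <item>
    <one-of>
      <item> sit </item>
      <item> sit down </item>
    </one-of>
    <tag>Sit</tag>
  </item>
  <item>
    <one-of>
      <item> go play <tag>conf=0.45</tag> </item>
    </one-of>
    <tag>Dismiss</tag>
  </item>
</one-of>"""

DEFAULT_CONF = 0.6  # assumed default acceptance threshold

def build_phrase_table(xml_text, default_conf=DEFAULT_CONF):
    """Map each phrase to (semantic, confidence threshold)."""
    table = {}
    for outer in ET.fromstring(xml_text).findall("item"):
        semantic = outer.find("tag").text.strip()
        for inner in outer.findall("one-of/item"):
            phrase = " ".join(inner.text.split())
            conf = default_conf
            tag = inner.find("tag")
            if tag is not None and tag.text.strip().startswith("conf="):
                conf = float(tag.text.strip().split("=", 1)[1])
            table[phrase] = (semantic, conf)
    return table

def interpret(table, phrase, confidence):
    """Return the semantic for a phrase, or None if below its threshold."""
    semantic, threshold = table[phrase]
    return semantic if confidence >= threshold else None
```

Because game code only ever sees "Sit" or "Dismiss", the localized grammars can use entirely different phrases without touching gameplay logic.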
Please Stop Talking
• Speech is unpredictable
  • Valid utterances may vary widely in length
  • Background noise may end up being processed for recognition
• Changing the state of the speech recognition engine may incur unexpected synchronization delays
  • Can occur when stopping recognition, changing rule states or loading new grammars
• Bugs can become highly context-sensitive
  • May see occasional frame-outs when tested in a noisy open-plan area, but not when tested in a closed office
• Easiest option: run all game-side speech processing on a separate thread
  • Move off the h/w thread that the main game is using
  • Speech will typically not saturate a core
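The separate-thread idea can be sketched with two queues: raw recognition results go in on the speech thread, and the game thread drains finished commands once per frame without ever blocking. This only illustrates the structure; the names and the 0.5 threshold are assumptions, and on a console you would use the platform's native threads with an explicit hardware-thread affinity.

```python
import queue
import threading

MIN_CONFIDENCE = 0.5  # assumed threshold, not from the talk

def speech_worker(raw_results, commands, stop):
    """Runs on its own thread: filters raw recognition results into game commands."""
    while not stop.is_set():
        try:
            phrase, confidence = raw_results.get(timeout=0.05)
        except queue.Empty:
            continue
        try:
            if confidence >= MIN_CONFIDENCE:
                commands.put(phrase.strip().lower())
        finally:
            raw_results.task_done()

def drain(commands):
    """Called once per frame on the game thread; never blocks."""
    pending = []
    while True:
        try:
            pending.append(commands.get_nowait())
        except queue.Empty:
            return pending
```

Any slow engine call (stopping recognition, switching rules, loading a grammar) would then stall only the worker, not the frame.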
Name Your Animal (NYA)
• Allow the player to speak the name they want to use
• No attempt to turn the spoken name into real text (for display)
  • Instead use a pictorial (camera capture) representation for identification
• Implemented as free-form speech-to-phoneme conversion
  • Then use the phonemes to build a grammar rule with a custom pronunciation
• Name used to attract the animal’s attention, just as it would be in real life
• Pushes the limits of NuiSpeech
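Building the "name" rule from a phoneme list might look like the sketch below, which reuses the sapi:display and sapi:pron attributes shown in the reserved-word rule earlier in the talk. The function name, the rule id and the internal label are hypothetical, and the exact rule layout Kinectimals generated is not given in the slides.

```python
from xml.sax.saxutils import escape, quoteattr

def name_rule(rule_id, label, phonemes):
    """Return a grammar rule fragment whose single token uses a custom pronunciation."""
    if not phonemes:
        raise ValueError("at least one phoneme is required")
    pron = " ".join(phonemes)
    return (
        f'<rule id={quoteattr(rule_id)} scope="public">'
        f"<item><token sapi:display={quoteattr(label)} sapi:pron={quoteattr(pron)}>"
        f"{escape(label.lower())}</token></item></rule>"
    )
```

For example, `name_rule("pet_name", "Pet", ["CH", "I", "T", "AX"])` yields a rule whose token carries `sapi:pron="CH I T AX"`; the label is only an internal placeholder, since the game shows a picture instead of text.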
NYA Challenges
• Used a special grammar for speech-to-phoneme conversion
  • Much larger than normal command grammars
  • 11.5MB for the largest NYA grammar vs. 5KB for the largest command grammar
• Also requires a dynamic grammar to add the “name” rule to
  • So even more memory for the acoustic model
• Much more sensitive to environmental noise than the normal speech commands
  • Naming process would sometimes drive itself to completion from noise in the room
• Watched for some reserved terms (“Kinect”, “Xbox”), no attempt to catch swearing etc.
  • Space of potentially prohibited terms simply too large
• Reject names that are too long as difficult for the player to repeat successfully
NYA Flow
• One utterance unlikely to be sufficient to get the “right” name
• Allow a number of attempts to successfully repeat the name
  • Hopefully deals with the player trying to mislead the system
• If no repeat in a sensible number of attempts, prompt the player to try a different name
  • Try to avoid the player getting stuck trying to repeat a “problem” name
• Balance ease of use when the player is using the system “correctly” against rejecting noise as a naming attempt
Example: “CH I Z AX N” generated vs. “CH I T AX” ideal string
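The slides do not say how a "successful repeat" was detected. One plausible approach, shown here purely as an assumption, is to compare phoneme sequences with an edit distance and accept a name once two attempts are close enough. The attempt limit and distance tolerance are made-up tuning values.

```python
def phoneme_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def confirm_name(attempts, max_attempts=4, max_distance=1):
    """Return the first attempt that a later attempt repeats closely enough, else None."""
    heard = []
    for phonemes in attempts[:max_attempts]:
        for earlier in heard:
            if phoneme_distance(earlier, phonemes) <= max_distance:
                return earlier
        heard.append(phonemes)
    return None
```

With this measure, the generated string “CH I Z AX N” is two edits away from the ideal “CH I T AX” (one substitution, one deletion), so a tolerance of 1 would ask the player to try again. Returning None is the cue to prompt for a different name.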
NYA Internationalization
• Separate speech-to-phoneme grammar for each NuiSpeech language
  • NYA accuracy varies across languages
• US English NYA works well for languages other than English
  • Tested in 11 additional countries
  • Allowed us to support NYA in countries that weren’t supported by NuiSpeech at launch
Testing
The Challenge of Testing Speech
• Human beings are very good at spotting patterns
  • Even non-existent ones
• Easy to find reasons why speech works better or worse
  • “Speech works better when I wear a blue shirt!”
  • In reality, recognition is strongly influenced by the exact acoustic environment
• So test with lots of people, and lots of different conditions
  • Individual office vs. open plan
• Look at whether the player successfully completes tasks with speech
  • Not just whether individual commands are recognised (too conservative)
  • Watch out for commands that never seem to work, however!
• Make low-level speech success / failure events visible
  • On-screen log is very useful
Heed the Advice of W. C. Fields
• Never work with children or animals
  • Kinectimals had both…
• Recognition confidences for children inherently lower than for adults
  • Can be self-conscious about “talking to the TV”, leading to them not speaking clearly
  • If they become frustrated, they may shout or do other things that make recognition worse, not better
  • Tutor them through which speech commands to use, and how best to say them
• Set confidence thresholds lower and accept some degree of False Accepts for adult speakers
  • This can be difficult since your test / development team will get a worse experience
What We Learnt
• Integration of the speech recognition system was straightforward (even with NYA)
  • But testing is hard and time-consuming!
• Look at task completion, not purely at recognition accuracy
  • Players will probably not notice occasionally having to repeat commands
• Contrast issuing commands to the game with talking to an in-game character
  • Issuing commands: small command set, but very high accuracy required
  • Talking to character: more tolerant of failed recognition, but larger command set, or even natural language expected
• Naming things via speech is hard
  • You probably won’t have access to generic speech-to-text capabilities
  • If you can, use text input to acquire the name and then add it dynamically as a grammar rule
  • You may want a custom lexicon of common / difficult names to ensure correct phonemes are used
• Accept you may not be able to please everyone all the time
  • Weight success towards your primary audience
Thank you to…
• Xbox Platform Speech Team
• Kinectimals Team at Frontier
No animals were harmed in the making of this game
• A few testers lost their voices, however