Multi- and Cross-Lingual Dialog Systems - Uni Ulm … · Multi- and Cross-Lingual Dialog Systems...
Transcript of Multi- and Cross-Lingual Dialog Systems - Uni Ulm … · Multi- and Cross-Lingual Dialog Systems...
Multi- and Cross-Lingual
Dialog Systems
Alex Waibel and the InterACT Team
Carnegie Mellon University
Karlsruhe Institute of Technology
Mobile Technologies, LLC
Dialog Processing
• Human-Machine:
– Example: Human-Robot Interaction
– To Err is not only Human
– Multimodal Dialogs
• Human-Human:
– Example: Computers in the Human Interaction Loop
– Context Aware Agents
– Implicit and Explicit Interaction
• Human-Computer-Human
– Example: Cross-Lingual Communication
– Machine as Mediator
– Consecutive and Simultaneous
Multi-Modal Information
• To Err is Human
• Repair:
– Repair by Repeating is Singularly Ineffective
– Error Repair by Dialog
– Cross-Modal Repair
• Cross-Modal Repair
– Two Patents
Keyboard
Type Correction
Speech
Speak &Interpretation
“multimodal” Correction
Time
current keyboard-less Correction
Gestures for Editing and Partial Word Correction
Delete Words and Characters: Indicate Cursor Position:
Select Characters:
Partial Word Correction:
• Humanoid Robots
• SFB-588
– 10-Year
Research Center
• Joint Research:
– Robotics
– Multimodal
Perception
– Dialog
– Planning
Robuste Recognition
Speech Recognition
Without Buttons and Close Speaking Mics
• iPhone Input
• Aktive Listening
Robot Turns/Drives Closer
• MicArray
Small Improvements
• Distant Mic Processing
– Dereverberation
– Joint Particle Filter
Adaptation of Dialog Strategy
• Acoustic Adaption
• Dialog Adaptation Situationen, Objekts und Rolls
• Learning of Concepts
• Extended Behavior Network
Longitudinal Study: One Year
24/7 Operation
• Learning and Forgetting
• New Knowledge is Errorful
and Needs to be Forgotten
Dialog Strategy and 24/7 Interaction
Knowledge Mending: Personeneinträge in der
Datenbank beim Lernen über der Zeit
stabilizing
best performance
degrading
mending
interactions
[%]
Evaluation: person ID
“dynamic” interACT only
precision = correct labels
learned labels
initializing precision
f-measure
recall
#IDs in database
Web-Site f-measure
16
Human-Human Interaction Support
• CHIL – Computer in the Human Interaction Loop
– Rather than Humans in the Computer Loop
– Explicit Computing Complemented by Implicit Support
• Implicit Computing Services
– Support Human-Human Interaction Implicitly
– Increasingly Powerful Computing Services
– Implicit Services Observe Context and Understanding
– Reduction in Attention to Technological Artifact,
Increased Productivity
– Computer Learns from Human Activity Implicitly
• Visual
– Identity
– Gestures
– Body-language
– Track Face, Gaze, Pose
– Facial Expressions
– Focus of Attention
• Verbal:
– Speech
• Words
• Speakers
• Emotion
• Genre
– Language
– Summaries
– Topic
– Handwriting
“Why did Joe get angry at Bob about the budget ?”
Need Recognition and Understanding of Multimodal Cues
Interpreting Human Communication
We need to understand the: Who, What, Where, Why and How !
The CHIL Project
Logo Logo Logo
Universität Karlsruhe (TH)
Coordination:
– Scientific Coordinator: Univ. Karlsruhe, Prof. A. Waibel, R. Stiefelhagen
– Financial Coordinator: Fraunhofer IITB, Prof. Steusloff, K. Watson
The CHIL Team:
Sensors in the CHIL Room
Microphone
Array for Source-
Localization
(4 channels)
Screen
Camera
(fixed)
Pan-Tilt-Zoom
Camera
Microphone
Array
(64 channels)
Ceiling Mounted
Fish-Eye Camera
Stereo-Camera
Technologies/Functionalities
x
What does he
say?
What is his
environment? Where is he?
To whom does he
speak?
What is he
pointing
to?
Who is this?
Where is he
going to?
Technologies & Fusion
• Who & Where ? – Audio-Visual Person Tracking
– Tracking Hands and Faces
– AV Person Identification
– Head Pose / Focus of Attention
– Pointing Gestures
– Audio Activity Detection
• What ? (Input)
– Far-field Speech Recognition
– Far-field Audio-Visual Speech Recognition
– Acoustic Event Classification
• What ? (Output)
– Animated Social Agents
– Steerable targeted Sound
– Q&A Systems
– Summarization
• Why & How ?
– Classification of Activities
– Emotion Recognition
– Interaction & Context
Modelling
– Vision-based posture
recognition
– Topical Segmentation
Technologies/Functionalities
x
What does he
say?
What is his
environment? Where is he?
To whom does he
speak?
What is he
pointing
to?
Who is this?
Where is he
going to?
Results, June 2004
Speech Recognition • Close talking: 37% WER
• Far-field: 65% WER
Speech Detection • 9% Mismatch rate (CTM)
• 12.5% far field
Hand Tracking: • 73% correct
3D Pointing Gestures: • 75% Recall
• 77% Precision
x
Head Detection: • 78% correct (error < 15 pixel)
Head Orientation: • Mean error ca. 10°
Body Tracking: • 80,7% correct (error < 30 cm)
• mean error: 22 cm
Face Recognition (7 subjects) • 76% with manual alignment
•15% fully automatic
Speaker ID: • 100% correct, after 30s
Source Localization: • 11° root mean square error
Accoustic event classification
(25 classes) • 38,4% error
Evaluation: International Effort
• NIST and EC Programs Join Forces
– RT-Meeting’06 – Rich Transcription
• Emerges from established DARPA activity
• MLMI Workshops, AMI/CHIL
• Evaluated Verbal Content Extraction
• Chair: Garofolo (NIST)
– CLEAR’06 –
Classification of Locations, Events, Activities, Relationships
• Emerging from European program efforts (CHIL, etc.) and
US-Programs (VACE,..)
• First Joint Workshop to be Held in Europe
after Face & Gesture Reco WS, April 13 & 14, Southampton
• Chair: Stiefelhagen (UKA)
Human-Human Support Services
– Connector
• Connects people through the right device at the right moment
– Meeting Browser
• Create Corporate Memory of Events
– Memory Jog
• Unobtrusive service. Helps meeting attendees with information
• Provides pertinent information at the right time (proactive/reactive)
• Lecture Tracking and Memory
– Relational Report
• Informs the current speaker about interest/boredom of audience
• Coaches Meetings to be More Effective
– Socially Supportive Workspaces
• Physically shared infrastructure aimed at fostering collaboration
– Cross-Lingual Communication Services
• Detect Language Need and Deliver Services Inobtrusively
– … (and more)
Implicit Information Delivery
Private and Public Information Delivery
– CHIL phone
– Steerable Camera Projector
– Targeted Audio
– Retinal and Heads-Up Displays
The Connector
• Socially Appropriate Connection
– Connect People when Appropriate by Appropriate Media
• Connecting People depends on:
– Social Relationship of Parties
– Space / Environment
– Activity, User State
– Urgency of Matter
The Language Challenge
• Dilemma:
– Living in the Global Village
• Globalization, Global Markets
• Increased Exchange and Communication
• European/International Integration
– Cultural Diversity:
• Beauty, Identity, Language, Culture, Customs
• Pride and Individualism
• Language Ability
– Challenge:
• Providing Access to Global Markets and Opportunities
Maintaining Cultural Diversity/Individuality
Machine Interpretation
• Is Interpretation by Machine Possible?
– Yes, and Performance will Continue to Improve
• Is it Replacing Human Interpreters?
– No! Machine Translation Quality still Worse
Lacks Human Judgement and Intuition
– But: Human vs. Machine is Usually not the Choice we have!
Commonly, it is No Communication or Poor English
– Language Barriers are Pervasive and a Broad Social Challenge
• The Vision:
– Multi-Lingual Understanding & Integration for All
– Europe must Maintain & Nurture its own Diversity and Heritage
– Europe must Provide for its own Language Support
– We need to Embrace and Integrate Both Human and Machine Support
Technology
To Build a Speech Translator for a New Language
– 6 Component-Engines: Automatic Speech Recognition, Machine
Translation, and Text-to-Speech Synthesis
– Each is in Principle Language Independent,
but Requires Language Dependent Parameters/Models
– Models are Automatically Trained but Require Large Corpora
– Certain Language Dependent Peculiarities Exist
Speech Translation
Progression of Technologies:
– Domain Limited, Clear Speaking Style (late 80’s-91)
• Janus (first European&US speech-to-speech system)
• ATT, NEC, ATR
– Domain Limited, Spontaneous (‘91-’00)
• Janus II/III (work on 20 languages), Verbmobil, Nespole, Enthusiast, C-STAR, ATR, ETRI, NLPR,…
– Fieldable, Maintainable, Spontaneous
• Transtac, Babylon, Phraselator, Jibbigo, U-STAR
– Domain Unlimited Speech Translation
• Parliamentary Speeches (TC-STAR)
• Broadcast News (GALE)
• Lectures, Seminars (InterACT, STAR-DUST, TC-STAR)
Jibbigo Systems
• iTunes & Android App Stores:
– English, Spanish, French, German,
Japanese, Chinese, Korean, Filipino,
Iraqi, Thai, Pashto, Dari
• Cost:
– Free Jibbigo Online Translator
– Off-Line: Freedom from Network
• Outside of App Store:
– Other Languages in Preparation
– Enterprise Versions for Special
Applications
Supported Devices
Communication
• How it is Done Now:
– Human Interpreters
– Charts, Dictionaries
• Limitations/Problems:
– Limited Supply!!
– Fidelity/Trust/Security
– Number of Languages
Domain Unlimited
Domain Unlimited Translators for:
– TV/Radio Broadcast Translation
– Translation of Lectures and Speeches
– Parliamentary Speeches (UN, EU,..)
– Telephone Conversations
– Meeting Translation
你们的评估准则是什么
3. ---Seeing Personal Translations
• Technology: Heads-up Display Goggles
– Create Translation Goggles
– Run Real-Time Simultaneous Translation of Speech
– Text is Projected into Field of View of Listener
– Translations are Seen as Text Captions Under Speaker
– Output: Spanish, German,…
Hearing Personal Translations
• Technology: Targeted Audio
– Research under EC Project CHIL
(Build Inobtrusive Computer Services)
– Project Partner, Daimler-Chrysler
– Array of Ultra-Sound Speakers
• Result: Narrow Sound Beam
– Audible by one Individual Only
– Others not Disturbed
– Multiple Arrays Could
Provide Multiple Languages
– Steerable
– Recognize/Track Individual Listener
and Keep Language Beam on Target
Prof. Alex Waibel
University without Borders
English->Spanish Lectures
First Research-Prototype CMU 2005
German Lectures, KIT ‘10
Cloud Based Services, MobileTech,‘10
Transition to a Lecture Service
First Beginning: 4 Lectures, 2012
EU-BRIDGE
ASR
MT
Lecture 2
Components Services Events
Service Infrastructure
Adaptation,
Learning
New Improved
Technologies
Speech-
Services for
Users and
Developers
Search for Content
• Transcripts useful to Search for Content
– Slides, and Lectures in the Cloud
– Search in Lectures and Foils by Way of Search Terms
New Challenges
Simultaneous Translation of Lectures
• Continuous Monologue
– Broadcast News, Speeches, Lectures
• Speaking-Style
– Fast, spontaneous, fragmentary, and no punctuation!!
– Noise, Caughing, Singing (!)
• Vocabulary
– Much larger, Special Vocabularies
• Speed, Realtime
• Service-Infrastructure
– Many parallel lectures;
– Automatic, robust assignment of compute power
The German Lecture Translator
• MT in German Lectures is particularly hard. Why?
• Peculiarities of German:
– Wordorder:
Ich möchte mich zu der Konferenz über Maschinelle
Übersetzung anmelden
I want to register to the conference on Machine Translation
– Compounds:
Worterkennungsfehlerrate
Word Recognition Error Rate
– Inflections and Agreement:
Zu der nächsten wichtigen interessanten Vorlesung
Words, Words, Words…
• Technical Terms
normally not in ‘normal’ vocabularies
– Cepstral-Koeffizienten
– Wälzlagerungen Roller Bearings
– Unterraum Subspace
• Technical Terms
with special Meanings
– Klausur Final Exam (not Retreat)
– Vorzeichen Sign (not Omen)
• Formulas:
– Eff von Ix f(x)
Words, Words, Words….
• Foreign Words in German Language
– Computer Science, English Expressions
– Political Speeches, Latin Proverbs
• Accent
– “Würfelkalkül” (Asfour)
• Foreign Words in German Language
– “Cloud”, “iPhone”, “iPad”, “Laser”
• Inflections & Declinations of these Words
– Web-ge-casted, down-ge-loaded
• Formation of Compounds:
– Cloudbasierter Webcastzugriff
The Long Tail of Language
• Languages:
– Only a Few Languages are Currently Addressed (<10)
– Development of Technology Takes Long & Is Expensive
– Cost more than 1M $ and DevTime more than 1 Year per Language
– 6, 000 languages, 36 M potential language pairs, Plus Dialects
– Technology is Always a Step (or Two!) Behind Deployment