© 2010 The University of Sheffield Spoken Language …roger/publications/RKM Keynote - Tartu -...
Transcript of © 2010 The University of Sheffield Spoken Language …roger/publications/RKM Keynote - Tartu -...
1
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 1
Kõnekeele Töötlemine
Prof. Roger K. MooreProf. Roger K. Moore
Chair of Spoken Language Processing
Dept. Computer Science, University of Sheffield, UK
(Visiting Prof., Dept. Phonetics, University College London)
(Visiting Prof., Bristol Robotics Laboratory)
Homse päeva tehnoloogia täna või …
tänapäeva tehnoloogia homme?
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 2
Spoken Language Processing
Prof. Roger K. MooreProf. Roger K. Moore
Chair of Spoken Language Processing
Dept. Computer Science, University of Sheffield, UK
(Visiting Prof., Dept. Phonetics, University College London)
(Visiting Prof., Bristol Robotics Laboratory)
tomorrow’s technology today or …
today’s technology tomorrow?
2
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 3
Spoken Language Processing
Spoken Language Dialogue Systems
X
XAutomatic Speech
Recognition
XText-to-Speech
Synthesis
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 4
Plenty of Applications …in Science Fiction!
3
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 5
Why’s Anyone Interested?
• hands-free
• eyes-free
• fast
• intuitive“You have been learning since birth the
only skill needed to operate our equipment.”
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 6
Today’s Technology• The majority of mobile phones have voice
dialling
• Software for dictating documents is now available as standard on a PC
• Interactive Voice Response (IVR) systems are becoming commonplace for handling telephone enquiries
• Voice-based search is now available on your phone
“Speech is the most “Speech is the most naturalnatural mode for mode for humanhuman--machine interaction”machine interaction”
4
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 7
Speaker-Dependent Large-Vocabulary Continuous Speech Recognition
Document Dictation
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 8
Voice Dictation
5
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 9
Speaking Information
Expressive Text-to-Speech Synthesis
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 10
Speech Recognition ⇒ Text Translation ⇒ Speech Synthesis
Speech-to-Speech Translation
6
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 11
2 20 200 2000 20000 Unrestricted
Spontaneousspeech
Readspeech
Fluentspeech
Connectedspeech
Isolatedwords
VOCABULARY SIZE (#words)
SP
EA
KIN
G S
TY
LE
1980
1990
2000
system drivendialogue
officedictation
namedialing
form fillingby voice
directoryassistance
wordspotting
digitstrings
voicecommands
Adapted (with permission) from an original figure by Prof. Sadaoki Furui, TIT, Japan.
Recent Progress
2010
naturalconversation2-way
dialogue
transcriptionnetworkagent &
intelligentmessaging
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 12
Where we are Today• There is steady year-on-year progress
• Improvements come from:– corpus-driven statistical modelling approach– increase in available computer power– public benchmark testing
• Progress has not come about as a result of deep insights into human spoken language
• Spoken language technology is – fragile (in ‘real’ conditions)– expensive (to port to new applications / languages)
• Performance appears to be reaching an asymptote that is well short of human abilities– 20-50% word error rate on conversational speech
��������
☺☺☺☺☺☺☺☺
��������
☺☺☺☺☺☺☺☺
��������
7
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 13
It’s Not Natural to Talk to a Machine!
��������
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 14
Speech Processing can be Difficult
��������
8
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 15
Human SR vs Machine SR
0.001
0.01
0.1
1
10
100
Connected
Digits
Alphabet
Letters
Resource
Management
Wall Street
Journal
Business
News
Switchboard
Wo
rd E
rro
r R
ate
(%
)
ASR
Human
0.001
0.01
0.1
1
10
100
Connected
Digits
Alphabet
Letters
Resource
Management
Wall Street
Journal
Business
News
Switchboard
Wo
rd E
rro
r R
ate
(%
)
ASR
Human
Taken from Lippmann, R. P. (1997). Speech recognition by machines and humans. Speech Communication, 22, 1-16.
��������
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 16
Human SS vs Machine SS
��������
9
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 17
Tomorrow’s Technology
“It’s hard to predict …”
“… especially the future!”
Niels Bohr (1922)
When will it be good enough?
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 18
Ray Kurzweil’s Predictions
18
“2009: The majority of text is created using continuous speech recognition.”
“2019: Most interaction with computing is
through gestures and … spoken communication.”
10
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 19
Results of a Survey
ASRU97
Moore, R. K. (2005). Results from a survey of attendees at ASRU 1997 and 2003, INTERSPEECH. Lisbon.
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 20
Overall Results
2009 2003 1997
Respondents: 81
Overall Median:(1st 12 statements)
2010
“Never”s:(1st 12 statements)
17%
Named
responses:22%
ASRU97
11
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 21
Overall Results
2009 2003 1997
Respondents: 105 81
Overall Median:(1st 12 statements)
2020 2010
“Never”s:(1st 12 statements)
22% 17%
Named
responses:4% 22%
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 22
Overall Results
2009 2003 1997
Respondents: 127 105 81
Overall Median:(1st 12 statements)
2028 2020 2010
“Never”s:(1st 12 statements)
22% 22% 17%
Named
responses:21% 4% 22%
12
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 23
6 Years!
Another 6 Years!
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 24
Year: 2009 2003 1997
Median: 2015 2008 2002
SD: 109 10 4
Min: 2001 2000 1998
Max: 3220 2060 2020
13
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 25
Year: 2009 2003 1997
Median: 2022 2010 2007
SD: 10 10 57
Min: 2001 2002 1999
Max: 2080 2050 2500
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 26
Year: 2009 2003 1990
Median: N 2100 2009
SD: 184 48
Min: 2010 2000
Max: 3000 2300
14
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 27
Year: 2009 2003 1997
Median: 2035 2030 2020
SD: 310 212 124
Min: 2010 2005 1997
Max: 5000 3827 3001
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 28
Year: 2009 2003 1999
Median: 2100 2053 2019
SD: 25 225
Min: 1960 2004
Max: 2140 3827
15
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 29
Are We Making Progress?Perfect
Knowledge
Perfect Ignorance
Past Present Future
inreasing steadily?
saturating?
accelerating?
falling!
“Speech is the most “Speech is the most
sophisticated behaviour of the sophisticated behaviour of the
most complex organism in the most complex organism in the
known universe”known universe”(Dawkins, 1991 + (Dawkins, 1991 + GopnikGopnik et al, 2001)et al, 2001)
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 30
Whither Tomorrow’s Technology?• We need a radical re-think of our approach(es)
• The main challenge is to account for the immense variability that appears to be inherent in speech
• We appear to be suffering from a devastating lack of relevant priors and conditional dependencies
• Scientific-reductionism has lead us to treat recognition, synthesis and dialogue independently
• The wider communicative function of speech tends to be ignored
• Systematic behaviour resulting from speaker-listener interaction is thus observed (and modelled) as random variation!
16
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 31
The Way Forward?• Spoken language needs to be treated as an
intimate part of an artificial cognitive system, not simply a fancy interface technology
• Ultimately, speech recognition, speech generation and conversational interaction cannot be modelled independently
• Speech needs to be modelled as a behaviour that is conditioned on communicative contextwhere:– the talker has in mind the needs of the listener(s)– the listener has in mind the intentions of the talker(s)
Moore, R. K. (2007). Spoken language processing: piecing together the puzzle. Speech Communication, 49, 418-435.
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 32
The Speech Chain
Taken from ‘The Speech Chain: The Physics and Biology of Spoken Language’, P. B. Denes & E. N. Pinson, New York: Anchor Press, 1973.
XX
17
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 33
The Communicative ‘Loop’
Moore R K. 'Cognitive Informatics: The Future of Spoken Language Processing?', Keynote talk, SPECOM - 10th Int. Conf. on Speech and Computer, Patras, Greece, 17-19 October (2005).
��
HUMAN MACHINE
Sensorimotor
Feedback
Control
Process
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 34
A ‘Cognitive’ Approach to SLP
Predictive Sensorimotor Control & Emulation: PreSenCE
sensory
input(s)
motor
output(s)
S
BP
multi-modal
input(s)
multi-modal
output(s)
ni
ei
ai
+-
PB
ej aj
njU
+-
disturbance(s)
System
User
Moore, R. K. (2007). PRESENCE: A human-inspired architecture for speech-based human-machine interaction. IEEE
Trans. Computers, 56(9), 1176-1188.
System’s needs
User’s needs
The system’s needs should be:
“to satisfy the user’s needs”
18
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 35
PreSenCE
S:i S:mx-x -
S:E(U:m)
S:E(U:E(S:m))
S:E(U:m)
S:E(U:E(S:i))
S:E(U:i)
-
-
x
S:E(U:n)
-
S:n
-mo
tivation
feeling
sensitivity
atte
ntion
interpretation
actionneeds
nois
e, dis
tort
ion, re
action, dis
turb
ance
intention
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 36
PreSenCE
S:i S:mx-x -
S:E(U:m)
S:E(U:E(S:m))
S:E(U:m)
S:E(U:E(S:i))
S:E(U:i)
-
-
x
S:E(U:n)
-
S:n
-mo
tivation
feeling
sensitivity
atte
ntion
interpretation
actionneeds
nois
e, dis
tort
ion, re
action, dis
turb
ance
intention
Primary route for System’s
motor behaviour
19
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 37
PreSenCE
S:i S:mx-x -
S:E(U:m)
S:E(U:E(S:m))
S:E(U:m)
S:E(U:E(S:i))
S:E(U:i)
-
-
x
S:E(U:n)
-
S:n
-mo
tivation
feeling
sensitivity
atte
ntion
interpretation
actionneeds
nois
e, dis
tort
ion, re
action, dis
turb
ance
intention
System’s emulation of user’s possible motor
actions
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 38
PreSenCE
S:i S:mx-x -
S:E(U:m)
S:E(U:E(S:m))
S:E(U:m)
S:E(U:E(S:i))
S:E(U:i)
-
-
x
S:E(U:n)
-
S:n
-mo
tivation
feeling
sensitivity
atte
ntion
interpretation
actionneeds
nois
e, dis
tort
ion, re
action, dis
turb
ance
intention System’s emulation of User’s emulation of System
20
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 39
PreSenCE
S:i S:mx-x -
S:E(U:m)
S:E(U:E(S:m))
S:E(U:m)
S:E(U:E(S:i))
S:E(U:i)
-
-
x
S:E(U:n)
-
S:n
-mo
tivation
feeling
sensitivity
atte
ntion
interpretation
actionneeds
nois
e, dis
tort
ion, re
action, dis
turb
ance
intention
System’s interpretation
of User behaviour
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 40
PreSenCE
S:i S:mx-x -
S:E(U:m)
S:E(U:E(S:m))
S:E(U:m)
S:E(U:E(S:i))
S:E(U:i)
-
-
x
S:E(U:n)
-
S:n
-mo
tivation
feeling
sensitivity
atte
ntion
interpretation
actionneeds
nois
e, dis
tort
ion, re
action, dis
turb
ance
intention
Affective signals drive the control
loop(s)
21
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 41
Implications for SLP• A new model of speech generation that:
– selects its characteristics appropriate to the needs of the listener
– monitors the effect of its own output– modifies its behaviour according to its internal model of
the listener
• A new model of speech recognition that:– uses a forward/generative model based on an internal
emulation of the intentions of the speaker– adapts its forward/generative model to the voice of the
speaker based on knowledge of its own voice
• A new model of dialogue that:– is driven by the need to satisfy the user(s)’ needs– uses emotion to appraise its success
• No contemporary SLP systems exploit such sensorimotor integration, overlap & control
Moore, R. K. (2007).
PRESENCE: A human-inspired architecture for speech-based
human-machine interaction. IEEE Trans.
Computers, 56(9), 1176-
1188.
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 42
PreSenCE-Related Research
• ACORNS– Acquisition of Communication and
Recognition Skills
• SERA– Social Engagement with Robots and Agents
• S2S– Sound to Sense
• SCALE– Speech Communication with Adaptive
Learning
• COMPANIONS– Intelligent, Persistent, Personalised
Multimodal Interfaces to the Internet
22
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 43
Synthesis-by-Analysis
• Speech: HMM-based synthesis (HTS)
• Noise: passing vehicle• SNR: - 9.2 dB → + 7.8 dB
Mauro Nicolao
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 44
Speech Energetics
Robin Hofe
Hofe, R., & Moore, R. K. (2008). Towards an investigation of speech energeticsusing 'AnTon': an
animatronic model of a human tongue and
vocal tract. Connection Science,
20(4), 319–336.Animatronic tongue being
developed at Sheffield University
23
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 45
‘AnTon’ – Animatronic Tongue
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 46
A Glimpse of the Future?A system with needs!
24
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 47
Year: 2009 2003 1997
Median: N N N
SD: 93 1308 546
Min: 2009 1981 1984
Max: 3000 9999 5001
© 2010 The University of Sheffield
Estonian HLT Conference - Tartu - 25-26 Nov. 2010 slide 48
Aitäh kuulamast!
Tomorrow’s technology today or …
today’s technology tomorrow?
You decide!