© 2011 The University of Sheffield
Developmental Speech Recognition, Bielefeld: 17-18 Feb. 2011 slide 1
Discovering the Particulate Structure of Speech
Prof. Roger K. Moore
Chair of Spoken Language Processing
Dept. Computer Science, University of Sheffield, UK
(Visiting Prof., Dept. Phonetics, University College London)
(Visiting Prof., Bristol Robotics Laboratory)
Overview
• Human versus machine speech recognition
• Developmentally-inspired ASR
• Research conducted in the EU-FP6 ACORNS FET project
• The particulate structure of speech
• Phylogenetic and ontogenetic perspectives
• The role of the production system
• Relevant research at USFD
Human SR vs. Machine SR
[Figure: word error rate (%), log scale from 0.001 to 100, for ASR and human listeners on six tasks: Connected Digits, Alphabet Letters, Resource Management, Wall Street Journal, Business News, Switchboard]
Taken from Lippmann, R. P. (1997). Speech recognition by
machines and humans. Speech Communication, 22, 1-16.
Human SR vs. Machine SR
• What’s going on here?
• The definition of ‘recognition’ in machine SR is fundamentally correct …
– “the most likely explanation of the incoming data given a model of how it was produced”
• Any shortfalls in performance must therefore be due to …
– insufficient fidelity of the data
– having the wrong model
• ASR researchers have been investigating both for ~60 years
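The definition of recognition quoted on this slide is commonly written as a Bayes decision rule. The symbols below are notation assumed here, not taken from the slides: X is the incoming acoustic data, W a candidate word sequence, p(X | W) the model of how the data was produced (the acoustic model) and P(W) the prior over word sequences (the language model).

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        \;=\; \arg\max_{W} p(X \mid W)\, P(W)
```

The denominator p(X) does not depend on W, so it drops out of the maximisation. On this reading the two sources of shortfall listed above map onto the observation X (fidelity of the data) and the two model terms.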
Human SR vs. Machine SR
[Figure: word error rate (%), 0 to 70, against hours of training data (log scale, up to 10,000,000 hours) for supervised, unsupervised, and unsupervised (reduced LM training) systems, with a ‘Human’ reference and annotations at 2 year-old, 10 year-old, 80 year-old and >70 lifetimes]
Moore, R. K. (2003). A comparison of the data requirements of automatic
speech recognition systems and human listeners, EUROSPEECH03. Geneva.
Human SR vs. Machine SR
• What’s going on here?
• From an ML perspective …
– wrong type of data?
– underusing the data?
– lack of suitable priors?
• Answer = all three!
Human SR vs. Machine SR
Aspect      Human                             Machine
Learning    incremental                       one-shot
Context     rich (situated & embodied)        poor (domain-specific)
Style       conversational & communicative    formal & performed
Priors      acquisition device                AM & LM structure
Structure   constructed                       calibrated
Memory      dynamic (episodic & semantic)     static (probabilistic)
Developmentally-Inspired ASR
• These key differences have inspired a number of investigations into the possibility of an artificial embodied agent acquiring spoken language through incremental learning in a situated environment
• The classic study was published by Deb Roy in 1998
• In December 2006 the EU funded a 3-year Future and Emerging Technology project called ‘ACORNS’ (Acquisition of COmmunication and RecogNition Skills)
http://www.acorns-project.org/
‘Little ACORNS’ (LA)
ACORNS Memory Architecture
ten Bosch, L., Van hamme, H., Boves, L., & Moore, R. K. (2009). A computational model of language acquisition: the emergence of words. Fundamenta Informaticae, 90, 229-249.
ACORNS Pattern Discovery Algorithms
Acoustic DP-Ngrams
Acoustic DP-Ngrams
Aimetti, G., & Moore, R. K. (2009). Discovering keywords from cross-modal input: ecological vs. engineering methods for enhancing acoustic repetitions, INTERSPEECH. Brighton, UK.
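As a rough illustration of how dynamic programming can expose a repeated acoustic fragment in two utterances, the sketch below runs a generic Smith-Waterman-style local alignment over frame-level feature vectors. It is not the DP-Ngram algorithm of the cited paper; the function name `local_alignment`, the `threshold` and `gap` parameters, and the toy one-dimensional "features" are all assumptions made for illustration.

```python
import numpy as np

def local_alignment(a, b, threshold=1.0, gap=1.0):
    """Smith-Waterman-style local alignment of two feature sequences.

    a: (n, d) array and b: (m, d) array of frame-level feature vectors.
    Frame similarity is threshold minus Euclidean distance, so only
    frames closer than `threshold` contribute a positive score.
    Returns (best_score, list of aligned (i, j) frame index pairs).
    """
    n, m = len(a), len(b)
    S = np.zeros((n + 1, m + 1))                # accumulated local scores
    back = np.zeros((n + 1, m + 1), dtype=int)  # 0 stop, 1 diagonal, 2 up, 3 left
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = threshold - np.linalg.norm(a[i - 1] - b[j - 1])
            options = (0.0, S[i - 1, j - 1] + sim, S[i - 1, j] - gap, S[i, j - 1] - gap)
            back[i, j] = int(np.argmax(options))
            S[i, j] = options[back[i, j]]
    # Trace back from the highest-scoring cell until the score resets to zero.
    i, j = (int(k) for k in np.unravel_index(np.argmax(S), S.shape))
    best = float(S[i, j])
    path = []
    while back[i, j] != 0:
        if back[i, j] == 1:
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif back[i, j] == 2:
            i -= 1
        else:
            j -= 1
    return best, path[::-1]

# Toy one-dimensional "features": both sequences contain the run 1, 2, 3.
a = np.array([[9.0], [0.0], [1.0], [2.0], [3.0], [8.0]])
b = np.array([[5.0], [1.0], [2.0], [3.0], [6.0]])
score, path = local_alignment(a, b)
```

On the toy input the returned path pairs frames 2 to 4 of `a` with frames 1 to 3 of `b`, i.e. the shared run. Real speech features would need a distance measure and threshold chosen for the front end in use.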
Acoustic DP-Ngrams
[Figure: DP-Ngram comparison of the utterances “Ewan sits on the couch” and “Ewan is shy”]
Aimetti, G., & Moore, R. K. (2009). Discovering keywords from cross-modal input: ecological vs. engineering methods for enhancing acoustic repetitions, INTERSPEECH. Brighton, UK.
Episodic Traces
[Figure: dendrogram of exemplar units within internal class DUCK; x-axis: exemplar index, y-axis: min-cost distance; labelled exemplar units: “duck”, “theduck”, “the”, “is”]
Pattern Discovery (after 100 utterances)
‘objects’ emerging from audio-visual pattern discovery:
“nappy”, “book”, “shoe”, “bath”, “daddy”, “car”, “telephone”, “mummy”, “Ewan”, “bottle”
Word Recognition
Epigenetic Landscape
Aimetti, G., ten Bosch, L., & Moore, R. K. (2009). Modelling early language acquisition with a dynamic systems perspective, 9th Int. Conf. on Epigenetic Robotics. Venice.
Effect of Fetal Hearing
Aimetti, G., & Moore, R. K. (2009). Discovering keywords from cross-modal input: ecological vs. engineering methods for enhancing acoustic repetitions, INTERSPEECH. Brighton, UK.
Whole Words → Sub-Words
Time-frequency ‘patches’ derived using ‘non-negative matrix factorisation’ (NMF)
Van Segbroeck, M., & Van hamme, H. (2009). Unsupervised learning
of time-frequency patches as a noise-robust representation of
speech. Speech Communication, 51(11), 1124-1138.
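For readers unfamiliar with NMF, the sketch below factorises a non-negative matrix V (for example a magnitude spectrogram, frequency by time) into non-negative spectral patterns W and activations H, using the standard multiplicative updates for the Euclidean cost. It is a minimal generic NMF, not the time-frequency patch method of the cited paper; the function name `nmf`, its parameters and the toy data are assumptions.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V (n x m) as W @ H with W (n x rank), H (rank x m).

    Uses multiplicative updates for the squared-error cost, which keep
    W and H non-negative and never increase the reconstruction error.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update patterns
    return W, H

# Toy "spectrogram": two spectral patterns that are active at different times.
patterns = np.array([[1.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 1.0]])
activations = np.array([[1.0, 1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 1.0, 1.0]])
V = patterns @ activations

W, H = nmf(V, rank=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The columns of `W` play the role of recurring parts and the rows of `H` say when each part is active, which is the sense in which the factorisation yields sub-word structure without labels.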
Whole Words → Sub-Words
Parsing words using NMF-based sub-word structure
Van Segbroeck, M., & Van hamme, H. (2009). Unsupervised learning of time-frequency patches as a noise-robust representation of
speech. Speech Communication, 51(11), 1124-1138.
Towards a General Principle
• It is not enough simply to ‘decompose’ speech into a hierarchy of seemingly arbitrary units
• There needs to be an underlying driving principle for the existence (and hence learning) of such structure
• One candidate is ‘the particulate principle of self-diversifying systems’ (Abler, 1989)
Self-Diversifying Systems
[Figure: ‘blending’ constituents versus ‘particulate’ constituents]
Abler, W. L. (1989). On the particulate principle of self-diversifying systems. Journal of Social and Biological Structures, 12, 1-13.
Self-Diversifying Systems
• Examples …
– chemical interaction
– biological inheritance
– human language
• All such systems “make infinite use of finite means” (Humboldt, 1836)
• Properties
– multidimensional
– hierarchical
– periodic
Abler, W. L. (1989). On the particulate principle of self-diversifying systems. Journal of Social and Biological Structures, 12, 1-13.
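The contrast between particulate and blending constituents can be made concrete with a toy calculation. The sketch below is only an illustration under assumed definitions: "particulate" combination is modelled as concatenating discrete units, and "blending" as averaging continuous values. The unit inventory, the population values and both function names are hypothetical.

```python
from itertools import combinations, product

def particulate_combinations(units, length):
    """All distinct sequences of `length` units; each unit keeps its identity."""
    return {"".join(seq) for seq in product(units, repeat=length)}

def blend_generation(values):
    """Replace a population by the average of every pair; identities are lost."""
    return [(x + y) / 2 for x, y in combinations(values, 2)]

units = ["a", "b", "c"]  # hypothetical inventory of discrete units
counts = [len(particulate_combinations(units, n)) for n in range(1, 5)]

population = [0.0, 0.5, 1.0]  # hypothetical continuous-valued constituents
blended = blend_generation(population)
spread_before = max(population) - min(population)
spread_after = max(blended) - min(blended)
```

With three units the number of distinct sequences grows as 3^n, while one round of pairwise blending halves the spread of the toy population: combination of particulate constituents adds diversity, whereas blending removes it.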
The Particulate Structure of Speech
• Grounded in …
– sensorimotor channels
– drives and intentions
• Structure is constructed …
– phylogenetically
– ontogenetically
• Emergent structures …
– pragmatic
– semantic
– syntactic
– lexico-morphemic
– phonological
– articulatory
Abler, W. L. (1989). On the particulate principle of self-diversifying systems. Journal of Social and Biological Structures, 12, 1-13.
The Particulate Structure of Speech
• Abler noted that the physical basis of human speech is fundamentally different from that of biological inheritance or chemical systems
• Consecutive speech gestures and their consequent acoustic signals exhibit blending
• This increases the length of time during which information concerning any one speech sound is present in the speech signal, thus giving the speech signal resistance to interference
• However, if blending ran to completion, it would obliterate most of the communicative power
• Abler concluded that the psychophysical thresholds of human speech perception superimpose a particulate structure over a blending structure
Abler, W. L. (1989). On the particulate principle of self-diversifying systems. Journal of Social and Biological Structures, 12, 1-13.
A Phylogenetic Perspective
• Spoken language also appears to differ from other particulate systems in that it is driven by ‘contrast’
• This is because it is a behaviour exhibited by living organisms, and it has evolved as a consequence of managing ‘energetics’
• In fact, the structure of all particulate systems is the result of constraints/attractors in …
– energy
– entropy
– time
• Living systems have solved the ‘persistence’ problem by actively managing these dimensions
Moore, R. K. (2007). Spoken language processing: piecing together the puzzle. Speech Communication, 49, 418-435.
A Phylogenetic Perspective
• Dependencies exist between many living organisms, and some actively manage such dependencies
• Managing inter-organism dependencies represents a ‘communication’ system
• Many communication systems have evolved which exploit …
– information transfer
– manipulation
• Human speech has emerged as the highest information-rate system (probably because of the high DoF of the vocal articulators)
A Phylogenetic Perspective
“The evolution of the active management of
communication under energetic, informational
and temporal constraints leads to an efficient
contrastive particulate system with a structure
and complexity that is a direct consequence of
the degrees-of-freedom of the available
signalling apparatus and the discriminability
supported by the sensory inputs.”
Roger K. Moore, Feb 2011
An Ontogenetic Perspective
• So, what are the implications for an organism/system that has to acquire communicative skills?
• Is the particulate structure …
– pre-programmed?
– inferred/acquired from the signal?
– an emergent consequence?
• “Ontogeny recapitulates phylogeny” (Haeckel, 1866)
• Learning proceeds through a process of differentiation and factorisation, rather than clustering and segmentation (Karmiloff-Smith, 1992; Hendriks-Jansen, 1996)
The Role of Production
• The child is an active participant, not a passive observer
• Meaning is grounded in doing (Rizzolatti & Arbib, 1998)
• Speech understanding (and hence speech recognition) arises from inferring ‘communicative intentions’
• I.e. it is an ‘inverse’ problem (that can be solved computationally using ‘analysis-by-synthesis’)
• This is equivalent to invoking generative processes in perceptual interpretation (by recruiting information from the actual motor system)
• Production and perception develop hand-in-hand
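The ‘inverse problem’ view can be sketched as a search: propose a candidate intention, run it through a forward (production) model, and keep the candidate whose predicted signal best matches what was observed. Everything below is a toy under stated assumptions: the intention labels, the three-dimensional "signals", the lookup-table forward model and the squared-error match are hypothetical stand-ins, not a model from the talk.

```python
import numpy as np

# Hypothetical forward (production) model: each candidate intention
# maps to the signal the system would itself produce for it.
FORWARD_MODEL = {
    "greet": np.array([1.0, 0.0, 0.0]),
    "request": np.array([0.0, 1.0, 0.0]),
    "refuse": np.array([0.0, 0.0, 1.0]),
}

def synthesise(intention):
    """Predict the signal for an intention using the forward model."""
    return FORWARD_MODEL[intention]

def analysis_by_synthesis(observed, candidates):
    """Return the candidate whose synthesised signal is closest to `observed`.

    Also returns the squared error for every candidate, so the
    margin between hypotheses can be inspected.
    """
    errors = {c: float(np.sum((observed - synthesise(c)) ** 2)) for c in candidates}
    best = min(errors, key=errors.get)
    return best, errors

observed = np.array([0.1, 0.8, 0.2])  # noisy observation of some intention
best, errors = analysis_by_synthesis(observed, list(FORWARD_MODEL))
```

Here perception reuses the production model directly, which is the point of the slide: the listener interprets by generating. In a realistic system the exhaustive loop over candidates would be replaced by a guided search over a far larger hypothesis space.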
Relevant Research at USFD
• Incremental learning of particulate phonological structure
– acquisition of phonemic contrast in word pairs
• Speech energetics
– biomimetic/animatronic model of the human tongue and vocal tract (AnTon)
Hofe, R., & Moore, R. K. (2008). Towards an investigation of speech
energetics using 'AnTon': an animatronic model of a human tongue
and vocal tract. Connection Science, 20(4), 319–336.
Aimetti, G., Moore, R. K., & ten Bosch, L. (2010). Discovering an optimal
set of minimally contrasting acoustic speech units: a point of focus for
whole-word pattern matching, INTERSPEECH. Makuhari, Japan.
Relevant Research at USFD
• Vocabulary growth
– no evidence for the ‘vocabulary spurt’
• PRESENCE
– predictive sensorimotor control and emulation
Moore, R. K., & ten Bosch, L. (2009). Modelling vocabulary growth from birth to young adulthood, INTERSPEECH. Brighton, UK.
[Figure: number of words (0 to 40,000) against age (1 to 16 years)]
Moore, R. K. (2007). PRESENCE: A human-inspired
architecture for speech-based human-machine interaction. IEEE Trans. Computers, 56(9), 1176-1188.
[Figure: PRESENCE architecture diagram; labelled elements include needs, motivation, intention, feeling, sensitivity, attention, interpretation and action, together with noise, distortion, reaction and disturbance]
Relevant Research at USFD
vocal interactivity in and between humans, animals and robots
Summary
• Human versus machine speech recognition
• Developmentally-inspired ASR
• Research conducted in the EU-FP6 ACORNS FET project
• The particulate structure of speech
• Phylogenetic and ontogenetic perspectives
• The role of the production system
• Relevant research at USFD