thesis - MIT Media Labdkroy/papers/pdf/phd_thesis... · 2009. 8. 24. · /* Consider all pairs of...
[Figure: Utterance-Context Pair. Labels: utterance (linguistic unit 1, linguistic unit 2, ..., linguistic unit M); context (semantic category 1, semantic category 2, ..., semantic category N).]
[Figure: Labels: Sensors; utterance; context; linguistic unit prototype; semantic category prototype; Lexicon (semantic category, linguistic unit).]
[Figure: Linguistic channels and contextual channels over time. Labels: Linguistic Events (L-events); Semantic Events (S-events); co-occurrence filter; Linguistic-Semantic Events (LS-events); Short Term Memory (STM); recurrence filter; lexical candidates; Mid Term Memory (MTM); mutual information filter; lexical items; Long Term Memory (LTM).]
[Figure: Feature extraction and event detection. Labels: input sensor signals; sensors; feature analyzers; linguistic channels; contextual channels; linguistic event detector; semantic event detector; L-events; S-events; time.]
[Figure: Event segmenter. An L-event or S-event is composed of multiple channels (channel 1, channel 2, channel 3). The event is divided into an array along channel and time segment boundaries (segment 1, segment 2, ..., segment n). Panels: an event divided along time segments and channels; some potential subevents.]
[Figure: Semantic events (S-events) and linguistic events (L-events) over time. Co-occurring L-events and S-events are paired to form LS-events. Short term memory (STM) contains recent LS-events; old LS-events are forgotten.]
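The pairing of co-occurring L-events and S-events, and the forgetting of old LS-events, can be sketched roughly as follows. This is only an illustration: the `Event` type, the overlap test used for co-occurrence, and the capacity `STM_CAPACITY` are assumptions, not details taken from the figure.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    label: str    # hypothetical identifier, for illustration only
    start: float  # seconds
    end: float    # seconds


def co_occur(l_event, s_event):
    """Assumed criterion: the two time intervals overlap."""
    return l_event.start < s_event.end and s_event.start < l_event.end


def pair_events(l_events, s_events):
    """Pair every L-event with every S-event that overlaps it in time."""
    return [(l, s) for l in l_events for s in s_events if co_occur(l, s)]


# STM as a bounded FIFO: appending beyond maxlen drops the oldest LS-event.
STM_CAPACITY = 3  # hypothetical capacity
stm = deque(maxlen=STM_CAPACITY)

l_events = [Event("utt-1", 0.0, 1.0), Event("utt-2", 5.0, 6.0)]
s_events = [Event("view-1", 0.5, 2.0), Event("view-2", 8.0, 9.0)]
for ls_event in pair_events(l_events, s_events):
    stm.append(ls_event)
```

A bounded `deque` is used here because it discards the oldest entry automatically, which mirrors the "old LS-events forgotten" label without extra bookkeeping.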
/* Consider all pairs of LS-events in short term memory */
for each pair of LS-events in STM, LSi and LSj {
    Lmatch = FALSE
    Smatch = FALSE

    /* Compare each pair of L-subevents in LSi and LSj */
    for each L-subevent in LSi, Li {
        for each L-subevent in LSj, Lj {
            if dL(Li, Lj) < tL then set Lmatch = TRUE
        }
    }

    /* Compare each pair of S-subevents in LSi and LSj */
    for each S-subevent in LSi, Si {
        for each S-subevent in LSj, Sj {
            if dS(Si, Sj) < tS then set Smatch = TRUE
        }
    }

    /* Check for matches of L-subevents and co-occurring S-subevents */
    if Lmatch = TRUE and Smatch = TRUE then recurrent match found
}
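A minimal Python sketch of the same pairwise search is given below. The representation of an LS-event as a dict with keys `"L"` and `"S"`, the function names, and the toy distance are all assumptions made for illustration; the distance functions and thresholds correspond to dL, tL, dS, and tS in the pseudocode.

```python
from itertools import combinations


def has_match(subevents_i, subevents_j, distance, threshold):
    """True if any cross pair of subevents is closer than the threshold."""
    return any(distance(a, b) < threshold
               for a in subevents_i for b in subevents_j)


def recurrent_pairs(stm, d_l, t_l, d_s, t_s):
    """Return index pairs (i, j) of LS-events in STM that recur.

    Each LS-event is assumed to be a dict with lists under the
    hypothetical keys "L" (L-subevents) and "S" (S-subevents).
    """
    matches = []
    for (i, ls_i), (j, ls_j) in combinations(enumerate(stm), 2):
        l_match = has_match(ls_i["L"], ls_j["L"], d_l, t_l)
        s_match = has_match(ls_i["S"], ls_j["S"], d_s, t_s)
        if l_match and s_match:
            matches.append((i, j))
    return matches


def abs_diff(a, b):
    """Toy distance: subevents are plain numbers in this example."""
    return abs(a - b)


stm = [
    {"L": [1.0, 9.0], "S": [0.2]},
    {"L": [1.1], "S": [0.25, 7.0]},
    {"L": [5.0], "S": [0.2]},
]
pairs = recurrent_pairs(stm, abs_diff, 0.5, abs_diff, 0.1)
```

In this toy data only the first two LS-events match on both the L side and the S side; the third shares an S-subevent with the first but has no close L-subevent, so it is not reported.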
[Figure: Short Term Memory (STM) holding LS-events. Filled regions indicate recurrent L-subevents and S-subevents.]
[Figure: Mid Term Memory (MTM). Lexical candidate: linguistic unit prototype and semantic prototype.]
[Figure: Linguistic feature space and contextual feature space. L-unit = {L-prototype, L-radius}; S-category = {S-prototype, S-radius}.]
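One way to read the prototype-plus-radius labels is that an observation belongs to an L-unit or S-category when it lies within the radius of the prototype. The sketch below encodes that reading. The class name `Unit` and the use of Euclidean distance are assumptions for illustration; the actual distance measures are not given in this figure text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    prototype: tuple  # a point in feature space
    radius: float     # acceptance radius around the prototype

    def contains(self, point):
        # Euclidean distance is a stand-in for the real distance measure.
        return math.dist(self.prototype, point) <= self.radius


l_unit = Unit(prototype=(0.0, 0.0), radius=1.0)
inside = l_unit.contains((0.3, 0.4))   # distance 0.5
outside = l_unit.contains((2.0, 0.0))  # distance 2.0
```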
[Figure: Panels of lettered points (a to z) labeled medium, large, and small. Mutual information values shown: I(S;L) = 0.013 bits, I(S;L) = 0.29 bits, I(S;L) = 0.0 bits.]
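For reference, mutual information in bits between two discrete variables can be computed from a table of joint counts. The sketch below assumes S and L are binary indicator variables (for example, whether an observation falls inside an S-radius and an L-radius); that interpretation and the function name are assumptions, not something stated in the figure text.

```python
import math


def mutual_information_bits(joint):
    """Mutual information I(S;L) in bits from a 2x2 table of joint counts.

    joint[s][l] is the number of observations with indicator values S=s, L=l.
    """
    total = sum(sum(row) for row in joint)
    p_s = [sum(row) / total for row in joint]
    p_l = [sum(joint[s][l] for s in range(2)) / total for l in range(2)]
    mi = 0.0
    for s in range(2):
        for l in range(2):
            p_sl = joint[s][l] / total
            if p_sl > 0:  # the term 0 * log(0) is taken as 0
                mi += p_sl * math.log2(p_sl / (p_s[s] * p_l[l]))
    return mi


# Independent indicators carry no information about each other.
independent = mutual_information_bits([[25, 25], [25, 25]])
# Perfectly coupled indicators share one full bit.
dependent = mutual_information_bits([[50, 0], [0, 50]])
```

The two toy tables bracket the range for binary variables: zero bits when S and L are independent, and one bit when each determines the other.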
[Figure: Labels: Mid Term Memory (MTM); Mutual Information Filter; Lexical Item.]
[Figure: Labels: Sensors; feature analysis; L-event detection; event segmentation; L-event; L-subevent matches L-unit in lexical item; LTM; S-category of recognized L-unit.]
[Figure: Labels: Sensors; feature analysis; S-event detection; event segmentation; S-event; S-subevent matches S-category in lexical item; LTM; L-unit of recognized S-category.]
[Figure: Labels: Sensors; feature analysis; event detection; event segmentation; STM; LS-prototype hypotheses; lexical search; matched hypotheses explained away; recurrence filter; MTM; mutual information filter; LTM.]
[Figure: Clustering of lexical items. Lexical item i and lexical item j each have an L-unit and an S-category. Cases shown: L-units and S-categories overlap; S-prototype i matches S-category j; L-prototype j matches L-unit i. Matching lexical items are clustered to form a conglomerate lexical item (linguistic units, semantic categories).]
[Figure: Labels: Environment; feedback; goals; action selection; lexical item confidence adjustment; LTM; threshold adjustment; MTM mutual information filter.]
[Figure labels:]
Microphone; Camera
Phoneme analysis; Object shape analysis; Object color analysis
Linguistic channel: phoneme probabilities
Contextual channels: object shape & color
Spoken utterance detection; Object detection
L-events: spoken utterances
S-events: object view-sets
L-event unpacking; S-event unpacking
L-subevents: speech segments
S-subevents: shape / color view-sets
Co-occurrence filter
LS-events: {spoken utterance, object view-set}
Short Term Memory
Recurrence filter
Lexical candidates: {spoken word prototype, color/shape prototype}
Mid Term Memory
Mutual information filter
Lexical items: {spoken word model, color/shape category}
Long Term Memory
[Figure: Labels: CCD camera; color image; foreground segmentation; foreground bitmap; connected regions analysis; object mask; masked color image; mask-edge spatial derivative analysis.]
[Figure: Context Channel 1: Shape. Context Channel 2: Color. Panels: original RGB image; object mask; shape histogram (axes: relative angle, normalized distance); color histogram (axes: normalized green, normalized red).]
[Figure: Color CCD camera. DOF 1: Base rotation; DOF 2: Base elevation; DOF 3: Neck rotation; DOF 4: Neck elevation; DOF 5: Object turntable rotation.]
[Figure: Recurrent Neural Network. Labels: RASTA-PLP spectral analysis; time delay; unit counts 12, 176, 176, 40. Linguistic channel: phoneme probabilities for aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z.]
state = 1; count_2 = 0; count_3 = 0; count_4 = 0
UTTERANCE_START_DELAY = 50ms; UTTERANCE_END_DELAY = 300ms

for each RNN output vector, l(t) {

    state 1: SILENCE
        if SIL != 1 {
            utteranceStartIndex = t
            state = 2
        } else {
            state = 1
        }

    state 2: POSSIBLE_START_OF_UTTERANCE
        count_2 = count_2 + 1
        if SIL = 1 {
            count_2 = 0
            state = 1
        } else if (count_2 > UTTERANCE_START_DELAY) {
            state = 3
        }

    state 3: UTTERANCE
        if SIL = 1 {
            state = 4
        } else {
            count_3 = count_3 + 1
            state = 3
        }

    state 4: POSSIBLE_END_OF_UTTERANCE
        count_4 = count_4 + 1
        if SIL != 1 {
            count_3 = count_3 + count_4
            count_4 = 0
            state = 3
        } else if (count_4 > UTTERANCE_END_DELAY) {
            utteranceEndIndex = t - count_4 - 1
            ProcessUtterance(utteranceStartIndex, utteranceEndIndex)
            count_2 = 0
            count_3 = 0
            count_4 = 0
            state = 1
        }
}
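The four-state end-point detector can be sketched in Python as below. Several things here are assumptions: the input is taken to be one boolean silence flag per frame (standing in for the SIL test on each RNN output vector), the delays are expressed in frames rather than milliseconds, and detected utterances are collected in a list instead of being passed to ProcessUtterance.

```python
SILENCE, POSSIBLE_START, UTTERANCE, POSSIBLE_END = 1, 2, 3, 4


def detect_utterances(silence_flags, start_delay, end_delay):
    """Return (start, end) frame index pairs for detected utterances.

    silence_flags[t] is True when frame t is classified as silence.
    start_delay and end_delay are counted in frames (an assumption; the
    original constants are given in milliseconds).
    """
    utterances = []
    state = SILENCE
    count_2 = count_3 = count_4 = 0
    start = None
    for t, sil in enumerate(silence_flags):
        if state == SILENCE:
            if not sil:
                start = t
                state = POSSIBLE_START
        elif state == POSSIBLE_START:
            count_2 += 1
            if sil:
                count_2 = 0
                state = SILENCE
            elif count_2 > start_delay:
                state = UTTERANCE
        elif state == UTTERANCE:
            if sil:
                state = POSSIBLE_END
            else:
                count_3 += 1
        elif state == POSSIBLE_END:
            count_4 += 1
            if not sil:
                # Short pause inside an utterance: fold it back in.
                count_3 += count_4
                count_4 = 0
                state = UTTERANCE
            elif count_4 > end_delay:
                utterances.append((start, t - count_4 - 1))
                count_2 = count_3 = count_4 = 0
                state = SILENCE
    return utterances


# Two silent frames, six speech frames, five silent frames.
flags = [True] * 2 + [False] * 6 + [True] * 5
found = detect_utterances(flags, start_delay=2, end_delay=3)
```

The two delays act as hysteresis: a brief non-silent blip shorter than the start delay is ignored, and a pause shorter than the end delay does not split an utterance.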
[Figure: RNN output phoneme probabilities (aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z) between utterance start and utterance end. Labels: Hidden Markov Model (null, a, b, silence); Viterbi algorithm; most likely phoneme sequence: / b aa l /.]
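As a rough illustration of how a Viterbi search turns per-frame phoneme probabilities into a most likely phoneme sequence, consider the generic sketch below. It is not the thesis implementation: the transition table `trans`, the initial distribution `init`, the toy frame probabilities, and the direct use of frame probabilities as emission scores are all assumptions.

```python
import math


def viterbi(frame_probs, trans, init):
    """Most likely state path given per-frame state probabilities.

    frame_probs[t][k]: probability-like score of state k at frame t.
    trans[j][k]: probability of moving from state j to state k.
    init[k]: probability of starting in state k.
    All values are assumed strictly positive so that logs are defined.
    """
    n = len(init)
    score = [math.log(init[k]) + math.log(frame_probs[0][k]) for k in range(n)]
    back = []
    for obs in frame_probs[1:]:
        prev, score, ptr = score, [], []
        for k in range(n):
            best = max(range(n), key=lambda j: prev[j] + math.log(trans[j][k]))
            ptr.append(best)
            score.append(prev[best] + math.log(trans[best][k]) + math.log(obs[k]))
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def collapse(path, names):
    """Merge runs of the same state into a single phoneme label."""
    out = []
    for s in path:
        if not out or out[-1] != names[s]:
            out.append(names[s])
    return out


names = ["b", "aa", "l"]
frame_probs = [
    [0.80, 0.10, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.70, 0.20],
    [0.05, 0.10, 0.85],
]
trans = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]]  # hypothetical
init = [1 / 3, 1 / 3, 1 / 3]                                  # hypothetical
path = viterbi(frame_probs, trans, init)
phonemes = collapse(path, names)
```

Working in log space avoids numerical underflow on long utterances, and collapsing repeated states turns the frame-level path into a phoneme string such as b aa l.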
"yeah"
[Figure: Mutual Information plotted against L-radius and S-radius.]
"dog"
[Figure: Mutual Information plotted against L-radius and S-radius.]
[Figure: Three histograms. x-axis: Distance between view-sets (0 to 40); y-axis: Histogram bin occupancy (normalized).]
[Figure: Three bar charts, each comparing CELL and Acoustic Recurrency. Values in label order: 28% and 7%; 72% and 31%; 57% and 13%.]
[Figure: Labels: CELL; spoken commands; task semantics; user-dependent acoustic & semantic model.]
Scene 1: User points to three colors in the rainbow and names them (lexical acquisition)
Scene 2: User selects a part from the "Tree of Life" by pointing to the part
Scene 3: Part is colored by speech using one of the three lexical items learned in Scene 1
Scene 4: User must select position for new body part using gesture, confirm with speech
Scene 5: A successfully placed part
Scene 6: After two more cycles of Scenes 2-5 the mate is complete and Toco looks on in new-found love