Björn W. Schuller
Machine Learning Group, Imperial College London
Chair of Complex & Intelligent Systems, University of Passau
audEERING UG
• Perceived Leadership Traits: 409 recordings, 10 raters, acoustic & linguistic analysis
“The Voice of Leadership: Models and Performances of Automatic Analysis in On-Line Speeches”, IEEE Transactions on Affective Computing 3(4): 496-508, 2012.
Social Traits.
UA [%]
Non-Malicious  61.3
Diplomatic     58.5
Achiever       72.5
Charismatic    63.2
Teamplayer     65.2
• Eye Contact from Acoustics: GRAS² Corpus, 28 + 4 subjects
“The acoustics of eye contact - Detecting visual attention from conversational audio cues”, ACM Gaze-In, 2013.
Eye-Contact.
UA [%] / AUC
Eye-Contact  67.4 / .732
• Heart Rate & Skin Conductance: Munich BioVoice Corpus
“Automatic Recognition of Physiological Parameters in the Human Voice: Heart Rate and Skin Conductance", ICASSP, 2013.
Physiology.
CC (MAE)     Heart Rate   Skin Cond.
Independent  .382 (17.5)  .298
Dependent    .809 (8.4)   .908

[Figure: heart rate per condition (C, VR, VC, BR, B)]
• Speech Under Eating & Food: 30 subjects, 6 food types, + ASR features
Eating.
UA [%]   Spont.  Read
2-class  91.8    98.7
7-class  62.3    66.4
“The Interspeech 2015 Computational Paralinguistics Challenge: Nativeness, Parkinson’s & Eating Condition”, Interspeech, 2015.
R²: Crispness .562
Year  Task             # Classes  % UA / *AUC / +CC
2015  Nativeness       [0,1]      43.3+
      Parkinson’s      [0,100]    54.0+
      Eating           7          62.7
2014  Cognitive Load   3          61.6
      Physical Load    2          71.9
2013  Social Signals   2x2        92.7*
      Conflict         2          85.9
      Emotion          12         46.1
      Autism           4          69.4
2012  Personality      5x2        70.4
      Likability       2          68.7
      Intelligibility  2          76.8
2011  Intoxication     2          72.2
      Sleepiness       2          72.5
2010  Age              4          53.6
      Gender           3          85.7
      Interest         [-1,1]     42.8+
2009  Emotion          5          44.0
      Negativity       2          71.2

“The Computational Paralinguistics Challenge”, IEEE Signal Processing Magazine, 29(4): 2-6, 2012.
Speaker.
• Done: Depression, Height, Race, Mother or not?, Related?, Truthful?, Smoker?, Empathic?, Stressed?, …
• Done? Weight, Social status, Addressee, …
And Many More...
Year  Task             T [h:m]  #spkrs  #labls
2015  Nativeness       9.9      179     5/23
      Parkinson’s      2.6      50      --
      Eating           2.9      30      --
2014  Cognitive Load   (2418)   26      --
      Physical Load    (1088)   19      --
2013  Social Signals   8.4      120     1
      Conflict         11.9     138     550
      Emotion          (1200)   10      --
      Autism           (2500)   99      --
2012  Personality      1.7      322     11
      Likability       0.7      800     32
      Intelligibility  2.0      55      13
2011  Intoxication     43.8     162     --
      Sleepiness       21.3     99      3
2010  Age              50.6     945     --
      Gender           50.6     945     --
      Interest         2.3      21      4
2009  Emotion          8.9      51      5
      Negativity       8.9      51      5

“The Computational Paralinguistics Challenge”, IEEE Signal Processing Magazine, 29(4): 2-6, 2012.
Data?
• Data is often sparse per se, e.g., for sparsely occurring states/traits
• Often more ambiguous/challenging to annotate (cf. orthographic transcription)
• Often highly private
• But: it is not speech data that is lacking (internet, broadcast, etc.), it is the labels that are missing!
• “Solution A”: Crowd-sourcing
• “Solution B”: Weakly supervised / reinforcement learning
Data?
• Playful Sourcing: iHEARu-PLAY, gamified annotation, competing with others, gratification, …
iHEARu-PLAY
“iHEARu-PLAY: Introducing a game for crowdsourced data collection for affective computing”, WASA, 2015.
• Noisy labels (crowd-sourced?)
• Weight raters: EWE, …
• Weight instances: eliminate outliers, …
• Compensate time deviations: DTW, …
• Test raters: anchor points, agreement with others, …
• Measure reliability: kappa, correlation, …
• Get more responses …
• Ask more targeted questions
Pimp my Gold-Standard?
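The evaluator weighted estimator (EWE) mentioned above fuses continuous ratings by weighting each rater with their agreement (correlation) to the mean of all raters. A minimal sketch in plain Python; the rater matrix, the clipping of negative weights, and the values themselves are illustrative assumptions:

```python
# Evaluator Weighted Estimator (EWE) sketch: weight each rater by
# their Pearson correlation with the mean rating, then fuse.
# Toy ratings; not data from the talk.

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def ewe(ratings):
    """ratings: list of per-rater lists, one value per instance."""
    n_inst = len(ratings[0])
    mean = [sum(r[i] for r in ratings) / len(ratings) for i in range(n_inst)]
    # clip negative correlations to zero so contrarian raters drop out
    weights = [max(pearson(r, mean), 0.0) for r in ratings]
    wsum = sum(weights)
    return [sum(w * r[i] for w, r in zip(weights, ratings)) / wsum
            for i in range(n_inst)]

ratings = [
    [0.1, 0.4, 0.8, 0.9],   # rater 1
    [0.2, 0.5, 0.7, 1.0],   # rater 2: agrees with rater 1
    [0.9, 0.2, 0.3, 0.1],   # rater 3: outlier, gets weight ~0
]
gold = ewe(ratings)
```

The resulting gold standard follows the two consistent raters and effectively ignores the outlier.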
Weakly Supervised Learning.
• Transfer Learning: re-use data of a related domain
• Active Learning: select the “most informative” instances from a large amount of unlabelled data
• Semi-Supervised Learning: have the computer label the data
• Cooperative Learning: an efficient combination of the above
Transfer Learning.
• Transfer Learning with a Sparse Auto-Encoder: target values = input
Constraining the activation of hidden units to be sparse
Training on the target domain, then transferring the source, … and vice versa …
“Linked Source and Target Domain Subspace Feature Transfer Learning", ICPR, 2014.
“Autoencoder-based Unsupervised Domain Adaptation for Speech Emotion Recognition”, IEEE Signal Processing Letters, 2014.
Results (source ≠ target):

% UA / CC    Target  w/o   DAE   DAE-NN
ComParE:EC   60.4    56.3  59.2  64.2
M→S: A       .82     .05   .32   -
M→S: V       .51     .06   .16   -
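The autoencoder-based adaptation idea can be sketched numerically: train a small autoencoder on unlabelled target-domain features, then push source-domain features through it so they better resemble the target distribution. Everything here (toy Gaussian data, layer sizes, learning rate, the `train_autoencoder` helper, an L1 sparsity pressure standing in for the paper's sparsity constraint) is an illustrative assumption, not the cited papers' setup:

```python
import numpy as np

# Autoencoder-based domain adaptation sketch: fit a tanh-encoder /
# linear-decoder autoencoder on target-domain data, then map shifted
# source-domain data through it before classifier training.

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden=4, lr=0.02, epochs=300, beta=1e-3):
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)          # encoder
        R = H @ W2 + b2                   # linear decoder
        err = R - X                       # reconstruction error
        # backprop: MSE plus a small L1 pressure on activations
        dW2 = H.T @ err / len(X); db2 = err.mean(0)
        dZ = (err @ W2.T + beta * np.sign(H)) * (1 - H ** 2)
        dW1 = X.T @ dZ / len(X); db1 = dZ.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return lambda X_: np.tanh(X_ @ W1 + b1) @ W2 + b2

target = rng.normal(0.0, 1.0, (200, 8))   # unlabelled target domain
source = target + 2.0                     # same structure, shifted domain

ae = train_autoencoder(target)
adapted = ae(source)                      # source mapped towards target
```

After mapping, the source features sit much closer to the target distribution's range than the raw, shifted source features.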
Cooperative Learning.
[Diagram: cooperative learning loop]
Labelled instances train a model; the model classifies the unlabelled instances.
Based on confidence & sparseness: high confidence → SSL, else → AL; predicted sparse class → AL, else → discard.
Newly labelled instances are added and the loop repeats.
Stop criteria: no data ‘likely’ of a sparse class, accuracy saturation → final model.
“Cooperative Learning and its Application to Emotion Recognition from Speech,” IEEE Transactions on Audio, Speech and Language Processing 23(1):115-126, 2015.
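The loop above can be sketched in a few lines: high-confidence machine predictions are kept as semi-supervised labels (SSL), low-confidence instances go to a human oracle (AL). The nearest-centroid classifier, margin-based confidence, threshold, and toy data are all illustrative assumptions, and the sparse-class routing is omitted for brevity:

```python
# Cooperative learning sketch: SSL for confident predictions,
# AL (human oracle) otherwise. One-dimensional toy problem.

def centroid_classifier(labelled):
    cents = {}
    for x, y in labelled:
        cents.setdefault(y, []).append(x)
    cents = {y: sum(v) / len(v) for y, v in cents.items()}
    def predict(x):
        scored = sorted((abs(x - c), y) for y, c in cents.items())
        conf = scored[1][0] - scored[0][0]   # margin between best classes
        return scored[0][1], conf
    return predict

def cooperative_learning(labelled, unlabelled, oracle, conf_thresh=0.5):
    labelled = list(labelled)
    for x in unlabelled:
        pred, conf = centroid_classifier(labelled)(x)
        if conf >= conf_thresh:
            labelled.append((x, pred))        # SSL: trust the machine
        else:
            labelled.append((x, oracle(x)))   # AL: ask the human
    return centroid_classifier(labelled)

seed = [(0.0, "neg"), (1.0, "pos")]
pool = [0.1, 0.9, 0.45, 0.55, 0.2]            # unlabelled instances
model = cooperative_learning(
    seed, pool, oracle=lambda x: "pos" if x > 0.5 else "neg")
```

In this toy run only the two ambiguous mid-range instances trigger a human query; the rest are labelled by the machine.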
• Example: ComParE:EC
1) Active Learning (AL)
+ 2) Semi-Supervised Learning: cross-view (xv), multi-view (mv)
= 3) Cooperative Learning
4) Dynamic: a further 79.2% reduction
Improvement of UA ≈ 5.0% with 95.0% of the labelling effort saved
• Human Confidence: predict human labeller agreement by ±1 individual
• External Confidence: learning behaviour of the recogniser; adaptation to the target domain via semi-supervised learning
Confidence Measures.
“Confidence Measures for SER: a Start”, IEEE/ITG SC, 2012.
“Confidence Measures in SER Based on Semi-supervised Learning”, ISCA Interspeech, 2012.
AI / ML System
• Multiple Targets: there is just one vocal production mechanism … only one vocal tract
[Figure: the human vocal tract (nasal cavity, oral cavity, palate, velum, tongue, teeth, lips, jaw, pharynx, glottis; supra-glottal and sub-glottal systems)]
% UA         Single  (added targets)  Multiple
Likability   59.1    (+A,G,Cl)        62.2
Neuroticism  62.9    (+G,OCEA,Cl)     67.5
[Figure: one voice, many targets: drunk, angry, has a cold, neurotic, tired, has Parkinson‘s, is older, …; + <label> (e.g., “burst”)]
Processing: Holism?…
[Diagram: Signal capture → s[k] → Pre-processing → s‘[k] → Feature extraction → x → Learning / Decision → y]
Learning at Signal‘s Edge
“Big Data vs. Little Labels”
Efficient Weakly Supervised Learning
• openSMILE: brute-force high-dimensional feature spaces (Android / C++), online update
Feature Extraction.
“Recent Developments in openSMILE, the Open-Source Multimedia Feature Extractor”, ACM Multimedia, 2013.
(2nd place, ACM MM Open Source Software Competition, 2010 and 2013; ~600 citations for 3 papers)
[Diagram: sensor signal → filtering → TF-transform → LLDs (energy, harmonicity, fundamental frequency, spectral, …) → deriving, filtering, chunking → functionals (extremes, moments, peaks, segments, regression, …)]
# features  RTF
10k         2%
500k        3%
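The brute-force spaces above come from applying a bank of statistical functionals to every low-level descriptor (LLD) contour and concatenating the results into one fixed-length vector. A minimal sketch; the particular functionals and the toy contours are illustrative assumptions, not openSMILE's actual configuration:

```python
# LLD + functionals sketch: map each frame-wise descriptor contour
# to a handful of summary statistics; the per-utterance feature
# vector is the concatenation over all LLDs.

def functionals(contour):
    n = len(contour)
    mean = sum(contour) / n
    var = sum((v - mean) ** 2 for v in contour) / n
    return {
        "mean": mean,
        "stddev": var ** 0.5,
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),
    }

def feature_vector(llds):
    """llds: dict mapping descriptor name -> frame-wise contour."""
    vec = {}
    for name, contour in llds.items():
        for fname, value in functionals(contour).items():
            vec[f"{name}_{fname}"] = value
    return vec

feats = feature_vector({
    "energy": [0.1, 0.4, 0.3, 0.2],
    "f0":     [120.0, 130.0, 125.0, 0.0],   # 0 = unvoiced frame
})
```

With tens of LLDs, their derivatives, and dozens of functionals each, this combinatorial scheme quickly yields the 10k-500k dimensional spaces quoted above.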
• Geneva Minimalistic Set (GeMAPS): 18 LLDs / 62 functionals
Pitch, jitter, formants 1-3, formant 1 bandwidth, shimmer, loudness, HNR,
alpha ratio, Hammarberg index, spectral slope 0-500 Hz and 500-1500 Hz,
formant 1-3 relative energy, harmonic differences H1-H2 and H1-A3
• Extended set (eGeMAPS): +7 LLDs / 88 functionals
MFCC 1-4, spectral flux, formant 2-3 bandwidth
Less is More?
“The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing”, IEEE Transactions on Affective Computing, 2015.
• Face Reading from Speech: GEMEP corpus, SVM vs deep LSTM NNs, big (65) vs small (28) feature set
“Face Reading from Speech – Predicting Facial Action Units from Audio Cues”, Interspeech, 2015.
Action Units. ARIA VALUSPA
• Audio Words: baseline UA/WA = 54.3/61.2%, VQ vs SVQ, VAM corpus (valence)
Bag-of-X-Words
“Detection of Negative Emotions in Speech Signals Using Bags-of-Audio-Words”, WASA, 2015.
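The bag-of-audio-words idea can be sketched in a few lines: quantise each frame-level feature vector against a codebook (VQ) and represent the utterance as a normalised histogram of codeword counts. The codebook and frames below are toy values, not the paper's learned codebook:

```python
# Bag-of-Audio-Words sketch: vector quantisation of frames against a
# fixed codebook, followed by a normalised codeword histogram.

def nearest(codebook, frame):
    dists = [sum((a - b) ** 2 for a, b in zip(code, frame))
             for code in codebook]
    return dists.index(min(dists))          # index of closest codeword

def bag_of_audio_words(frames, codebook):
    hist = [0] * len(codebook)
    for f in frames:
        hist[nearest(codebook, f)] += 1
    total = len(frames)
    return [c / total for c in hist]        # term-frequency normalisation

codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, 0.1), (0.9, 1.1), (0.2, 0.0), (0.1, 0.9)]
boaw = bag_of_audio_words(frames, codebook)
```

The histogram is a fixed-length utterance representation regardless of how many frames the utterance has, which makes it directly usable by a static classifier such as an SVM.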
Deep Learning.
• Deep Neural Networks: DNN = MLP … training makes the difference:
Build them layer-wise: fewer uninitialized parameters
Use “raw” features as input: the net learns its own higher-level features
Use more training data: on-line learning (stochastic gradient descent)
Use ReLUs, drop-out learning, …
• LSTM Cell
Linear unit with a fixed self-weight of 1: the “error carousel”
Non-linear gates: input / output / forget
Multiplicative: they open / shut down the information flow
[Diagram: LSTM cell with input (I), output (O) and forget (F) gates around the error carousel (EC) with weight 1]
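One forward step of the cell described above can be written out directly: the input, forget, and output gates multiplicatively open or shut the flow around the linear carousel state c, whose self-weight of 1 lets the error pass unchanged through time. The scalar weights below are toy assumptions for illustration:

```python
import math

# Single LSTM cell, one scalar input per step; i/f/o are the gates,
# c is the "error carousel" state, h the gated output.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    i = sigmoid(W["wi"] * x + W["ui"] * h_prev + W["bi"])    # input gate
    f = sigmoid(W["wf"] * x + W["uf"] * h_prev + W["bf"])    # forget gate
    o = sigmoid(W["wo"] * x + W["uo"] * h_prev + W["bo"])    # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h_prev + W["bg"])  # candidate
    c = f * c_prev + i * g       # carousel: linear, self-weight 1
    h = o * math.tanh(c)         # gated output
    return h, c

# toy weights: all 0.5
W = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):       # run a short input sequence
    h, c = lstm_step(x, h, c, W)
```

Because c is updated additively (no squashing between steps), gradients flowing back through c do not vanish the way they do in a plain recurrent unit.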
Deep LSTM Nets.
Example: ComParE:SSC

% AUC (Acc.) Test
NN         83.3
Deep NN    92.4
LSTM       93.0
Deep LSTM  94.0
“Social Signal Classification Using Deep BLSTM Recurrent Neural Networks”, IEEE ICASSP, 2014.(Best Result ComParE:SSC)
Parallel Learning.
• CURRENNT: 10 – 1k LSTM cells, 2k – 4 million parameters, distribution
“Introducing CURRENNT - the Munich Open-Source CUDA RecurREnt Neural Network Toolkit”, Journal of Machine Learning Research, 2014.
• Semi-NMF
Matrix factorization (X ≈ ZH) that uncovers meaningful features.
Clustering interpretation:
Z: cluster centroids
H: soft membership for every data point
A soft version of k-means clustering; equivalent iff H is orthogonal, with
H_ij = 1 if x_j belongs to cluster i, 0 otherwise.
Better if the data is not distributed in a spherical manner.

Learning Hidden Representations.
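The factorization above (X ≈ ZH with H ≥ 0 but Z unconstrained) can be sketched with one common formulation of the alternating updates: a least-squares solve for the centroids Z and a multiplicative update for the memberships H. Data sizes, iteration count, and the toy two-cluster data are illustrative assumptions:

```python
import numpy as np

# Semi-NMF sketch: X (features x samples) ~ Z H, H >= 0.
# Z via least squares, H via a multiplicative update built from the
# positive and negative parts of Z'X and Z'Z.

def pos(A): return (np.abs(A) + A) / 2
def neg(A): return (np.abs(A) - A) / 2

def semi_nmf(X, k, iters=100, eps=1e-9):
    rng = np.random.default_rng(0)
    H = rng.random((k, X.shape[1])) + 0.1      # non-negative start
    for _ in range(iters):
        Z = X @ H.T @ np.linalg.pinv(H @ H.T)  # centroids: least squares
        ZtX, ZtZ = Z.T @ X, Z.T @ Z
        H *= np.sqrt((pos(ZtX) + neg(ZtZ) @ H) /
                     (neg(ZtX) + pos(ZtZ) @ H + eps))
    return Z, H

# two toy clusters around -1 and +1 in 3 dimensions (columns = samples)
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(-1.0, 0.1, (3, 10)),
               rng.normal(+1.0, 0.1, (3, 10))])
Z, H = semi_nmf(X, 2)
err = np.linalg.norm(X - Z @ H) / np.linalg.norm(X)
```

Because the update only ever multiplies H by non-negative factors, the membership matrix stays non-negative throughout, which is exactly what licenses the soft-clustering reading of H.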
Deep Semi-NMF.
UA/AUC [%]      Pose   Emotion  ID
Semi-NMF        94.32  44.72    50.33
Deep Semi-NMF   99.78  73.33    74.79
A priori DSNMF  99.99  76.67    80.44
“A Deep Semi-NMF Model for Learning Hidden Representations”, ICML, 2014.
Distribution. “Distributing Recognition in Computational Paralinguistics,” IEEE Transactions on Affective Computing 5:406–417, 2014.
[Figure: original vs. 12 MFCC]
• Mobile Platform: Samsung Galaxy S3 (Android) vs Intel Core i3 2.1 GHz laptop
Embedding.
“Real-time Robust Recognition of Speakers’ Emotions and Characteristics on Mobile Platforms”, ACII, 2015.
Set                      # feats  RTF (S3)  RTF (i3)
GeMAPS Extended          88       2.63      0.11
Interspeech Emotion Ch.  384      0.43      0.04
Interspeech ComParE      6373     2.81      0.12
• Assessing Public Speaking Skills: ICASSP 2011 lectures, 3 classes (from 5) or continuous labels, 5 raters
“Does my Speech Rock? Automatic Assessment of Public Speaking Skills”, Interspeech, 2015.
Public Speeches.