Björn W. Schuller
Machine Learning Group, Imperial College London
Chair of Complex & Intelligent Systems, University of Passau
audEERING UG

• Perceived Leadership Traits
409 recordings, 10 raters
Acoustic & Linguistic analysis

“The Voice of Leadership: Models and Performances of Automatic Analysis in On-Line Speeches", IEEE Transactions on Affective Computing 3(4): 496-508, 2013.

Social Traits.

UA [%]
Non-Malicious   61.3
Diplomatic      58.5
Achiever        72.5
Charismatic     63.2
Teamplayer      65.2

• Eye Contact from Acoustics
GRAS² Corpus: 28 + 4 subjects

"The acoustics of eye contact - Detecting visual attention from conversational audio cues", ACM Gaze-In, 2013.

Eye-Contact.

UA [%] / AUC: Eye-Contact 67.4 / .732

• Heart Rate & Skin Conductance
Munich BioVoice Corpus

“Automatic Recognition of Physiological Parameters in the Human Voice: Heart Rate and Skin Conductance", ICASSP, 2013.

Physiology.

CC (MAE)       Heart Rate    Skin Cond.
Independent    .382 (17.5)   .298
Dependent      .809 (8.4)    .908

[Chart: heart-rate recognition across conditions C, VR, VC, BR, B]

• Speech Under Eating & Food
30 subjects, 6 food types, +ASR features

Eating.

UA [%]    Spont.  Read
2-class   91.8    98.7
7-class   62.3    66.4

“The Interspeech 2015 Computational Paralinguistics Challenge: Nativeness, Parkinson’s & Eating Condition", Interspeech, 2015.

R²: Crispness .562

                       # Classes   % UA / *AUC / +CC
2015  Nativeness       [0,1]       43.3+
      Parkinson’s      [0,100]     54.0+
      Eating           7           62.7
2014  Cognitive Load   3           61.6
      Physical Load    2           71.9
2013  Social Signals   2x2         92.7*
      Conflict         2           85.9
      Emotion          12          46.1
      Autism           4           69.4
2012  Personality      5x2         70.4
      Likability       2           68.7
      Intelligibility  2           76.8
2011  Intoxication     2           72.2
      Sleepiness       2           72.5
2010  Age              4           53.6
      Gender           3           85.7
      Interest         [-1,1]      42.8+
2009  Emotion          5           44.0
      Negativity       2           71.2

“The Computational Paralinguistics Challenge”, IEEE Signal Processing Magazine, 29(4): 2-6, 2012.

Speaker.

• Done: Depression, Height, Race, Mother or not? Related? Truthful? Smoker? Empathic? Stressed? …

• Done? Weight, Social status, Addressee, …

And Many More...

Working, Yet?

                       T [h:m]   #spkrs   #labls
2015  Nativeness       9.9       179      5/23
      Parkinson’s      2.6       50       --
      Eating           2.9       30       --
2014  Cognitive Load   (2418)    26       --
      Physical Load    (1088)    19       --
2013  Social Signals   8.4       120      1
      Conflict         11.9      138      550
      Emotion          (1200)    10       --
      Autism           (2500)    99       --
2012  Personality      1.7       322      11
      Likability       0.7       800      32
      Intelligibility  2.0       55       13
2011  Intoxication     43.8      162      --
      Sleepiness       21.3      99       3
2010  Age              50.6      945      --
      Gender           50.6      945      --
      Interest         2.3       21       4
2009  Emotion          8.9       51       5
      Negativity       8.9       51       5

“The Computational Paralinguistics Challenge”, IEEE Signal Processing Magazine, 29(4): 2-6, 2012.

Data?

• Data often sparse per se, e.g., sparsely occurring state/trait

• Often more ambiguous/challenging to annotate (cf. orthographic transcription)

• Often highly private

• But: It is not speech data that is lacking (internet, broadcast, etc.) … it is labels that are missing!

• “Solution A”: Crowd-sourcing
• “Solution B”: Weakly supervised / reinforcement learning

Data?

More Input (or: we need labels!)

• Playful Sourcing
iHEARu-PLAY: Gamified annotation
Competing w/ others
Gratification
…

iHEARu-PLAY

“iHEARu-PLAY: Introducing a game for crowdsourced data collection for affective computing”, WASA, 2015.

• Noisy labels (crowd-sourced?)

• Weight raters: EWE, …
• Weight instances: Eliminate outliers, …
• Compensate time deviations: DTW, …

• Test raters: Anchor points, Agreement with others, …
• Measure reliability: Kappa, Correlation, …

• Get more responses…
• Ask more targeted questions

Pimp my Gold-Standard?
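The rater-weighting idea above (“Weight raters: EWE, …”) can be sketched in a few lines. This is a minimal EWE-style scheme, weighting each rater by the correlation of their ratings with the mean of the remaining raters; the toy ratings matrix is hypothetical, and the exact EWE formulation in the literature differs in details.

```python
import numpy as np

def evaluator_weighted_estimator(ratings):
    """EWE-style gold standard: weight each rater by the correlation
    of their ratings with the mean of the remaining raters."""
    ratings = np.asarray(ratings, dtype=float)   # shape: (n_raters, n_items)
    n_raters = ratings.shape[0]
    weights = np.empty(n_raters)
    for r in range(n_raters):
        others_mean = np.delete(ratings, r, axis=0).mean(axis=0)
        weights[r] = np.corrcoef(ratings[r], others_mean)[0, 1]
    weights = np.clip(weights, 0.0, None)        # drop anti-correlated raters
    return weights @ ratings / weights.sum()     # weighted gold standard

gold = evaluator_weighted_estimator([[1, 2, 3, 4],
                                     [1.1, 2.1, 2.9, 4.2],
                                     [4, 1, 3, 2]])   # third rater is noisy
```

The noisy third rater receives weight zero after clipping, so the gold standard follows the two agreeing raters.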

Weakly Supervised Learning.

• Transfer Learning Re-use data of related domain

• Active Learning Select “most informative” instances from a large amount of unlabelled data

• Semi-Supervised Learning Have computer label the data

• Cooperative Learning Efficient combination of the above


Transfer Learning.


• Transfer Learning: Sparse Auto-Encoder (target values = input)

Constraining the activation of hidden units to be sparse

Training on target, then transferring source, … and vice versa …

“Linked Source and Target Domain Subspace Feature Transfer Learning", ICPR, 2014.

“Autoencoder-based Unsupervised Domain Adaptation for Speech Emotion Recognition”, IEEE Signal Processing Letters, 2014.

% UA / CC     Target   w/o    DAE    DAE-NN
ComParE:EC    60.4     56.3   59.2   64.2
M S: A        .82      .05    .32    -
M S: V        .51      .06    .16    -
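The autoencoder-based adaptation idea can be sketched with off-the-shelf tools. This is a minimal sketch on synthetic stand-in features, using sklearn's MLPRegressor as a plain (not sparse) autoencoder, so it illustrates the principle rather than the cited architecture: train the autoencoder on unlabelled target-domain data, then map labelled source features through it before training the classifier.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# hypothetical stand-ins for acoustic feature matrices
X_target = rng.normal(size=(200, 20))            # unlabelled target domain
X_source = rng.normal(loc=0.5, size=(200, 20))   # labelled source domain
y_source = (X_source[:, 0] > 0.5).astype(int)

# autoencoder trained to reconstruct target-domain features
ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
ae.fit(X_target, X_target)

# map source features towards the target domain via reconstruction
X_source_adapted = ae.predict(X_source)

# classifier trained on the domain-adapted source features
clf = LinearSVC().fit(X_source_adapted, y_source)
```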

Cooperative Learning.


[Flowchart: Labelled instances → Train → Model → Classify unlabelled instances → Confidence & Sparseness: high confidence → SSL; likely sparse class → AL; else → discard; newly labelled instances are added. Stop criteria: no data ‘likely’ of the sparse class, or accuracy saturation → Final model]

“Cooperative Learning and its Application to Emotion Recognition from Speech,” IEEE Transactions on Audio, Speech and Language Processing 23(1):115-126, 2015.

• Example: ComParE:EC
1) Active Learning (AL)
+ 2) Semi-Supervised Learning
= 3) Cooperative Learning
   cross-view (xv) / multi-view (mv)
4) Dynamic: further 79.2% reduction

Improvement of UA ≈ 5.0%
95.0% of labelling effort reduced

Cooperative Learning.

“Cooperative Learning and its Application to Emotion Recognition from Speech,” IEEE Transactions on Audio, Speech and Language Processing 23(1):115-126, 2015.
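One round of the cooperative-learning loop can be sketched as below; this is a simplified sketch on synthetic two-class data, with a stand-in oracle in place of a human rater, not the exact algorithm of the cited paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cooperative_step(model, X_lab, y_lab, X_pool, oracle, sparse_class,
                     conf_hi=0.9):
    """One cooperative-learning round: confidently classified pool
    instances keep their machine label (semi-supervised learning);
    uncertain instances that look like the sparse class go to a human
    oracle (active learning); everything else is discarded."""
    proba = model.predict_proba(X_pool)
    conf, pred = proba.max(axis=1), proba.argmax(axis=1)
    ssl = conf >= conf_hi                        # machine-labelled
    al = ~ssl & (pred == sparse_class)           # human-labelled
    y_new = pred.copy()
    y_new[al] = oracle(X_pool[al])               # ask the human rater
    keep = ssl | al
    return (np.vstack([X_lab, X_pool[keep]]),
            np.concatenate([y_lab, y_new[keep]]))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)
X_lab, y_lab = X[::5], y[::5]                    # small labelled seed set
model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
oracle = lambda Xq: (Xq.sum(axis=1) > 3).astype(int)  # stand-in human
X_aug, y_aug = cooperative_step(model, X_lab, y_lab, X, oracle, sparse_class=1)
```

Iterating this step until the stop criteria (no likely sparse-class data left, or accuracy saturation) yields the final model.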

• Human Confidence: Predict human labeller agreement by ±1 individual

• External Confidence: Learning behaviour of the recogniser; adaptation to target domain: semi-supervised learning

Confidence Measures.

“Confidence Measures for SER: a Start”, IEEE/ITG SC, 2012.
“Confidence Measures in SER Based on Semi-supervised Learning”, ISCA Interspeech, 2012.

AI / ML System

Holism (or: if we had the labels…)

• Multiple Targets: There is just one vocal production mechanism…

Only one Vocal Tract

[Figure: the vocal apparatus — nasal cavity, oral cavity, velum, palate, tongue, teeth, lips, jaw, pharynx, glottis, supra-glottal and sub-glottal systems]

% UA          Single                Multiple
Likability    59.1   (+A,G,Cl)      62.2
Neuroticism   62.9   (+G,OCEA,Cl)   67.5

[Word cloud: one voice — drunk, angry, has a cold, neurotic, tired, has Parkinson‘s, is older, +<label> (e.g., “burst”)]
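The one-vocal-tract argument motivates predicting several paralinguistic labels from one shared model. A minimal multi-target sketch, using a multi-label MLP whose single hidden layer is shared by both targets; features and targets are synthetic stand-ins, not the slide's tasks.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))                 # hypothetical acoustic features
# two related binary targets that share structure in the features
Y = np.column_stack([(X[:, 0] + X[:, 1] > 0),
                     (X[:, 0] - X[:, 2] > 0)]).astype(int)

# one shared hidden layer serves both targets (multi-label training)
mtl = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mtl.fit(X, Y)
pred = mtl.predict(X)                          # one column per target
```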

Processing: Holism?…

Signal Capture → s[k] → Pre-processing → s‘[k] → Feature Extraction → x → Learning / Decision → y

Learning at Signal‘s Edge

“Big Data vs. Little Labels”

Efficient, Weakly Supervised

• openSMILE
Brute-force high-dimensional spaces
(Android / C++)
Online update

Feature Extraction.

“Recent Developments in openSMILE, the Open-Source Multimedia Feature Extractor”, ACM Multimedia, 2013.
(2nd place, ACM MM Open Source Software Competition in 2010 and 2013; ~600 citations for 3 papers)

[Diagram: sensor signal → filtering → chunking → LLDs (energy, harmonicity, fundamental frequency, time-frequency transform, spectral) → deriving / filtering / regression → functionals (extremes, moments, peaks, segments)]

#features   RTF
10k         2%
500k        3%
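The brute-force recipe — frame-wise low-level descriptors (LLDs) crossed with statistical functionals — can be illustrated in a few lines. The two LLDs and four functionals below are a hypothetical minimal set for illustration, not an openSMILE configuration.

```python
import numpy as np

def lld_functionals(signal, frame_len=400, hop=160):
    """Sketch of the LLD x functionals scheme: frame-wise low-level
    descriptors, then statistical functionals over each LLD contour."""
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_len)[::hop]
    # two LLDs per frame: log-energy and zero-crossing rate
    energy = np.log(np.maximum((frames ** 2).mean(axis=1), 1e-12))
    zcr = (np.diff(np.signbit(frames), axis=1) != 0).mean(axis=1)
    llds = [energy, zcr]
    # four functionals over each contour: mean, std, min, max
    funcs = [np.mean, np.std, np.min, np.max]
    return np.array([f(c) for c in llds for f in funcs])

# one second of a 440 Hz tone at 16 kHz as a toy input
feat = lld_functionals(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

Crossing thousands of LLD variants with dozens of functionals is what yields the 10k-500k dimensional brute-force spaces mentioned above.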

• Geneva Minimalistic Set
GeMAPS: 18 LLDs / 62 functionals
Pitch, Jitter, Formant 1-3, Formant 1 bandwidth, Shimmer, Loudness, HNR, Alpha Ratio, Hammarberg Index, Spectral Slope 0–500 Hz and 500–1500 Hz, Formant 1-3 relative energy, Harmonic difference H1–H2, Harmonic difference H1–A3

Extended: +7 LLDs / 88 functionals
MFCC 1–4, Spectral Flux, Formant 2–3 bandwidth

Less is More?

“The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing”, IEEE Transactions on Affective Computing, 2015.

• Face Reading from Speech
GEMEP corpus, SVM vs Deep LSTM-NNs, big (65) vs small (28) feature set

“Face Reading from Speech – Predicting Facial Action Units from Audio Cues”, Interspeech, 2015.

Action Units. ARIA VALUSPA


• Audio Words
Base: UA/WA = 54.3/61.2%
VQ vs SVQ
VAM: Valence

Bag-of-X-Words

“Detection of Negative Emotions in Speech Signals Using Bags-of-Audio-Words”, WASA, 2015.
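The bag-of-audio-words idea can be sketched with plain k-means vector quantisation (VQ): quantise all frame-level feature vectors against a learned codebook, then describe each utterance by a normalised histogram of codeword counts. The data here are synthetic stand-ins, and codebook size is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_audio_words(utterances, n_words=16, seed=0):
    """Bag-of-audio-words sketch: k-means codebook over all frames,
    then a normalised codeword histogram per utterance."""
    codebook = KMeans(n_clusters=n_words, n_init=10, random_state=seed)
    codebook.fit(np.vstack(utterances))        # learn the audio words
    hists = []
    for frames in utterances:
        words = codebook.predict(frames)       # quantise each frame
        hist = np.bincount(words, minlength=n_words).astype(float)
        hists.append(hist / hist.sum())        # utterance-level histogram
    return np.array(hists)

rng = np.random.default_rng(0)
# five toy "utterances" of varying length, 6-dim frame features
utts = [rng.normal(size=(rng.integers(50, 100), 6)) for _ in range(5)]
boaw = bag_of_audio_words(utts)
```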

Deep Learning.

• Deep Neural Networks
DNN = MLP…
Training makes the difference:
Build them layer-wise → fewer uninitialized parameters
Use “raw” features as input → net learns its own higher-level features
Use more training data → on-line learning (stochastic gradient descent)
Use ReLUs, drop-out learning, …


• LSTM Cell
Linear unit with self-weight 1: the “Error Carousel” (EC)
Non-linear gates: Input (I) / Output (O) / Forget (F)
Multiplicative open / shutdown

Deep LSTM Nets.

Example: ComParE:SSC

% AUC (Acc.)   Test
NN             83.3
Deep NN        92.4
LSTM           93.0
Deep LSTM      94.0

“Social Signal Classification Using Deep BLSTM Recurrent Neural Networks”, IEEE ICASSP, 2014.(Best Result ComParE:SSC)
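The gated cell described above can be sketched as a single forward step; weights and inputs are random toy values, and this is a textbook LSTM step rather than the exact cited architecture.

```python
import numpy as np

def lstm_cell_step(x, h, c, W, U, b):
    """One LSTM step: input, forget, and output gates multiplicatively
    open/shut the flow around the linear 'error carousel' cell state c,
    which carries information (and gradients) with self-weight 1."""
    z = W @ x + U @ h + b                      # all pre-activations at once
    i, f, o, g = np.split(z, 4)                # gate and candidate parts
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c_new = sig(f) * c + sig(i) * np.tanh(g)   # carousel: c -> c
    h_new = sig(o) * np.tanh(c_new)            # gated output
    return h_new, c_new

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):           # run a short toy sequence
    h, c = lstm_cell_step(x, h, c, W, U, b)
```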

• “Very Deep” NN

Deep Recurrent Nets.

[Diagram: network unrolled over time steps …, t−1, t; backpropagation through time]

Parallel Learning.

• CURRENNT: 10–1k LSTM cells, 2k–4 Mio parameters, distribution

“Introducing CURRENNT - the Munich Open-Source CUDA RecurREnt Neural Network Toolkit”, Journal of Machine Learning Research, 2014.

• Semi-NMF

Matrix factorization; uncovers meaningful features
Clustering interpretation:

Z: cluster centroids
H: soft membership for every data point

Soft version of k-means clustering – equivalent iff H is orthogonal, with H_kj = 1 if x_j belongs to cluster k and 0 otherwise

Better if the data is not distributed in a spherical manner

Learning Hidden Representations.

Deep Semi-NMF.

“A Deep Semi-NMF Model for Learning Hidden Representations”, ICML, 2014.
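A single-layer Semi-NMF can be sketched with alternating updates: a least-squares step for the unconstrained Z, and a multiplicative step (in the style of Ding et al.'s semi-NMF rules) that keeps H non-negative. Data and shapes here are synthetic.

```python
import numpy as np

def semi_nmf(X, k, iters=200, seed=0):
    """Semi-NMF sketch: X ≈ Z H with H >= 0 (soft memberships) and
    Z unconstrained in sign (cluster centroids)."""
    rng = np.random.default_rng(seed)
    H = np.abs(rng.normal(size=(k, X.shape[1]))) + 0.1
    pos = lambda A: (np.abs(A) + A) / 2        # positive part
    neg = lambda A: (np.abs(A) - A) / 2        # negative part
    for _ in range(iters):
        Z = X @ H.T @ np.linalg.pinv(H @ H.T)  # least-squares Z given H
        ZtX, ZtZ = Z.T @ X, Z.T @ Z
        num = pos(ZtX) + neg(ZtZ) @ H
        den = neg(ZtX) + pos(ZtZ) @ H + 1e-12
        H *= np.sqrt(num / den)                # multiplicative, H stays >= 0
    return Z, H

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 40))                   # toy data matrix
Z, H = semi_nmf(X, k=3)
```

The deep variant of the cited paper stacks such factorisations, X ≈ Z1 Z2 … H, learning one hidden representation per level.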

Deep Semi-NMF.


UA/AUC [%]        Pose    Emotion   ID
Semi-NMF          94.32   44.72     50.33
Deep Semi-NMF     99.78   73.33     74.79
A priori DSNMF    99.99   76.67     80.44

“A Deep Semi-NMF Model for Learning Hidden Representations”, ICML, 2014.

Distribution (or: get more data…)

Distribution.

“Distributing Recognition in Computational Paralinguistics,” IEEE Transactions on Affective Computing 5:406–417, 2014.

[Chart: original features vs 12 MFCC]

• Mobile Platform
Samsung Galaxy S3 (Android) vs Intel Core i3 2.1 GHz Laptop

Embedding.

“Real-time Robust Recognition of Speakers Emotions and Characteristics on Mobile Platforms”, ACII, 2015.

Set                         #feats.   RTF (S3)   RTF (i3)
GeMAPS Extended             88        2.63       0.11
Interspeech Emotion Ch.     384       0.43       0.04
Interspeech ComParE         6373      2.81       0.12

Outlook

• Interactive Emotion Game
• For Children with Autism Spectrum Condition

Serious Game.

• Assessing Public Speaking Skills
ICASSP 2011 lectures, 3 classes (from 5) or continuous, 5 raters

“Does my Speech Rock? Automatic Assessment of Public Speaking Skills”, Interspeech, 2015.

Public Speeches.


Agents 2.0.


ARIA VALUSPA

• 24/7 Coop Learning

• Big Data Multi-task DL

• Multilingualism

• “Green” Learning

• Evolving Machines

• Products & RL

• CP from a Chips Bag?

[Diagram: Deep BLSTM with priors and low-level posteriors; Targets 1…N, each with a confidence; uncertainty-weighted combination]
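The uncertainty-weighted combination in the outlook could, for instance, take the form of inverse-variance weighting across targets or models; this hypothetical sketch is one common choice, not the specific scheme planned in the talk.

```python
import numpy as np

def uncertainty_weighted_fusion(preds, variances):
    """Combine N predictions of the same quantity, weighting each by
    its inverse variance (more certain outputs count more)."""
    preds, variances = np.asarray(preds, float), np.asarray(variances, float)
    w = 1.0 / variances                       # confidence = inverse variance
    return (w * preds).sum(axis=0) / w.sum(axis=0)

# three model outputs with differing uncertainty
fused = uncertainty_weighted_fusion([0.8, 0.2, 0.5], [0.1, 1.0, 0.5])
```

The fused value lands closest to the most confident (lowest-variance) prediction.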

And Next…