Predicting Voice Elicited Emotions
Nishant Pandey
Synopsis
• Problem statement and motivation
• Previous work and background
• System
  • Intuition and overview
  • Pre-processing of audio signals
  • Building the feature space
  • Finding patterns in unlabelled data and labelling of samples
  • Regression results
• Deployed system
• Market research
Motivation
• Automate the screening process in service-based industries
• Hourly job workers (two-thirds of the U.S. labour force, or ~50 million job seekers every year)

Problem Statement
• Analyse voice and predict the listener emotions elicited by the paralinguistic elements of the voice
Previous work
Current work focuses on predicting the emotions elicited by voice clips. It pursues two sets of goals, which include recognizing:
• the personality traits intrinsically possessed by the speaker, e.g. speaker traits and speaker states
• the emotions carried within the speech clip, e.g. acoustic affect (cheerful, trustworthy, deceitful, etc.)
Background – Emotion Taxonomy
The framework articulated by “FEELTRACE”:
• Includes all the emotion responses we want to predict
• Represents emotions along finite, quantifiable dimensions
Features – Paralinguistic Features of Voice

| Concept | Definition | Data Representation |
| --- | --- | --- |
| Amplitude | measurement of the variations over time of the acoustic signal | quantified values of a sound wave’s oscillation |
| Energy | acoustic signal energy | representation in decibels: 20*log10(abs(FFT)) |
| Formants | the resonance frequencies of the vocal tract | maxima detected using linear prediction on audio windows with high tonal content |
| Perceived pitch | perceived fundamental frequency and harmonics | formants |
| Fundamental frequency | the reciprocal of the time duration of one glottal cycle (a strict definition of “pitch”) | first formant |
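The energy row’s 20*log10(abs(FFT)) representation can be sketched per frame in Python. The 512-sample frame, 256-sample hop, and 16 kHz test tone are illustrative assumptions, not parameters from the slides:

```python
import numpy as np

def frame_energy_db(signal, frame_len=512, hop=256, eps=1e-12):
    """Per-frame energy in dB, following the table's 20*log10(abs(FFT))
    convention (magnitudes summed across bins within each frame)."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len]))
        energies.append(20.0 * np.log10(spectrum.sum() + eps))
    return np.array(energies)

# A one-second 440 Hz tone at 16 kHz as a toy input.
sr = 16000
t = np.arange(sr) / sr
e = frame_energy_db(np.sin(2 * np.pi * 440.0 * t))
```

The `eps` guard keeps the logarithm finite on silent frames.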
System – Intuition
Spectrogram of two job applicants responding to “Greet me as if I am a customer”
System – Overview
System – Pre-Processing of Audio Signals
Pre-processing tasks involve:
• Removing voice clips shorter than 2 seconds or containing noise
• Converting the audio signal to data in the time and frequency domains
• Short-term fast Fourier transform per frame
• Energy measures in the frequency domain per frame
• Linear prediction coefficients in the frequency domain per frame
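A minimal sketch of the first two steps, assuming 16 kHz audio; the 2-second threshold is from the slide, while the frame and hop sizes and the Hann window are illustrative assumptions:

```python
import numpy as np

def preprocess(clip, sr=16000, min_seconds=2.0, frame_len=512, hop=256):
    """Reject clips shorter than the minimum length, then compute a
    short-term FFT magnitude per frame (time x frequency)."""
    if len(clip) < min_seconds * sr:
        return None  # too short to analyse
    frames = np.stack([clip[i:i + frame_len]
                       for i in range(0, len(clip) - frame_len + 1, hop)])
    window = np.hanning(frame_len)  # taper each frame before the FFT
    return np.abs(np.fft.rfft(frames * window, axis=1))

sr = 16000
too_short = np.zeros(sr)                   # 1 s clip: filtered out
mag = preprocess(np.random.randn(3 * sr))  # 3 s clip: processed
```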
System – Feature Space Construction
We experimented with feature construction based on the following dimensions and their combinations:
• Signal measurements such as energy and amplitude
• Statistics such as min, max, mean, and standard deviation on signal measurements
• Measurement window in the time domain: different window sizes and the entire time window
• Measurement window in the frequency domain: all frequencies, optimal audible frequencies, and selected frequency ranges
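As an illustration of the statistics dimension, the sketch below summarizes a per-frame measurement track with min/max/mean/std over fixed windows plus the entire time window; the 50-frame window size is a hypothetical choice:

```python
import numpy as np

def window_stats(track, win=50):
    """Min/max/mean/std over fixed windows of a per-frame measurement
    track, plus the same statistics over the entire time window."""
    segments = [track[i:i + win] for i in range(0, len(track) - win + 1, win)]
    segments.append(track)  # the entire time window as one more segment
    feats = []
    for seg in segments:
        feats.extend([seg.min(), seg.max(), seg.mean(), seg.std()])
    return np.array(feats)

# Stand-in per-frame energy track of 200 frames -> 4 windows + whole track.
f = window_stats(np.linspace(0.0, 1.0, 200))
```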
System – Labels and the Right Set of Features
• Conventional approach: getting voice samples rated by experts
• Unsupervised learning: analyse features and their effectiveness

Process:
1. Unsupervised learning is used to find patterns in unlabelled data.
2. Training data sets are then constructed based on the clustering results and manual labelling.
System – How do we get the labels? Contd.
Parameters
• Cost functions: connectivity, Dunn index, silhouette

Clustering results
• Technique: hierarchical clustering
• Number of clusters: 5
• Manual validation of the clusters was also done
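The clustering step can be sketched with SciPy's hierarchical clustering plus scikit-learn's silhouette score (one of the cost functions listed). The five well-separated toy blobs merely stand in for the real voice features:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy feature vectors: five tight blobs mirror the five clusters found.
X = np.vstack([rng.normal(loc=5.0 * k, scale=0.3, size=(20, 4))
               for k in range(5)])

Z = linkage(X, method="ward")                    # hierarchical clustering
labels = fcluster(Z, t=5, criterion="maxclust")  # cut the tree at 5 clusters
score = silhouette_score(X, labels)              # one listed cost function
```

On real, noisier features one would compare cluster counts by these cost functions rather than fixing 5 in advance.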
System – Visualization of clusters
System – Modelling
Supervised learning algorithms:
• Logistic regression
• Support vector machine
• Random forest

Semi-supervised learning algorithm:
• KODAMA

Output:
• Binary outcome (positive or negative)
• Numerical scores
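A sketch of fitting the three supervised learners on stand-in features; the data is synthetic and the slides do not specify hyperparameters, so scikit-learn defaults are used:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic stand-in for the per-clip voice features and binary labels.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=0),
}
# Held-out accuracy of each candidate model.
scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te)
          for name, model in models.items()}
```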
Case Study – Modelling
• Prediction: positive vs. negative response
• A positive response could be one or more of the perceptions “pleasant voice”, “makes me feel good”, “cares about me”, “makes me feel comfortable”, or “makes me feel engaged”
• System V1 uses SVM; V2 uses random forest
• Interview prompt: “Greet me as if I am a customer”
System - Prediction Results
• Accuracy: 0.86
• 95% CI: (0.76, 0.92)
• P-value [Acc > NIR]: 5.76e-07
• Sensitivity: 0.81
• Specificity: 0.88
• Pos. pred. value: 0.81
• Neg. pred. value: 0.88
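All of these metrics derive from a single confusion matrix. The counts below are hypothetical, chosen only so the ratios land near the slide's figures:

```python
def diagnostics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts, not the study's actual confusion matrix.
m = diagnostics(tp=25, fp=6, tn=44, fn=6)
```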
System – Prediction Results (KODAMA)
• KODAMA performs feature extraction from noisy, high-dimensional data.
• KODAMA’s output includes a dissimilarity matrix, from which we can perform clustering and classification.
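KODAMA itself is distributed as an R package, so as a sketch of the downstream step only: hierarchical clustering can be run directly on a precomputed dissimilarity matrix like the one KODAMA emits. The tiny matrix below is a toy stand-in:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy dissimilarity matrix standing in for KODAMA's output:
# samples 0-2 are mutually close, samples 3-5 are mutually close.
D = np.full((6, 6), 10.0)
D[:3, :3] = 1.0
D[3:, 3:] = 1.0
np.fill_diagonal(D, 0.0)

# Hierarchical clustering on the condensed dissimilarities (no raw
# feature vectors needed).
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```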
Deployed System
Market Research
• Demographics matter.
• Young listeners (18–29 years old) and listeners earning less than $29,000/year have stricter criteria for what they sense as engaging.
• No correlation between elicited emotion and age, ethnicity, or education level.
• Bias towards female voices.
Thanks
Time and Frequency Domain
• Time Domain: https://en.wikipedia.org/wiki/Time_domain#/media/File:Fourier_transform_time_and_frequency_domains_(small).gif
• Frequency Domain: https://en.wikipedia.org/wiki/Frequency_domain#/media/File:Fourier_transform_time_and_frequency_domains_(small).gif
Learnings – Difference in Voice Characteristics
• Results improve by 10% when a decision tree using voice-characteristic features is layered on top of the random forest.
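One possible reading of this layering, sketched as stacking a shallow decision tree over the forest's probability output together with hypothetical voice-characteristic features; the data and the exact stacking scheme are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))       # stand-in acoustic features
traits = rng.normal(size=(300, 2))  # hypothetical voice-characteristic features
y = ((X[:, 0] + traits[:, 0]) > 0).astype(int)

X_tr, X_te, t_tr, t_te, y_tr, y_te = train_test_split(
    X, traits, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Layer a shallow decision tree over the forest's positive-class
# probability plus the voice-characteristic features.
stack_tr = np.column_stack([rf.predict_proba(X_tr)[:, 1], t_tr])
stack_te = np.column_stack([rf.predict_proba(X_te)[:, 1], t_te])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(stack_tr, y_tr)
acc = tree.score(stack_te, y_te)
```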
Prediction Results – SVM vs Random Forest