Final project new

Under guidance of Dr. G. Pradhan, NIT PATNA (ECE dept.). Presented by Kamlesh Kalvaniya (1104080), Niranjan Kumar (1104087), Piyush Kumar (1104091), B.TECH 4th yr (ECE dept.). 06/13/2022, N.I.T. PATNA ECE, DEPTT. SPEAKER RECOGNITION UNDER LIMITED DATA CONDITION

Transcript of Final project new

Page 1: Final project new

Under guidance of Dr. G. Pradhan, NIT PATNA (ECE dept.)

Presented by:
Kamlesh Kalvaniya (1104080)
Niranjan Kumar (1104087)
Piyush Kumar (1104091)
B.TECH 4th yr (ECE dept.), N.I.T. PATNA ECE, DEPTT.

SPEAKER RECOGNITION UNDER LIMITED DATA CONDITION

Page 2: Final project new

OUTLINE

1. Introduction
2. Baseline speaker verification system
3. Future Plan

Page 3: Final project new

Introduction

Speaker recognition is the computing task of validating a person's identity claim from his/her voice.

Applications:
• Authentication
• Forensic tests
• Security systems
• ATM security key
• Personalized user interfaces
• Multi-speaker tracking
• Surveillance

Page 4: Final project new

04/15/2023 N.I.T. PATNA ECE, DEPTT. 4

Identification v/s verification

Page 5: Final project new


Phases of Speaker Verification
• Enrollment session or training phase
• Operating session or testing phase

Page 6: Final project new


Training & Testing Phase

[Figure: block diagram. Training: speech → pre-processing → feature extraction → model building → reference model. Testing: speech and identity claim → pre-processing → feature extraction → comparison with the reference model → decision logic → accept/reject.]

Page 7: Final project new


Preprocessing

Preprocessing is an important step in a speaker verification system. It is also called voice activity detection (VAD).

VAD separates speech regions from non-speech regions [2-3]. It is very difficult to implement a VAD algorithm that works consistently for different types of data. VAD algorithms can be classified into two groups:
• Feature-based approaches
• Statistical-model-based approaches

Each VAD method has its own merits and demerits in terms of accuracy, complexity, etc.

Because of its simplicity, most speaker verification systems use signal energy for VAD.
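The energy-based VAD used in the baseline (threshold 0.6 × average energy) can be sketched as follows. This is a minimal sketch, assuming energy is computed per frame and averaged over all frames of the utterance; the helper name `energy_vad` and the frame length/hop in the usage are illustrative assumptions.

```python
import numpy as np


def energy_vad(signal, frame_len, hop, ratio=0.6):
    """Mark frames whose energy exceeds ratio * (average frame energy).

    ratio=0.6 follows the baseline setting; per-frame energy is an assumption.
    Returns one boolean per frame (True = speech).
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.array([
        np.sum(signal[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    return energies > ratio * energies.mean()


# Usage: 1 s of silence, 1 s of a 200 Hz tone, 1 s of silence at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
x = np.concatenate([np.zeros(fs), np.sin(2 * np.pi * 200 * t), np.zeros(fs)])
mask = energy_vad(x, frame_len=200, hop=80)  # 25 ms frames, 10 ms hop
```

Frames flagged False would be discarded before feature extraction. A fixed ratio of the average energy adapts to the overall recording level, but it can fail on very noisy or mostly silent recordings.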

Page 8: Final project new


Feature Extraction

Besides speaker information, the speech signal contains much redundant information, such as the recording sensor, channel and environment. The speaker-specific information in the speech signal comes from [2]:
• the unique speech production system
• physiological aspects
• behavioral aspects

The feature extraction module transforms speech into a set of feature vectors of reduced dimension in order to:
• enhance speaker-specific information
• suppress redundant information

Page 9: Final project new


Selection of Features

Features should ideally:
• be robust against noise and distortion
• occur frequently and naturally in speech
• be easy to measure from the speech signal
• be difficult to impersonate/mimic
• not be affected by the speaker's health or long-term variations in voice

Page 10: Final project new


Types Of Features

Page 11: Final project new


Feature Extraction Techniques

A wide range of approaches may be used to parametrically represent the speech signal for speaker recognition:
• Linear Prediction Coding (LPC)
• Linear Predictive Cepstral Coefficients (LPCC)
• Mel Frequency Cepstral Coefficients (MFCC)
• Perceptual Linear Prediction (PLP)
• Neural Predictive Coding (NPC)

Most state-of-the-art speaker verification systems use Mel-frequency cepstral coefficients (MFCC), appended with their first- and second-order derivatives, as the feature vectors, because MFCC:
• is easy to extract
• generally gives the best performance among these features
• mostly contains information about the resonance structure of the vocal tract system
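Appending first- and second-order derivatives to the MFCC can be sketched with the common regression formula d_t = Σ_{n=1}^{N} n (c_{t+n} − c_{t−n}) / (2 Σ_{n=1}^{N} n²). The window N = 2 and the replication of edge frames are assumptions; the slides do not state how the derivatives were computed.

```python
import numpy as np


def deltas(features, N=2):
    """Regression-based time derivative of a (T, D) feature matrix.

    Edge frames are replicated so the output keeps T frames.
    """
    features = np.asarray(features, dtype=float)
    T = features.shape[0]
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features)
    for n in range(1, N + 1):
        # padded[N + t] is original frame t, so these slices give c_{t+n} and c_{t-n}
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom


def add_deltas(mfcc, N=2):
    """Stack static, delta and delta-delta features: (T, D) -> (T, 3D)."""
    d1 = deltas(mfcc, N)
    d2 = deltas(d1, N)
    return np.hstack([mfcc, d1, d2])


feats = np.tile(np.arange(10.0)[:, None], (1, 13))  # toy (T=10, 13) "MFCC" ramp
full = add_deltas(feats)                             # 13 + 13 + 13 = 39 dimensions
```

For a linear ramp the interior deltas equal the slope (1.0) and the interior delta-deltas are zero, which is a quick sanity check of the formula.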

Page 12: Final project new


MEL FREQUENCY CEPSTRAL COEFFICIENTS

1. Analog-to-digital conversion
2. Pre-emphasis
3. Framing & windowing
4. Fast Fourier Transform
5. Mel-scale warping
6. MFCC computation (DCT)

Page 13: Final project new


MFCC

Step 1: Analog-to-digital conversion. The analog speech signal is transformed into digital form by sampling it at a given frequency.

Page 14: Final project new


MFCC

Step 2: Pre-emphasis. The energy in the high frequencies (important for speech) is boosted.
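Pre-emphasis is usually implemented as a first-order high-pass filter y[n] = x[n] − α·x[n−1]. A minimal sketch, assuming α = 0.97, a commonly used value that the slides do not specify:

```python
import numpy as np


def pre_emphasis(x, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1]; y[0] = x[0]."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])


y = pre_emphasis([1.0, 1.0, 1.0, 1.0])  # a constant (DC) input is strongly attenuated
```

Low-frequency content, which changes slowly from sample to sample, is largely cancelled, so the relative weight of the high frequencies increases.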

Page 15: Final project new


MFCC

Step 3: Framing. The signal is divided into frames of a given size.

Page 16: Final project new


MFCC FRAMING

Page 17: Final project new


MFCC FRAMING

Page 18: Final project new


MFCC FRAMING

Page 19: Final project new


MFCC FRAMING

[Figure: framing with a 25 ms frame length and a 10 ms frame shift]
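A minimal sketch of framing with a 25 ms frame length and a 10 ms frame shift, the values on the framing slide. It assumes a trailing partial frame is simply dropped; `frame_signal` is a hypothetical helper name.

```python
import numpy as np


def frame_signal(x, fs, frame_ms=25.0, shift_ms=10.0):
    """Split x into overlapping frames; returns shape (n_frames, frame_len)."""
    x = np.asarray(x, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * shift_ms / 1000.0))
    if len(x) < frame_len:                      # pad very short inputs to one frame
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // hop  # trailing partial frame is dropped
    starts = hop * np.arange(n_frames)[:, None]
    return x[starts + np.arange(frame_len)[None, :]]


frames = frame_signal(np.arange(8000.0), fs=8000)  # 1 s at 8 kHz
```

At 8 kHz a 25 ms frame is 200 samples and a 10 ms shift is 80 samples, so consecutive frames overlap by 15 ms.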

Page 20: Final project new


MFCC WINDOWING

• The next step is to window each individual frame to minimize the signal discontinuities at the beginning and end of each frame.

• The idea is to minimize spectral distortion by using the window to taper the signal towards zero at the beginning and end of each frame.

• We have used the Hamming window.
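The Hamming window is w[n] = 0.54 − 0.46·cos(2πn/(N − 1)), n = 0, …, N − 1. Its end values are 0.08 rather than exactly zero, so it tapers each frame towards zero. A sketch applying it to a (frames × samples) array; the helper name is hypothetical:

```python
import numpy as np


def hamming_window_frames(frames):
    """Multiply each row (frame) by w[n] = 0.54 - 0.46 cos(2*pi*n / (N - 1))."""
    frames = np.asarray(frames, dtype=float)
    N = frames.shape[1]
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
    return frames * w  # broadcasting applies w to every frame


windowed = hamming_window_frames(np.ones((3, 200)))
```

The same window is available directly as `np.hamming(N)`; writing the formula out just makes the taper explicit.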

Page 21: Final project new


MFCC

Page 22: Final project new


MFCC

Page 23: Final project new


MEL FILTERBANK
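Steps 4 and 5 (FFT and mel-scale warping) can be sketched as a power spectrum followed by a bank of triangular filters spaced uniformly on the mel scale, mel(f) = 2595·log10(1 + f/700). The FFT size (512), the number of filters (26) and the 8 kHz sampling rate in the usage are assumptions, not values given in the slides.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_filters, nfft, fs):
    """Triangular filters equally spaced in mel, shape (n_filters, nfft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank


def mel_energies(windowed_frames, fs, nfft=512, n_filters=26):
    """Power spectrum of each windowed frame, summed through the mel filters."""
    spectrum = np.fft.rfft(windowed_frames, n=nfft, axis=1)
    power = (np.abs(spectrum) ** 2) / nfft
    return power @ mel_filterbank(n_filters, nfft, fs).T


fb = mel_filterbank(26, 512, 8000)
rng = np.random.default_rng(0)
energies = mel_energies(rng.standard_normal((5, 200)), fs=8000)
```

The filters are narrow at low frequencies and wide at high frequencies, mimicking the frequency resolution of human hearing.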

Page 24: Final project new


MFCC

DCT

Page 25: Final project new


MFCC

DCT
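Step 6 takes the logarithm of the mel filterbank energies and applies a discrete cosine transform (DCT-II); keeping the first 13 coefficients matches the 13-dimensional MFCC of the baseline. This sketch keeps c0 as the first coefficient, which is an assumption (some systems replace it with the log frame energy).

```python
import numpy as np


def mfcc_from_mel_energies(mel_energies, n_ceps=13, eps=1e-10):
    """Log mel energies (T, M) -> first n_ceps DCT-II coefficients (T, n_ceps)."""
    log_e = np.log(np.maximum(mel_energies, eps))  # eps avoids log(0)
    M = log_e.shape[1]
    n = np.arange(n_ceps)[:, None]
    m = np.arange(M)[None, :]
    basis = np.cos(np.pi * n * (m + 0.5) / M)      # (n_ceps, M) DCT-II basis
    return log_e @ basis.T


energies = np.full((3, 26), np.e)  # constant energies, so every log value is 1
ceps = mfcc_from_mel_energies(energies)
```

A flat log spectrum puts all of its weight in c0 (here 26 × 1) and leaves the higher coefficients at zero; the higher coefficients describe the spectral envelope shape.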

Page 26: Final project new


Speaker Modelling

• Vector Quantization
• Gaussian Mixture Model
• Gaussian Mixture Model-UBM
• Hidden Markov Model
• Artificial Neural Networks
• Support Vector Machines
• i-vector

A Gaussian mixture model assumes that the feature vectors follow a mixture of Gaussian distributions, characterized by mean vectors, covariance matrices and weights.

Data unseen in training that appears in the test data will yield a low score.

The speaker model captures the statistical information present in the feature vectors; it enhances the speaker information and suppresses the redundant information.

Page 27: Final project new


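Gaussian Mixture Model

A Gaussian mixture density is defined as:

$$p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad \sum_{i=1}^{M} w_i = 1$$

Each component is a unimodal Gaussian function in D dimensions:

$$g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{(2\pi)^{D/2} \, |\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{\top} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right)$$

where

$$\lambda = \{ w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \}, \quad i = 1, \dots, M$$

w_i = weight; µ_i = mean; Σ_i = covariance.

Number of Gaussian components (GMM size): M = 8, 16, 32, 64. Number of speaker models: 356.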
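The GMM density p(x|λ) = Σ_i w_i g(x|µ_i, Σ_i) can be evaluated per feature vector as follows. This sketch assumes diagonal covariance matrices (a common choice in GMM-based speaker verification, not stated in the slides) and uses the log-sum-exp trick for numerical stability.

```python
import numpy as np


def gmm_log_likelihood(X, weights, means, variances):
    """Per-frame log p(x_t | lambda) for a diagonal-covariance GMM.

    X: (T, D) feature vectors; weights: (M,); means and variances: (M, D).
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    D = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_g = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(variances), axis=1)[None, :]  # log|Sigma_i|
                    + np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_wg = np.log(weights)[None, :] + log_g                     # (T, M)
    top = log_wg.max(axis=1, keepdims=True)                       # log-sum-exp
    return (top + np.log(np.exp(log_wg - top).sum(axis=1, keepdims=True))).ravel()


# Two identical 1-D standard-normal components reduce to a single N(0, 1).
ll = gmm_log_likelihood([[0.0], [1.0]], weights=[0.5, 0.5],
                        means=[[0.0], [0.0]], variances=[[1.0], [1.0]])
```

Summing these per-frame values gives log p(X|λ) for a whole utterance, assuming the frames are independent.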

Page 28: Final project new


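MAXIMUM LIKELIHOOD PARAMETER ESTIMATION

For a sequence of T training vectors X = {x_1, x_2, …, x_T}, and assuming the vectors are independent, the GMM likelihood is defined as:

$$p(X \mid \lambda) = \prod_{t=1}^{T} p(\mathbf{x}_t \mid \lambda)$$

The expectation-maximization (EM) algorithm is used to estimate the speaker-specific GMM.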
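The slides state that EM is used to estimate the speaker-specific GMM. A minimal EM sketch for a diagonal-covariance GMM: the E-step computes the posterior probability of each component for every frame, and the M-step re-estimates weights, means and variances from those posteriors. The random initialization from data points, the fixed iteration count and the variance floor are assumptions.

```python
import numpy as np


def train_gmm_em(X, M, n_iter=20, var_floor=1e-3, seed=0):
    """Fit an M-component diagonal GMM to X (T, D) with plain EM."""
    X = np.asarray(X, dtype=float)
    T, D = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(T, size=M, replace=False)].copy()  # assumed init
    variances = np.tile(X.var(axis=0) + var_floor, (M, 1))
    weights = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: gamma[t, i] = P(component i | x_t)
        diff = X[:, None, :] - means[None, :, :]
        log_p = (np.log(weights)[None, :]
                 - 0.5 * (D * np.log(2 * np.pi)
                          + np.sum(np.log(variances), axis=1)[None, :]
                          + np.sum(diff ** 2 / variances[None, :, :], axis=2)))
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        n_i = gamma.sum(axis=0) + 1e-10                     # soft counts
        weights = n_i / T
        means = (gamma.T @ X) / n_i[:, None]
        variances = (gamma.T @ X ** 2) / n_i[:, None] - means ** 2
        variances = np.maximum(variances, var_floor)
    return weights, means, variances


rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 1.0, (300, 2)), rng.normal(3.0, 1.0, (300, 2))])
w, mu, var = train_gmm_em(X, M=4)
```

Each EM iteration does not decrease the training likelihood; in practice one stops after a fixed number of iterations or when the likelihood gain becomes small.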

Page 29: Final project new
Page 30: Final project new


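GMM UBM

Two hypotheses are tested for X, the MFCC feature vectors of the testing data:
• λ_target: X is from the hypothesized speaker S.
• λ_UBM: X is not from the hypothesized speaker S.

The likelihood ratio test is given by:

$$LR(X) = \frac{p(X \mid \lambda_{\text{target}})}{p(X \mid \lambda_{\text{UBM}})} \; \begin{cases} \geq \theta & \text{accept} \\ < \theta & \text{reject} \end{cases}$$

where θ is the decision threshold. The likelihood of the alternative hypothesis is

$$p(X \mid \lambda_{\text{UBM}}) = F\big( p(X \mid \lambda_1), p(X \mid \lambda_2), \dots, p(X \mid \lambda_B) \big)$$

where F(·) is a function, such as the average or the maximum, of the likelihood values p(X | λ_b) of the background speaker set of B speakers.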
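In the log domain the test becomes log LR(X) = log p(X|λ_target) − log p(X|λ_UBM), compared with log θ. The sketch below assumes per-frame log-likelihoods are averaged over the T frames (a common normalization for utterance length), and shows F as either the average or the maximum of the background likelihoods, computed stably from log-likelihoods. All helper names are hypothetical.

```python
import numpy as np


def log_lr_score(frame_ll_target, frame_ll_ubm):
    """log LR(X), with per-frame log-likelihoods averaged over the T frames."""
    return float(np.mean(frame_ll_target) - np.mean(frame_ll_ubm))


def combine_background(utt_logliks, how="average"):
    """log F(p(X|l_1), ..., p(X|l_B)) from utterance log-likelihoods log p(X|l_b)."""
    ll = np.asarray(utt_logliks, dtype=float)
    if how == "maximum":
        return float(ll.max())
    top = ll.max()                        # log of the mean, computed stably
    return float(top + np.log(np.mean(np.exp(ll - top))))


def decide(score, log_threshold):
    return "accept" if score >= log_threshold else "reject"


score = log_lr_score([-41.0, -39.0], [-42.5, -41.5])  # -40.0 - (-42.0) = 2.0
```

When F is used, its output plays the role of log p(X|λ_UBM) and is subtracted from the target log-likelihood computed on the same scale.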

Page 31: Final project new


Score Normalisation

$$\tilde{s} = \frac{s - \mu_I}{\sigma_I}$$

where s = log(LR(X)) is the original score, µ_I is the estimated mean of s and σ_I is the standard deviation of s.
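The normalization s̃ = (s − µ_I)/σ_I is a simple standardization. A sketch, assuming µ_I and σ_I are estimated from a set of reference scores for the claimed model (for example impostor scores, as in Z-norm); which scores were used is not stated in the slides.

```python
import numpy as np


def normalize_score(s, reference_scores):
    """(s - mu_I) / sigma_I, with mu_I and sigma_I estimated from reference_scores."""
    ref = np.asarray(reference_scores, dtype=float)
    return (s - ref.mean()) / ref.std()


z = normalize_score(3.0, [1.0, 3.0])  # mu_I = 2, sigma_I = 1
```

After normalization, scores from different speaker models share a common scale, so a single decision threshold can be applied across speakers.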

Page 32: Final project new


PERFORMANCE EVALUATION

NIST has conducted speaker recognition benchmarking activities on an annual basis since 1997, and has provided speech files as development data.

NIST 2003 data:
• Testing speech data: 2559
• Training speech data: 356
• UBM female speech data: 251
• UBM male speech data: 251

Page 33: Final project new

Development of BASELINE SPEAKER VERIFICATION SYSTEM

For the baseline speaker verification system, the following parameters are used:
• VAD: energy-based VAD (threshold 0.6 × average energy)
• Feature vector: 13-dimensional MFCC appended with delta and delta-delta (39 dimensions in total)
• Modeling: GMM
• GMM size: 8, 16, 32, 64
• Comparison: log-likelihood score

Page 34: Final project new


Flowchart of Baseline Speaker Recognition System

Page 35: Final project new


DET PLOT FOR TEST 15 SEC AND TRAIN 15 SEC

Page 36: Final project new


DET PLOT FOR TEST FULL AND TRAIN 15 SEC

Page 37: Final project new


DET PLOT FOR TEST 15 SEC AND TRAIN FULL

Page 38: Final project new


DET PLOT FOR TEST FULL AND TRAIN FULL

Page 39: Final project new

Comparison of training data models by Equal Error Rate (EER, %)

GMM size | Test 15 sec, Train 15 sec | Test full, Train 15 sec | Test 15 sec, Train full | Test full, Train full
8 | 34.90 | 34.24 | 33.18 | 27.70
16 | 33.05 | 32.28 | 30.50 | 25.67
32 | 32.46 | 32.94 | 28.78 | 23.67
64 | 32.82 | 33.06 | 27.42 | 22.05
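The EER is the operating point on the DET curve where the miss (false rejection) rate equals the false-alarm (false acceptance) rate. A minimal sketch that sweeps the threshold over the observed scores and returns the average of the two rates where they are closest; interpolating between thresholds would give a slightly more precise value. The function name and toy scores are illustrative.

```python
import numpy as np


def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER (%) by sweeping the threshold over all observed scores."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best = None
    for th in np.unique(np.concatenate([target, impostor])):
        frr = np.mean(target < th)      # miss rate: true speakers rejected
        far = np.mean(impostor >= th)   # false-alarm rate: impostors accepted
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0)
    return 100.0 * best[1]


eer = equal_error_rate([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0])
```

Lower EER means better separation between target and impostor scores; perfectly separated scores give an EER of 0%.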

Page 40: Final project new


Conclusion

Performance is more sensitive to the amount of training data than to the amount of test data.

Page 41: Final project new


Future Plan

Synthetically generating training and testing speech from limited speech data.

Validating the results on state-of-the-art i-vector based speaker verification system.

Page 42: Final project new

Thank you