
Development of a Speaker Recognition Solution in Vidispine

Karen Farnes

May 23, 2013

Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Henrik Björklund
Supervisor at Codemill: Thomas Knutsson
Examiner: Frank Drewes

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN


Abstract

A video database contains an enormous amount of information. In order to search through the database, metadata can be attached to each video. One such type of metadata can be labels containing speakers and where they are speaking. With the help of speaker recognition, this type of metadata can automatically be assigned to each video. In this thesis a speaker recognition plug-in for Vidispine, an API media asset management platform, is presented. The plug-in was developed with the help of the LIUM SpkDiarization toolkit for speaker diarization and ALIZE/LIA_RAL for speaker identification.

The choice of using the GMM-UBM method that ALIZE/LIA_RAL offers was made through an in-depth theoretical study of different identification methods. The in-depth study is presented in its own chapter. The goal was for the plug-in to reach an identification rate of 85%. Unfortunately, the results came in as low as 63%. Among the issues the plug-in faces, its low performance on female speakers was shown to be crucial.


Contents

1 Introduction
  1.1 Thesis Structure

2 Background
  2.1 Speech Parameterization
  2.2 Gaussian Mixture Models
  2.3 Hidden Markov Model
  2.4 Speaker Diarization
    2.4.1 Voice Activity Detection
    2.4.2 Segmentation
    2.4.3 Clustering
  2.5 Speaker Identification
    2.5.1 Closed- and Open-Set
    2.5.2 Supervision
  2.6 Vidispine

3 Problem Description
  3.1 Problem Statement
  3.2 Goal and Purpose
  3.3 Methods
  3.4 Related Work

4 Comparison of Speaker Identification Methods
  4.1 Gaussian Mixture Model
    4.1.1 EM-ML Algorithm
    4.1.2 Identification
  4.2 Vector Quantization
    4.2.1 Codebook Construction
  4.3 Support Vector Machine
  4.4 Artificial Neural Network
    4.4.1 Backpropagation
  4.5 Modern Additions
    4.5.1 GMM-UBM
  4.6 Comparison
    4.6.1 Evaluation

5 Vidispine Plug-in
  5.1 Implementation Choices
  5.2 Vidispine
  5.3 Speaker Diarization
    5.3.1 Preprocessing
    5.3.2 Voice Activity Detection
    5.3.3 BIC Segmentation
    5.3.4 Linear Clustering
    5.3.5 Hierarchical Clustering
    5.3.6 Initialize GMM
    5.3.7 EM Computation
    5.3.8 Viterbi Decoding
    5.3.9 Adjustment of Segment Boundaries
    5.3.10 Speech/non-speech Filtering
    5.3.11 Splitting of Long Segments
    5.3.12 Gender and Bandwidth Selection
    5.3.13 Final Clustering
    5.3.14 Postprocessing
  5.4 Speaker Identification
    5.4.1 Preprocessing
    5.4.2 Training
    5.4.3 Testing
    5.4.4 Postprocessing

6 Results
  6.1 Evaluation Methods
    6.1.1 Testing and Training Set
    6.1.2 Errors
    6.1.3 Metrics
  6.2 Testing Results
    6.2.1 DET Curve and EER
    6.2.2 Identification Rate
  6.3 Discussion
    6.3.1 Gender Difference
    6.3.2 Clustering
    6.3.3 Metrics
    6.3.4 Test Data and Filtering
    6.3.5 Possible Improvements

7 Conclusions
  7.1 Present Work
  7.2 Future Work

8 Acknowledgements

References


List of Figures

2.1 The result obtained by performing speaker diarization
2.2 Illustration of speaker identification for an open-set database
4.1 Conceptual presentation of two-dimensional VQ matching for a single speaker
4.2 The principles behind SVM
4.3 An MLP with inputs x, a single hidden layer and an output y
5.1 The use of LIUM for speaker diarization
5.2 Illustration of the speaker identification process
6.1 DET curve
6.2 FA- and ME-probability in relationship to the threshold, together with the number of SegE for the whole dataset

List of Tables

6.1 Threshold and EER for the dataset
6.2 Overview of the distribution of the errors
6.3 Identification and segment accuracy

Chapter 1

Introduction

When humans listen to other people talking, they automatically extract information about what they are hearing. Not only do they understand what is being said, but also who is speaking and in what language. Making software that can do the same automatically is not a simple task. These are examples of three types of information that can be extracted from a speech signal, and they lead to three recognition fields:

Speech recognition: Extraction of words, to answer “What is being said?”.

Language recognition: Recognition of language, to answer "Which language is spoken?".

Speaker recognition: Recognition of speakers, to answer "Who is speaking?".

Of these three types of automatic recognition, this thesis will focus on the last, speaker recognition.

Speaker recognition is the task of automatically identifying who is speaking. It can be divided into two tasks: speaker verification and speaker identification. The former seeks to verify a speaker against a stated speaker model, in what can be called a one-to-one comparison. It is often used to check whether a person speaking really is who he/she claims to be. Speaker verification is therefore commonly studied with regard to the application area of biometric identification. This thesis will however focus on speaker identification. The goal of this task is more general: simply to find out who is speaking. A speech segment is therefore compared with a database of speaker models, a so-called one-to-many comparison. To obtain this database there must exist recorded speech from all the possible speakers in the database [10].

Automatically recognizing speakers is in itself a fascinating ability, but for what purposes can it be used? As mentioned, the simpler task of speaker verification is mainly used for authentication. More specifically, one can imagine using it to get access to devices, facilities and web pages. Another area for speaker recognition technology is forensics, where the goal can be to prove that a specific person said something or was at a certain location. The area that this thesis will focus on is information structuring. For example, one may wish to have an archive of videos and then be able to search through them by speaker. These types of applications are often useful in the media industry. The purpose might be to build a list of actors and where they are speaking in a movie, or to identify speakers in recorded or online meetings.

To achieve high accuracy, one can imagine that it would be convenient to record specific words, and in that way recognize speakers by how they pronounce these words. In many cases, though, this type of data is not available; for example, one may want to recognize a person who does not know he/she is being identified. Therefore, a more flexible technique for performing speaker recognition is often used. Recognition that does not rely on specific words is called text-independent, and this is the type of recognition that will be in focus in this thesis [3].

A speaker recognition system is often divided into two separate phases that are performed sequentially. The first phase is speaker diarization, a form of preprocessing of the sound file in which the segments where a speaker speaks are identified. The output from speaker diarization is a set of labels stating in which parts of the sound file different unknown speakers are speaking. This can then be sent into a speaker identification module that identifies who these unknown speakers are. Speaker diarization and speaker identification are often treated as two separate research fields.

1.1 Thesis Structure

Chapter 2 gives some background theory on the speaker recognition field, including descriptions of the two phases, speaker diarization and speaker identification. Chapter 3 describes the thesis project, including a problem statement and goals. After this, Chapter 4 presents a set of speaker identification methods together with a comparison of these methods; this chapter is part of the in-depth study for this thesis. A description of the implementation of a prototype for a speaker recognition system can be found in Chapter 5. The methods for evaluating the prototype are explained in Chapter 6, together with the results and a discussion of them. Lastly, Chapter 7 summarizes the work done in this thesis together with suggestions for future work. This chapter is followed by the bibliography.


Chapter 2

Background

This chapter gives some theoretical background in the field of speaker recognition. How speech is represented and a possible model for representing a speaker's speech are presented, together with more detailed information about speaker diarization and speaker identification.

2.1 Speech Parameterization

In order to analyze a speech signal with a statistical model and represent it in a compact manner, feature vectors are extracted from the speech signal. The aim is to capture the characteristic patterns in a speaker's speech and then produce a model that can represent these characteristics. Though there exist multiple types of feature representations, the most commonly used is Mel-frequency cepstral coefficients (MFCC) [3]. Feature extraction with MFCC consists of a set of steps, which will now be presented.

High frequencies are often attenuated during the production of speech in the throat; a filter is therefore applied to amplify these frequencies. Then, a window smaller than the whole signal is placed at the start of the signal and repeatedly shifted by a given time interval until it reaches the end of the signal, and the Fast Fourier Transform (FFT) is applied to each window. To emphasize the lower frequency components, mel-scale frequency warping is done. This includes multiplying the spectrum with a filterbank, so that the frequencies are transferred into the mel-frequency scale. This scale is similar to the frequencies perceived by the human ear. In the filterbank each filter has a triangular bandpass frequency response. Applying a filter for each component in the mel-frequency scale produces the average spectrum around each center frequency with increasing bandwidth. The formula to convert the linear frequency f_{lin} into the mel scale is as follows [23]:

    f_{mel} = 2595 \cdot \log_{10}\left(1 + \frac{f_{lin}}{700}\right)    (2.1)


After this, the logarithm of the spectral envelope is computed to acquire the spectral envelope in dB. Finally, the discrete cosine transform is applied; it is defined as:

    c_n = \sum_{k=1}^{K} S_k \cdot \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \quad n = 1, 2, \ldots, L    (2.2)

where K is the number of log-spectral coefficients, S_k are the log-spectral coefficients and L is the number of cepstral coefficients to be calculated (bounded by L ≤ K).
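The two transforms above are straightforward to compute directly. The following C++ sketch implements Equations 2.1 and 2.2, assuming the log-spectral coefficients S_k have already been computed by the filterbank step; all names are illustrative.

    #include <cmath>
    #include <vector>

    // Converts a linear frequency in Hz to the mel scale (Equation 2.1).
    double hzToMel(double fLin) {
        return 2595.0 * std::log10(1.0 + fLin / 700.0);
    }

    // Computes L cepstral coefficients c_1..c_L from the K log-spectral
    // coefficients S_1..S_K via the discrete cosine transform (Equation 2.2).
    std::vector<double> cepstrum(const std::vector<double>& S, std::size_t L) {
        const std::size_t K = S.size();
        std::vector<double> c(L);
        for (std::size_t n = 1; n <= L; ++n) {
            double sum = 0.0;
            for (std::size_t k = 1; k <= K; ++k)
                sum += S[k - 1] * std::cos(n * (k - 0.5) * M_PI / K);
            c[n - 1] = sum;
        }
        return c;
    }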

2.2 Gaussian Mixture Models

A common approach in text-independent speaker recognition is to use a Gaussian Mixture Model (GMM) to model each speaker. A model consists of a multi-modal Gaussian probability distribution, which in other words means that it is a weighted sum of M component densities, and can be defined as:

    p(\vec{x}|\lambda) = \sum_{i=1}^{M} w_i p_i(\vec{x})    (2.3)

where \vec{x} is a D-dimensional feature vector and the w_i are the mixture weights, with the constraint that \sum_{i=1}^{M} w_i = 1. Each p_i(\vec{x}) is a uni-modal Gaussian density of the form:

    p_i(\vec{x}) = \frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}} e^{-\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T \Sigma_i^{-1} (\vec{x}-\vec{\mu}_i)}    (2.4)

When GMMs are used for speaker identification, each speaker to be identified is represented by a GMM. A speaker's model, and its parameters, is referred to as \lambda = \{w_i, \vec{\mu}_i, \Sigma_i\}, i = 1, \ldots, M. GMM is more thoroughly described in Section 4.1.
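As a concrete illustration, the sketch below evaluates Equation 2.3 for the diagonal-covariance case that is most common in practice; the data layout is an assumption made for the example.

    #include <cmath>
    #include <vector>

    // One Gaussian component of the mixture, with diagonal covariance.
    struct Component {
        double weight;              // mixture weight w_i
        std::vector<double> mean;   // mu_i, D-dimensional
        std::vector<double> var;    // diagonal of Sigma_i, D-dimensional
    };

    // Evaluates p(x|lambda) of Equation 2.3 for a diagonal-covariance GMM,
    // where each component density follows Equation 2.4.
    double gmmDensity(const std::vector<Component>& lambda,
                      const std::vector<double>& x) {
        double p = 0.0;
        for (const Component& c : lambda) {
            double logDet = 0.0, quad = 0.0;
            for (std::size_t d = 0; d < x.size(); ++d) {
                logDet += std::log(c.var[d]);
                const double diff = x[d] - c.mean[d];
                quad += diff * diff / c.var[d];
            }
            const double logP =
                -0.5 * (x.size() * std::log(2.0 * M_PI) + logDet + quad);
            p += c.weight * std::exp(logP);
        }
        return p;
    }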

2.3 Hidden Markov Model

Another approach to speaker modeling is the Hidden Markov Model (HMM). HMMs can be trained to model phrases or phonemes, and because of this, HMMs are most commonly used in text-dependent speaker recognition. An HMM is a type of finite state machine with:

– A set of hidden states Q

– An output alphabet/observations O

– Transition probabilities A

– Output emission probabilities B


– Initial state probabilities Π

The system is only in one state at a time, and this state is not observable, but the output produced from each state according to B is accessible. Q and O are fixed, and therefore the parameters of the HMM can be defined as Λ = {A, B, Π} [19].
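A direct translation of these parameters into code could look as follows; the matrix layout is an illustrative assumption.

    #include <vector>

    // Parameters Lambda = {A, B, Pi} of an HMM with N hidden states and an
    // output alphabet of M symbols (Q and O are fixed by N and M).
    struct Hmm {
        std::vector<std::vector<double>> A;  // A[i][j]: transition probability i -> j
        std::vector<std::vector<double>> B;  // B[i][k]: probability of emitting symbol k in state i
        std::vector<double> Pi;              // Pi[i]: initial state probability
    };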

2.4 Speaker Diarization

Before starting to identify who is speaking, some preprocessing has to be done on the audio file used as input. The goal of speaker diarization is to divide the audio file into segments containing speech and to cluster together those segments that belong to the same speaker. In short, the task is to answer "Who is speaking when?". Speaker diarization is a two-step task, where the first step is segmentation and the second is clustering. Figure 2.1 illustrates the result of speaker diarization.

Figure 2.1: The result obtained by performing speaker diarization

2.4.1 Voice Activity Detection

In order to analyze which person is speaking, the parts of the audio file that include speech need to be extracted; this process is often referred to as Voice Activity Detection (VAD). There exist multiple techniques for extracting speech, including measuring frame energy and using GMMs. The first method is only able to detect speech/non-speech, while with a method like GMM it would be possible to add categories like laughter, music and background noise. This thesis will primarily be concerned with speech/non-speech detection.

2.4.2 Segmentation

In order to assess where there is a change of speaker, the segmentation looks for change-points in the audio. Segmentation methods generally fall into two main categories:

Metric based: Determines whether two acoustic segments originate from the same speaker by computing the distance between them. Metric based methods treat a speaker's speech as having a Gaussian distribution.

Model based: Models are trained, with the help of supervision, to recognize speakers. The models are then used to estimate where there are change-points in the audio file. In practice, the segmentation task becomes an identification task, where the models often are GMMs.

Metric based segmentation is the most popular type of method and is oftenfavored because no prior knowledge is needed [1, 31]. One of these metricbased methods is Bayesian Information Criterion (BIC) segmentation.

Bayesian Information Criterion

BIC segmentation is a type of metric-based segmentation. BIC is what is called a penalized maximum likelihood model selection criterion. If there exists an acoustic segment X, how well a model M fits the data is represented by a BIC value, which is defined as:

    BIC(M) = \log L(X|M) - \frac{\lambda}{2} \#(M) \log N    (2.5)

where \log L(X|M) is the log-likelihood, \lambda is a penalty weight usually set to 1, N is the number of frames and \#(M) is the number of parameters in M. When detecting change-points, a window is initialized and continuously expanded while continuously looking for change-points. The search for a change-point at sample i is really a comparison between two models, one with two Gaussians and another with just one. The difference between these models is expressed as

    \Delta BIC(i) = R(i) - \lambda P    (2.6)

where R(i) is the likelihood ratio, defined as:

    R(i) = N \log|\Sigma| - N_1 \log|\Sigma_1| - N_2 \log|\Sigma_2|    (2.7)

where the \Sigma s are sample covariance matrices from all the data, \{x_1, \ldots, x_i\} and \{x_{i+1}, \ldots, x_N\}, and the penalty, where d is the feature dimension, is expressed as:

    P = \frac{1}{2}\left(d + \frac{1}{2}d(d+1)\right)\log N    (2.8)

If \Delta BIC(i) is positive, the model with two Gaussians fits the data best, and the sample represents a change-point [5].
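The test of Equations 2.6-2.8 reduces to a few arithmetic operations once the covariance statistics of the window are known. A minimal sketch, assuming the log-determinants of the sample covariance matrices have already been computed:

    #include <cmath>
    #include <cstddef>

    // Delta-BIC(i) for a candidate change-point (Equations 2.6-2.8).
    // logDetAll, logDet1, logDet2 are log|Sigma| for the whole window and
    // its two halves; n, n1, n2 are the corresponding frame counts; d is
    // the feature dimension and lambda the penalty weight (usually 1).
    double deltaBic(double logDetAll, double logDet1, double logDet2,
                    std::size_t n, std::size_t n1, std::size_t n2,
                    std::size_t d, double lambda = 1.0) {
        // Likelihood ratio R(i) of Equation 2.7.
        const double r = n * logDetAll - n1 * logDet1 - n2 * logDet2;
        // Penalty P of Equation 2.8.
        const double p =
            0.5 * (d + 0.5 * d * (d + 1)) * std::log(static_cast<double>(n));
        return r - lambda * p;  // positive => change-point at sample i
    }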

2.4.3 Clustering

When clustering segments together, the similarity between segments is examined and a hierarchy of clusters is constructed. Two main approaches to building such a hierarchy exist:


Agglomerative: This is a bottom-up approach. Initially there is a single cluster for each segment, and clusters are iteratively merged together until the set of clusters reaches an optimal size.

Divisive: This is a top-down approach. In the beginning all segments are stored in one big cluster. A new speaker is recursively introduced and the cluster is split into smaller and smaller clusters.

Because agglomerative clustering offers a good balance between structural simplicity and performance, it is the most commonly used approach [7]; a minimal sketch of the merge loop is given below.
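In the sketch, the merge criterion (e.g. a Delta-BIC value, as in Section 2.4.2) is passed in as a function; all names are illustrative.

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct Cluster { std::vector<int> segmentIds; };

    // Agglomerative clustering: start with one cluster per segment and
    // repeatedly merge the best pair until no merge improves the criterion.
    // mergeGain(a, b) should be positive when merging a and b is beneficial.
    void agglomerate(std::vector<Cluster>& clusters,
                     const std::function<double(const Cluster&,
                                                const Cluster&)>& mergeGain) {
        while (clusters.size() > 1) {
            std::size_t bi = 0, bj = 0;
            double best = 0.0;
            for (std::size_t i = 0; i < clusters.size(); ++i)
                for (std::size_t j = i + 1; j < clusters.size(); ++j) {
                    const double g = mergeGain(clusters[i], clusters[j]);
                    if (g > best) { best = g; bi = i; bj = j; }
                }
            if (best <= 0.0) break;  // the set of clusters is optimal
            // Merge cluster bj into bi and remove bj.
            clusters[bi].segmentIds.insert(clusters[bi].segmentIds.end(),
                                           clusters[bj].segmentIds.begin(),
                                           clusters[bj].segmentIds.end());
            clusters.erase(clusters.begin() +
                           static_cast<std::ptrdiff_t>(bj));
        }
    }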

In some ways, clustering is a simple form of identification. However, clustering differs from speaker identification because clustering methods are performed without prior knowledge of the speakers in the audio file.

2.5 Speaker Identification

The next phase in speaker recognition is to identify which cluster corresponds to which speaker. The phase itself consists of two sub-phases, the first being the training phase and the second being the testing phase. The training phase involves training a model for each speaker.

Though the exact definition of what a model consists of depends on the methods used, the main idea is to have a set of parameters that can represent the characteristics of a speaker's speech. A description of different methods for extracting such a model can be found in Chapter 4. The training phase can be done independently from a recognition system. If a set of speaker models is trained and a database of these models is made, the database can be incorporated as existing data in the recognition system. Functionality for adding new speakers may of course be added to such a system.

The testing phase takes data from a test speaker and compares this data to each model in the database. In this way the identity of the speaker in each cluster can be determined.

2.5.1 Closed- and Open-Set

Speaker identification can be performed on a closed or an open set of speakers. If the set is closed, it is assumed that all the speakers being identified belong to the database. However, if the set is open, a decision about whether or not a speaker belongs to the database needs to be made. When dealing with an open-set speaker recognition system, the identification task consists of both a scoring phase and a verification phase, as illustrated in Figure 2.2. A sample of the speaker is compared with the models of the speakers in the database, giving each speaker a score based on how well the sample and the model match. Using a decision logic, the model with the best score is chosen. The verification phase is then entered, and a decision is made about whether or not the speaker belongs in the database. This is in most cases decided with the help of a threshold, and the decision to label the speaker as known or unknown is based on which side of the threshold the score lies [24, 18].

Figure 2.2: Illustration of speaker identification for an open-set database
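In code, the decision logic just described reduces to a max search over the scores followed by a threshold test. A minimal sketch, with the scoring function left abstract since it depends on the modeling method; all names are illustrative.

    #include <functional>
    #include <string>
    #include <vector>

    struct SpeakerModel { std::string name; /* model parameters */ };

    // Open-set identification: pick the best-scoring model, then verify the
    // score against a threshold; below the threshold the speaker is unknown.
    std::string identifyOpenSet(
        const std::vector<SpeakerModel>& database,
        const std::function<double(const SpeakerModel&)>& score,
        double threshold) {
        const SpeakerModel* best = nullptr;
        double bestScore = 0.0;
        for (const SpeakerModel& m : database) {
            const double s = score(m);                 // scoring phase
            if (best == nullptr || s > bestScore) { bestScore = s; best = &m; }
        }
        if (best == nullptr || bestScore < threshold)  // verification phase
            return "unknown";
        return best->name;
    }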

2.5.2 Supervision

Identification methods are often divided into supervised and unsupervised. In supervised methods, the identification models are trained using data from all the speakers to be identified. These methods often model a speaker by the difference between that specific speaker and all other speakers, and because of this the data needs to be labeled in some way. In this way the identification method can know which portion of the data belongs to the speaker being modeled and which belongs to the "others". The labels need to be assigned by a supervisor. In unsupervised methods, each identification model is trained using data from only the speaker it represents; labeling is therefore not necessary.

2.6 Vidispine

Vidispine is an API media asset management platform, which among other things has support for file management, metadata management and media conversion. The metadata management allows Vidispine to offer advanced search over a media database. To facilitate the metadata management, Vidispine offers the ability to automatically populate the metadata for a video through a set of plug-ins that perform automatic recognition.


Chapter 3

Problem Description

The goal of this project is to develop a robust solution for speaker recognition integrated into Vidispine (as a plug-in). The focus of this thesis will be on the postprocessing of audio streams in order to extract the identity of the speaker from a segment, typically from a recorded meeting, video conference or TV-debate.

The result should include a demo that shows how it is possible to annotate the media asset with metadata from the speaker recognition solution. Existing libraries that could form the foundation for this solution should be evaluated.

The solution will handle the following, possibly individual tasks:

– The speaker segmentation task is to divide an audio stream into segments, each with a single speaker.

– The speaker clustering task is to group the segments with the same speaker together.

– The speaker identification task is to determine which speaker out of a group produced the voice segments in a cluster.

– The learning task is where the system learns how to identify a specific speaker.

Limitations

– The solution only needs to handle a limited group of known possible speakers; in other words, the number of speakers and which voices are used in a specific audio stream are known.

– Learning could be a manual task.


3.1 Problem Statement

Is it possible to produce a speaker identification system with an 85% success rate for a set of 12 speakers where the gender division is fairly equal? The test data should be a TV-debate, and unknown speakers should occur.

3.2 Goal and Purpose

The goal of this project is to develop a robust solution for speaker recognition integrated into Vidispine (as a plug-in). When speaker recognition is integrated into Vidispine, the system can allow searching for time labels stating where a specific person is speaking in a video. Furthermore, combined with speech recognition, the plug-in can be a powerful tool that allows searching for where a certain person says a specific phrase.

There should also be a theoretical comparison of different speaker identification methods, and an evaluation of which method is best suited for this application. The purpose of this research is to examine the field of speaker recognition and the methods used for speaker identification. The theoretical research should take into account that the implementation will be used as a plug-in for Vidispine. The project is done in cooperation with, and at the request of, the IT consultancy firm Codemill, located in Umeå.

3.3 Methods

To achieve the goals set for the thesis, the following methods should be applied.

Literature Research: A general study of the theoretical background of speaker recognition should be performed. Additionally, speaker identification should be studied more extensively.

Evaluation: Discussion and evaluation of the speaker identification methods, together with an evaluation of the existing software libraries that may be relevant for the implementation of these methods.

Design: Planning and designing the implementation of the plug-in and the evaluation methods used for testing.

Implementation: Programming a prototype of the plug-in in C++, as this is the programming language used in Vidispine. The prototype functionality needs to be tested and refined.

Testing: Writing an evaluation program that can test the prototype against a manually acquired correct speaker-label file.


3.4 Related Work

Codemill has two other students working on similar projects connected to video information analysis, namely speech and face recognition. The goal is to combine these three applications into one powerful video analyzer and search tool.

Examples of articles that focus on speaker identification in the context of broadcast and video analysis are [12] and [2].


Chapter 4

Comparison of Speaker Identification Methods

As part of the in-depth study for this thesis, four different speaker identification methods are presented. The chapter ends with a comparison of the methods, together with a discussion about which method is best suited for the problem presented in this thesis.

The methods to be analyzed are:

– Gaussian Mixture Model

– Vector Quantization

– Support Vector Machine

– Artificial Neural Network

The methods chosen are all classic and basic approaches to speaker identification. Some studies show that combining methods leads to better performance, but since the main goal of this in-depth study is to go deeper into the methods rather than seeking the ultimate result, the methods are studied separately. After the description of these methods, a brief presentation of the most modern approaches is given.

4.1 Gaussian Mixture Model

The Gaussian Mixture Model (GMM) was introduced in Section 2.2 and is discussed in more detail in this section. Each speaker in the database is represented by a GMM λ. When training the GMMs, the goal is to estimate the parameters of λ. The Expectation-Maximization (EM) Maximum Likelihood (ML) algorithm is the most popular method for this task.


4.1.1 EM-ML Algorithm

The ML estimation calculates the model parameters that maximize the likelihood of the GMM, according to a set of training data. This likelihood is defined as:

    p(X|\lambda) = \prod_{t=1}^{T} p(\vec{x}_t|\lambda)    (4.1)

where T is the number of training vectors in X = \{\vec{x}_1, \ldots, \vec{x}_T\}.

The ML parameter estimates are obtained by applying the iterative EM algorithm. Each EM iteration begins with an initial model \lambda and ends with a new estimated model \bar{\lambda} such that p(X|\bar{\lambda}) \geq p(X|\lambda). The new model becomes the initial model for the next iteration. The process continues in this fashion until the maximum is found with an accuracy within a given threshold [25].

4.1.2 Identification

To identify who is speaking, a search for the \lambda that maximizes the probability is performed. The identification rule for a set of S speakers and a set of observations X = \{\vec{x}_1, \ldots, \vec{x}_T\} from a speaker (e.g. extracted from a sound file) can be defined as:

    \hat{S} = \arg\max_{1 \leq k \leq S} \sum_{t=1}^{T} \log p(\vec{x}_t|\lambda_k)    (4.2)

where p(\vec{x}_t|\lambda_k) was defined in Equation 2.3 [25, 3].

4.2 Vector Quantization

Vector quantization (VQ) originates from the 1980s and is a simple and easily implemented model for text-independent speaker identification [32]. Originally intended for lossy data compression, VQ is in speaker modeling used to represent a large amount of data as a small set of points. This is done by constructing, from training data, a VQ codebook for each speaker, containing a small number of entries. These entries are code vectors C = \{c_1, \ldots, c_K\}, and each code vector is a representation of a cluster (not to be confused with the audio clusters) formed by clustering the feature vectors of a speaker. The code vector is the centroid of the cluster. This approach is called clustering.

The average distortion function is calculated to match two speakers:

    D_Q(A, C) = \frac{1}{T} \sum_{i=1}^{T} \min_{1 \leq j \leq K} d(a_i, c_j)    (4.3)

given a set of T feature vectors A = \{a_1, \ldots, a_T\} from an unknown speaker, where d is a distance measure between a vector and the closest code vector in the codebook. The smaller the value of D_Q is, the higher the likelihood that these feature vectors originate from the same speaker. This means that when performing VQ for closed-set speaker recognition, the speaker whose codebook generates the smallest D_Q is chosen. A visual presentation of the concept of VQ is shown in Figure 4.1 [32, 8].

Figure 4.1: Conceptual presentation of two-dimensional VQ matching for asingle speaker
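A sketch of Equation 4.3 with squared Euclidean distance as the measure d (the choice of distance is an assumption made for the example):

    #include <algorithm>
    #include <limits>
    #include <vector>

    using Vec = std::vector<double>;

    // Squared Euclidean distance between a feature vector and a code vector.
    double dist2(const Vec& a, const Vec& c) {
        double s = 0.0;
        for (std::size_t d = 0; d < a.size(); ++d) {
            const double diff = a[d] - c[d];
            s += diff * diff;
        }
        return s;
    }

    // Average distortion D_Q(A, C) of Equation 4.3: the mean distance from
    // each feature vector in A to its closest code vector in the codebook C.
    double avgDistortion(const std::vector<Vec>& A, const std::vector<Vec>& C) {
        double total = 0.0;
        for (const Vec& a : A) {
            double best = std::numeric_limits<double>::max();
            for (const Vec& c : C) best = std::min(best, dist2(a, c));
            total += best;
        }
        return total / A.size();
    }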

4.2.1 Codebook Construction

For each speaker in the database, a VQ codebook is constructed. Though [15] concludes that the crucial point in generating a codebook is not the method used for clustering but the size of the codebook, a well-known algorithm is presented here, namely the LBG algorithm. The algorithm is named after its authors Y. Linde, A. Buzo and R. Gray, who presented their work in [17]. The algorithm can be described as a split-and-cluster algorithm, because a codebook is generated by, in each iteration, splitting the code vectors in two and then re-clustering all the new code vectors [35]. An overview of the algorithm is presented in Algorithm 1.

4.3 Support Vector Machine

Support vector machine (SVM) is a powerful and robust method for speaker identification. The idea behind the method is to create a hyperplane that can divide two speakers from each other with as large a margin as possible. When creating an SVM, the process starts with a given set of training vectors for two classes. These training vectors are labeled with 1 or -1 by a supervisor. With the use of this training data, a hyperplane is created in such a manner that the margin between the hyperplane and the training vectors is maximized. The vectors that define how big this margin is are called support vectors.


Data: A set of training vectors
Result: A codebook of size N
Initialize a codebook C_0 with a single centroid generated by taking the average of the training vectors;
N = desired codebook size;
ξ = threshold;
M = 1;
while codebook size M < N do
    Split each code vector in two, so that C_i = (1 + ε)C_i and C_j = (1 - ε)C_j, for a splitting parameter ε;
    M = 2M;
    while average distortion ≥ ξ do
        Assign every training vector to a cluster according to the centroids in the current codebook;
        Update the centroid of each cluster;
    end
end

Algorithm 1: The LBG algorithm for codebook construction

Methods that make decisions based on how features are divided by separating hyperplanes are called discriminative. Figure 4.2 shows the principles behind SVM.

The identification of a speaker is done by the function:

    f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b    (4.4)

where the x_i are support vectors derived from the training vectors, y_i \in \{-1, +1\} are the labels of the support vectors and \alpha_i are their weights. K(x, x_i) is the kernel function, defined as

    K(x, y) = \Phi(x)^T \Phi(y)    (4.5)

so that \Phi(x) maps from the input space to a kernel feature space of higher dimension. This is done because it is easier to find a linear hyperplane in this higher dimension [16, 14].
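As an illustration, the decision function of Equation 4.4 with a Gaussian (RBF) kernel, one common choice for K; the kernel choice and data layout are assumptions made for the example.

    #include <cmath>
    #include <vector>

    using Vec = std::vector<double>;

    // Gaussian (RBF) kernel, one common instance of K(x, y).
    double rbfKernel(const Vec& x, const Vec& y, double gamma) {
        double s = 0.0;
        for (std::size_t d = 0; d < x.size(); ++d) {
            const double diff = x[d] - y[d];
            s += diff * diff;
        }
        return std::exp(-gamma * s);
    }

    // SVM decision function f(x) of Equation 4.4: sv are the support
    // vectors, y[i] in {-1,+1} their labels and alpha[i] their weights.
    double svmDecision(const Vec& x, const std::vector<Vec>& sv,
                       const std::vector<int>& y,
                       const std::vector<double>& alpha,
                       double b, double gamma) {
        double f = b;
        for (std::size_t i = 0; i < sv.size(); ++i)
            f += alpha[i] * y[i] * rbfKernel(x, sv[i], gamma);
        return f;  // the sign of f gives the class
    }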

Since SVM is a two-class identifier, there exist methods for building multi-class SVMs. Two of the approaches are called one-against-all and one-against-one. The former constructs k SVM models for k speakers, where each model has a positive class representing a single speaker and a negative class representing all other speakers. The SVM that has the highest value of f(x) decides which class the speaker belongs to. The second approach, one-against-one, pairs up all speakers and constructs k(k-1)/2 SVM models. Each SVM is trained to model the difference between two speakers. Deciding which class a speaker belongs to can be done in multiple ways. For example, in a strategy called "max wins", each SVM is run and for each "win" a speaker gets, it receives a vote. The speaker class that receives the most votes in the end is the winner. Another strategy is a knock-out system, where the winner of each "fight" is paired up again for a new fight until there is one winner. For practical applications, the one-against-one method has been shown to be favorable [13, 11].

Figure 4.2: The principles behind SVM

4.4 Artificial Neural Network

Artificial Neural Network (ANN) is a method inspired by the way neurons operate in the human brain. The method works very well for problems where limited information about the problem and its input characteristics is known. Though there exist multiple ways to represent an ANN, the focus here is on a type of feed-forward network called the multilayer perceptron (MLP). The MLP architecture consists of an input layer, zero or more hidden layers and an output layer. The layers and their nodes are connected with weights [9]. Figure 4.3 gives an illustration of an MLP.

The usage of MLPs to identify a speaker is similar to the methods used for SVM. For k speakers, k(k-1)/2 MLPs are created, where each model decides between two speakers, and a strategy for selecting the ultimate winner has to be chosen. An existing alternative is to make k MLPs, where each model decides between a single speaker and an input representing all other speakers. Research by Rudasi and Zahorian (1991) concludes that using this type of binary-pair network gives better performance than constructing a single large network [28]. The single network especially suffers from long training times, which grow exponentially with the number of speakers. MLPs generally have problems with large populations. For example, the MLP often converges to local optima instead of the global one. Also, if the majority of inputs should yield 0 and only a small percentage yield 1, the positive answers are interpreted as deviations and the MLP learns that all inputs should yield 0 [22].


Figure 4.3: An MLP with inputs x, a single hidden layer and an output y

4.4.1 Backpropagation

There exist several methods for training an MLP. Here, a method called backpropagation is presented. If the goal is to make a binary network, it is possible to create an MLP where the inputs x_i are feature vectors from a speaker and the output layer has one node yielding an output y = \{0, 1\}. Each node consists of an input function and an output function. These functions can differ, but one example is to let the input function sum the total input as \sum_{j=1}^{N} w_{i,j} x_j for a node i and all its preceding nodes j, while the output activation function is a simple sigmoid function. When the network is being trained, the process consists of finding the set of weights that, to the highest degree possible, ensure that for a given input, the desired output is acquired [29].

Initially, all the weights are random values. For each set of test data, two phases are conducted: feed-forward and backpropagation [30]. The feed-forward phase feeds test data to the network, looks at the resulting output, and calculates the output error E = y - d, where d is the desired output. The backpropagation phase iterates backwards through the network, trying to correct the weights. The derivative of the activation function is used to predict how much the hidden layers affect the output. The weights leading to the output are updated according to

    w_{j,i} = w_{j,i} + \alpha\, a_j E g'(in_i)    (4.6)

where \alpha is the learning rate, a_j is the output from the activation function in hidden node j and g' is the derivative of the activation function. This is followed by backpropagation through the rest of the layers, updating the weights according to

    w_{k,j} = w_{k,j} + \alpha I_k \Delta_j    (4.7)

where I_k is the input to the weight and

    \Delta_j = g'(in_j) \sum_i w_{j,i} E_i g'(in_i).    (4.8)
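A condensed sketch of one training step for the binary MLP described above (one hidden layer, sigmoid activations, a single output); all names are illustrative. Note that with the sign convention E = y - d used in the text, gradient descent subtracts the update terms.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

    // One backpropagation step (Equations 4.6-4.8) for an MLP with a single
    // sigmoid output. wHid[j][k] connects input k to hidden node j; wOut[j]
    // connects hidden node j to the output. alpha is the learning rate.
    void trainStep(const std::vector<double>& x, double d, double alpha,
                   std::vector<std::vector<double>>& wHid,
                   std::vector<double>& wOut) {
        const std::size_t H = wHid.size();
        std::vector<double> aHid(H);
        double inOut = 0.0;

        // Feed-forward phase.
        for (std::size_t j = 0; j < H; ++j) {
            double inHid = 0.0;
            for (std::size_t k = 0; k < x.size(); ++k)
                inHid += wHid[j][k] * x[k];
            aHid[j] = sigmoid(inHid);
            inOut += wOut[j] * aHid[j];
        }
        const double y = sigmoid(inOut);
        const double E = y - d;             // output error
        const double gOut = y * (1.0 - y);  // g'(in) for a sigmoid output

        // Backpropagation phase: hidden-to-output weights (Equation 4.6),
        // then input-to-hidden weights (Equations 4.7 and 4.8).
        for (std::size_t j = 0; j < H; ++j) {
            const double deltaJ =
                aHid[j] * (1.0 - aHid[j]) * wOut[j] * E * gOut;
            wOut[j] -= alpha * aHid[j] * E * gOut;
            for (std::size_t k = 0; k < x.size(); ++k)
                wHid[j][k] -= alpha * x[k] * deltaJ;
        }
    }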

4.5 Modern Additions

This section presents some of the modern approaches to speaker identification.

Fusion: The method of combining pieces of information from several methods to enhance the full information [16].

Supervectors: Mapping a large set of low-dimensional vectors into a single high-dimensional vector. This is for example used with trained GMM speaker models, whose parameters are transformed into a supervector that is used as input to an SVM [33, 16].

GMM-UBM: A technique to make the GMM perform well for a small set of training data. Since each model is trained independently, a small amount of training data leads to a very sparse model [27]. The GMM-UBM therefore adds knowledge of the whole world of speakers to make a richer model. The method is further described in the following subsection.

4.5.1 GMM-UBM

Instead of creating a single independent model for each speaker, a speaker-independent Universal Background Model (UBM) \lambda_0 is constructed. The UBM is trained with data from all speakers. In order to derive a speaker-dependent model \lambda, Maximum A Posteriori (MAP) estimation is performed [27]. To get the score for an observation, the log-likelihood ratio between the speaker model and the UBM is computed:

    S(X'|\lambda, \lambda_0) = \log\frac{p(X'|\lambda)}{p(X'|\lambda_0)} = \frac{1}{T'}\sum_{t=1}^{T'} \log\frac{p(\vec{x}'_t|\lambda)}{p(\vec{x}'_t|\lambda_0)}    (4.9)

where X' = \{\vec{x}'_1, \ldots, \vec{x}'_{T'}\} is a set of feature vectors [34].
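A sketch of the scoring in Equation 4.9, with the two density functions left abstract (for instance the gmmDensity sketch from Section 2.2); all names are illustrative.

    #include <cmath>
    #include <functional>
    #include <vector>

    using Vec = std::vector<double>;
    // density(x) should return p(x|lambda) for one model.
    using Density = std::function<double(const Vec&)>;

    // Average log-likelihood ratio S(X'|lambda, lambda_0) of Equation 4.9
    // between a speaker model and the universal background model.
    double ubmScore(const std::vector<Vec>& X,
                    const Density& speaker, const Density& ubm) {
        double s = 0.0;
        for (const Vec& x : X)
            s += std::log(speaker(x)) - std::log(ubm(x));
        return s / X.size();
    }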

4.6 Comparison

As explained in Section 2.5.2, identification methods can either be supervised or unsupervised. Of the ones introduced in this chapter, the Support Vector Machine and the Artificial Neural Network are supervised, while the Gaussian Mixture Model and Vector Quantization are unsupervised. The supervised methods have the advantage that they create models defining the difference between a specific model and the rest of the database of models. This task demands a large set of training data and makes the methods more computationally heavy than the unsupervised methods. The unsupervised methods have the advantage that it is not necessary to reconfigure the whole database of models every time a new speaker is added. Also, the unsupervised methods benefit from demanding a smaller amount of training data, making them less computationally heavy. SVM and ANN are discriminative models, where a model describes the difference between speakers, while GMM and VQ are generative methods. Generative methods describe the distribution of the variation within a speaker's speech, where likelihoods and probabilities are used for making the identification decisions [16].

The main advantage of GMM is its proven high accuracy, and that it is computationally inexpensive, but it does require a sufficient amount of data to perform well [25, 26]. SVM has also been shown to have good performance, and the method makes it easy to determine a strict boundary between two separate classes [16]; on the other hand, the performance depends on a kernel function, which is not always easy to choose. This is in contrast to VQ, which has a very simple structure to implement; it is also computationally efficient and only needs a small set of training data. VQ is, however, less robust and is therefore easily affected by noisy data and the variability that occurs in speech. The discriminative function used in ANN makes it very powerful and the architecture is adaptable, but finding the ideal structure and number of hidden layers, together with the number of nodes in each layer, is difficult. Additionally, the performance decreases rapidly when the number of speakers in the database increases.

4.6.1 Evaluation

When evaluating which method to use in the application, a set of properties of the data used in the application needs to be considered. The considerations to be made can be summarized as follows:

– There is a limited amount of data

• When a new speaker is added to the database, the user should only need to add a small and convenient amount of training data. The amount of data for testing will vary, and the application should therefore be able to handle small amounts of test data.

– Only a limited number of different speakers are to be recognized, whichmeans that the database will be small.

• Although this thesis focuses on a given small set of speakers, it would be convenient if the system could easily add new speakers. The system should have the potential to support a larger database.

– The data originates from many different types of settings.


• There is little control over the number of microphones and the amount of noise. Therefore, the method should be as robust as possible.

– The goal is a robust solution with high accuracy.

Of these properties, robustness and high accuracy are the most important. VQ has both lower performance than the other methods and is easily affected by noise, so it can quickly be excluded from the best choices. The methods that fulfill the accuracy and robustness requirements best are SVM and GMM. GMM only performs well when given a sufficient amount of training data. On the other hand, the performance of SVM is highly affected by finding the right kernel function, which is similar to a problem with ANN. Choosing to implement ANN would mean that a lot of time is consumed by finding the right architecture. This is relevant because the work in this thesis is highly limited by time. Additionally, ANN has problems meeting the wish for an additional feature that allows the speaker database to be extended.

Based on this short summary, GMM and SVM seem like the best alternatives, even though some problems remain with both. GMM has the advantage that there exists an easy extension to the method in the form of GMM-UBM. This extension eliminates GMM's main problem concerning the amount of training data needed to ensure good performance. What is left is a method that offers high accuracy despite limited training data. SVM also holds these properties, but in comparison with GMM, SVM is more prone to noisy training data, making it a less robust method than GMM. In addition, SVM is more computationally heavy. GMM, and more specifically GMM-UBM, therefore becomes the favored alternative. Though it is less important to the reasoning in this chapter, the fact that GMM-UBM is easier to implement in the practical context of this thesis is a very positive property.


Chapter 5

Vidispine Plug-in

This chapter includes a description of how the prototype of the Vidispine plug-in is implemented, with focus on the software libraries that are used.

5.1 Implementation Choices

The prototype consists of two main modules, one for speaker diarization and one for speaker identification. Dividing the prototype into two separate modules in this way means that there is a clear division between the unsupervised part and the supervised part of the system. The speaker diarization prepares the audio file in such a manner that the speaker identification module can concentrate solely on recognizing the individual speakers. Speaker identification often requires a sufficient amount of data to perform well, and it is therefore an advantage that all the data belonging to the same speaker is clustered together. In that way, the identification can be performed on a larger set of data and hopefully achieve higher accuracy.

The plug-in is implemented as a text-independent speaker identification system for an open set. The application of the plug-in, as a video analyzer, clearly makes this a speaker identification problem. The first of two reasons for making the plug-in text-independent is the difficulty of obtaining recordings of all speakers in the database speaking specific words. The second reason is that a text-independent system is also language-independent. It is unclear for what language the plug-in will be used, but both Swedish and English are likely candidates. In practice, it is hard to have models for all speakers that may appear in a video, and it is for this reason that the set is chosen to be open. Open-set text-independent speaker identification is considered the hardest type of speaker recognition [12].


5.2 Vidispine

To perform speaker recognition, a user communicates with Vidispine by requesting that speaker recognition be performed on a specified video. Vidispine then converts the video file into a WAVE audio file with one channel and a sample rate of 16 kHz. This becomes the input to the speaker recognition system, whose first phase is speaker diarization. When the speaker recognition is finished, it converts the extracted information into a metadata format and updates the metadata in the Vidispine database.

5.3 Speaker Diarization

To implement speaker diarization, several open source tools were considered, among them LIUM SpkDiarization^1, ALIZE's LIA_RAL^2 package, SHoUT^3 and AudioSeg^4. A short summary of the author's impressions of the tools:

LIUM SpkDiarization: A set of tools written in Java, dedicated to speaker diarization and developed for use on TV shows. Because the tools are written in Java, it is necessary to call the JAR file through the command line in order to use them. The results are, however, satisfying after finding the right type of test data.

ALIZE/LIA_RAL: A high-level toolkit written in C++. Even though several articles state that they used ALIZE/LIA_RAL for their implementations, implementing something that produced satisfying results was not successful. Though time was a factor, the main problem was the lack of documentation.

AudioSeg: A toolkit devoted to audio segmentation and indexing, written in C. Since the tools were at such a high level, great flexibility was not available. With limited documentation, figuring out how to manipulate the results to become acceptable was hard. In addition, AudioSeg itself states that the purpose of the toolkit is to help prototyping.

SHoUT: A speech recognition toolkit written in C++, developed solely by Marijn Huijbregts during his Ph.D. work. The toolkit includes support for speech/non-speech detection and speaker diarization.

LIUM SpkDiarization (from here on referred to simply as LIUM) was chosen as the tool for implementing the speaker diarization module. ALIZE/LIA_RAL was excluded because of the lack of good documentation, and AudioSeg offered too little flexibility. Both LIUM and SHoUT have good documentation with an online wiki, but LIUM was preferred because it allows greater flexibility. The ability to manually tweak some of the parameters was seen as more important than using a library that can be linked directly into the program, in the way that a library written in the same programming language can. Figure 5.1 shows the process of using the tools in LIUM to perform speaker diarization [21].

1 http://www-lium.univ-lemans.fr/diarization/doku.php/welcome
2 http://mistral.univ-avignon.fr/index_en.html
3 http://shout-toolkit.sourceforge.net/index.html
4 audioseg.gforge.inria.fr/

5.3.1 Preprocessing

LIUM takes an audio file and a feature file as input. The input formats that LIUM accepts are Sphinx^5, SPro4^6, gzipped text and HTK^7. The chosen format for this task is Sphinx, which is the format that the examples in the documentation use. Before the audio file received from Vidispine can be sent to LIUM, some audio transformation needs to be done.

Parameterization

The first step is to transform the audio file into the .sph file type using SoX8. Then, features are extracted from the audio file into a feature file in Sphinx format. The tool sphinx_fe is used to produce a feature file with 13 coefficients of the MFCC type. A segmentation file with one big segment from frame 0 to n − 1, where n is the total number of frames in the audio file, is initialized. The parameterization step ends with safety checks on the feature file: the first check makes sure that the file is as long as it is supposed to be, while the second ensures that no series of multiple identical vectors exists. The initial segmentation and the checks are done using the LIUM tool MSegInit.
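A sketch of how the plug-in can drive this step as external processes is shown below. The SoX and sphinx_fe invocations and the LIUM option names are indicative of the tools' documented style rather than commands taken from the thesis, and the file-naming scheme is hypothetical.

import java.io.IOException;

// Hypothetical driver for the parameterization step: SoX converts the WAVE
// file to NIST Sphere, sphinx_fe extracts 13 MFCCs in Sphinx format, and
// LIUM's MSegInit writes the initial one-segment segmentation and checks
// the feature file. The option strings are indicative, not verified.
public final class Parameterization {
    public static void run(String show) throws IOException, InterruptedException {
        exec("sox", show + ".wav", show + ".sph");
        exec("sphinx_fe", "-i", show + ".sph", "-o", show + ".mfcc",
             "-nist", "yes", "-samprate", "16000", "-ncep", "13");
        exec("java", "-cp", "LIUM_SpkDiarization.jar",
             "fr.lium.spkDiarization.programs.MSegInit",
             "--fInputMask=" + show + ".mfcc",
             "--sOutputMask=" + show + ".i.seg", show);
    }

    private static void exec(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) throw new IOException(cmd[0] + " exited abnormally");
    }
}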

5.3.2 Voice Activity Detection

To segment away music and jingles, the LIUM tool Decode is used. This tool performs basic Viterbi decoding with an input of 8 one-state HMMs, where each state is represented by a GMM composed of 64 Gaussians with diagonal covariance. These GMMs have been trained by LIUM using EM-ML, with a feature vector consisting of 12 MFCC coefficients (C0 is removed) together with δ coefficients. The 8 HMMs represent silence (wide and narrow band), clean speech, speech over noise, speech over music, speech (narrow band), jingles and music.

5 http://cmusphinx.sourceforge.net/
6 http://www.irisa.fr/metiss/guig/spro/spro-4.0.1/spro_4.html
7 http://htk.eng.cam.ac.uk/
8 http://sox.sourceforge.net/


5.3.3 BIC Segmentation

BIC segmentation is performed using the LIUM tool MSeg. The tool does not take the output from the VAD as input, but rather does change detection on the audio file as a whole. The goal of this segmentation is to separate the audio file into “homogeneous” segments.

5.3.4 Linear Clustering

In order to merge segments that originate from the same speaker and lie side by side, linear clustering is run. This is done with the LIUM tool MClust with the option --cMethod=l, which makes the algorithm go from the start to the end of the audio file. The algorithm evaluates the similarity value on the diagonal of the similarity matrix of the Gaussian models. The clustering uses a BIC criterion similar to the one used for segmentation, but here only consecutive segments are considered.

5.3.5 Hierarchical Clustering

In order to perform agglomerative hierarchical clustering, the LIUM tool MClust with the option --cMethod=h is used. As described in Section 2.4.3, each segment starts out as a cluster, and clusters that are similar (according to BIC) are iteratively merged together until no pair of clusters i and j with ∆BICi,j > 0 remains.
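The merging loop itself is small. The sketch below implements the stopping rule just quoted, with deltaBic and merge supplied by the caller (they are assumptions standing in for the BIC computation and the cluster statistics, not LIUM code); following the text, a positive ∆BIC means the pair should be merged.

import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.ToDoubleBiFunction;

// Agglomerative hierarchical clustering skeleton: repeatedly merge the
// cluster pair with the largest positive Delta-BIC; stop when no pair
// scores above zero. The input list must be mutable.
public final class BicClustering {
    public static <C> List<C> run(List<C> cs,
                                  ToDoubleBiFunction<C, C> deltaBic,
                                  BinaryOperator<C> merge) {
        while (cs.size() > 1) {
            int bi = -1, bj = -1;
            double best = 0.0;
            for (int i = 0; i < cs.size(); i++)
                for (int j = i + 1; j < cs.size(); j++) {
                    double d = deltaBic.applyAsDouble(cs.get(i), cs.get(j));
                    if (d > best) { best = d; bi = i; bj = j; }
                }
            if (bi < 0) break;                     // no pair with Delta-BIC > 0 left
            C merged = merge.apply(cs.get(bi), cs.get(bj));
            cs.remove(bj);                         // remove the higher index first
            cs.set(bi, merged);
        }
        return cs;
    }
}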

5.3.6 Initialize GMM

For each segment a GMM is initialized, and this is done by running the LIUM tool MTrainInit.

5.3.7 EM Computation

To train the GMMs initialized in the previous step, the LIUM tool MTrainEM is run. The training is done on the segments of the clusters computed in the hierarchical clustering, and aims to estimate the GMM parameters λ through ML estimation using the EM algorithm.

5.3.8 Viterbi Decoding

A new segmentation is now generated by Viterbi decoding using the LIUM tool MDecode. The clustered segments from the hierarchical clustering are used as input together with the GMMs trained in the previous step. The Viterbi decoding models each cluster as an HMM with one state represented by a GMM, and finds the most likely sequence of hidden states given the HMM parameters and the observation sequence [19].
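For illustration, a textbook Viterbi decoder of the kind MDecode applies is sketched below (written from scratch, not LIUM's implementation). All probabilities are assumed to already be in the log domain.

// Viterbi decoding: given log initial, transition and per-frame emission
// probabilities, return the most likely sequence of hidden states.
public final class Viterbi {
    // logEmit[t][s]: log-probability of frame t under state s; T >= 1.
    public static int[] decode(double[] logInit, double[][] logTrans, double[][] logEmit) {
        int T = logEmit.length, N = logInit.length;
        double[][] v = new double[T][N];
        int[][] back = new int[T][N];
        for (int s = 0; s < N; s++) v[0][s] = logInit[s] + logEmit[0][s];
        for (int t = 1; t < T; t++)
            for (int s = 0; s < N; s++) {
                double best = Double.NEGATIVE_INFINITY; int arg = 0;
                for (int p = 0; p < N; p++) {
                    double cand = v[t - 1][p] + logTrans[p][s];
                    if (cand > best) { best = cand; arg = p; }
                }
                v[t][s] = best + logEmit[t][s];
                back[t][s] = arg;
            }
        int[] path = new int[T];
        double best = Double.NEGATIVE_INFINITY;
        for (int s = 0; s < N; s++)
            if (v[T - 1][s] > best) { best = v[T - 1][s]; path[T - 1] = s; }
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }
}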


5.3.9 Adjustment of Segment Boundaries

Because the segment boundaries produced by the Viterbi decoding are not perfect, the LIUM tool SAdjSeg is used to adjust them. The tool ensures that the boundaries are placed in areas with low energy.

5.3.10 Speech/non-speech Filtering

At this point the speech/non-speech segmentation conducted earlier in the process is brought back. The segmentation is filtered with the speech/non-speech segmentation using the LIUM tool SFilter.

5.3.11 Splitting of Long Segments

It is often convenient that the segments are shorter than 20 seconds. To ensure this, the LIUM tool SSplitSeg is used to split long segments at low-energy areas. The tool takes GMMs trained by LIUM as input.

5.3.12 Gender and Bandwidth Selection

Detection of gender and bandwidth is done with the LIUM tool MScore together with GMMs trained by LIUM. Each cluster gets a female/male and narrow/wide-band label such that the characteristics of the chosen GMM maximize the likelihood of the features in the cluster.

5.3.13 Final Clustering

All previous steps use features that are not normalized, because this preserves information and helps distinguish speakers. However, the process is now at a point where each cluster should contain a single speaker, and it might be a problem at this stage that multiple clusters contain the same speaker. Therefore, a final hierarchical agglomerative clustering is run with the LIUM tool MClust with the option --cMethod=ce. The Universal Background Model (UBM) trained by LIUM is used, which is a fusion of the GMMs from the gender and bandwidth step.

After going through all these steps, the result is in a form that can be used by the speaker identification module. The important properties of the final result are that the segments are shorter than 20 seconds and that each contains a single voice labeled with gender and bandwidth.

5.3.14 Postprocessing

The final result of the LIUM computation is a label file where each line represents a segment in the audio file. Each line states, among other things, the start time, length, gender and which cluster the segment belongs to. To prepare for speaker identification, the audio file is split and the pieces are merged into a single audio file for each cluster.
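A sketch of this split-and-merge step with SoX is shown below. The Segment record and the file-naming scheme are hypothetical; the SoX usage itself (trim for cutting, a plain multi-input invocation for concatenation) is standard.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// For one cluster: cut each labelled segment out of the source WAVE file
// and concatenate the pieces into a single per-cluster audio file.
public final class ClusterAudio {
    public record Segment(double startSec, double lengthSec) {}

    // Precondition: segs is non-empty.
    public static void buildClusterFile(String wav, List<Segment> segs, String out)
            throws IOException, InterruptedException {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < segs.size(); i++) {
            String part = out + ".part" + i + ".wav";
            exec("sox", wav, part, "trim",
                 String.valueOf(segs.get(i).startSec()),
                 String.valueOf(segs.get(i).lengthSec()));
            parts.add(part);
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("sox"); cmd.addAll(parts); cmd.add(out);   // sox a b ... out concatenates
        exec(cmd.toArray(new String[0]));
    }

    private static void exec(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) throw new IOException(cmd[0] + " failed");
    }
}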


Figure 5.1: The use of LIUM for speaker diarization.

5.4 Speaker Identification

The selection of a speaker identification method was based on the in-depth study in Chapter 4, where it is argued that GMM is the best alternative. An important additional factor to consider when choosing a method is the availability of software libraries. A list of some libraries available for speaker identification is presented below:

Torch:9 State-of-the-art SVM and Neural Networks library written in Lua.

HTK:10 A low-level HMM toolkit primarily directed towards speech recognition.

ALIZE/LIA RAL: State-of-the-art GMM toolkit, specialized in speaker verification and written in C++.

Toolboxes for MATLAB: for example the Statistical Pattern Recognition Toolbox11.

Though there exist multiple examples of articles using ALIZE for speaker diarization [7], the amount of documentation for that use is sparse. For speaker identification, more examples and documentation exist, and ALIZE is therefore easier to use for this purpose. Being a widely used toolkit that implements GMM-UBM, ALIZE/LIA RAL (from here on referred to only as ALIZE) was chosen as the software library for speaker identification in this implementation. Though ALIZE is written in C++ and could therefore be linked directly into the program, it is more convenient to use the existing set of pre-compiled programs and start them through the command line. An illustration of the speaker identification module is shown in Figure 5.2 [4].

9 http://www.torch.ch/
10 http://htk.eng.cam.ac.uk/
11 http://cmp.felk.cvut.cz/cmp/software/stprtool/

Figure 5.2: Illustration of the speaker identification process

5.4.1 Preprocessing

For all types of data used (training and testing), some preprocessing needs to be done. Though ALIZE can be configured to use several feature file formats, SPro 4 is recommended by ALIZE. All training files are converted to the .sph format through SoX, and the feature files are then extracted using the SPro command sfbcep. The command produces MFCC feature vectors with a total of 34 coefficients.

Silence Removal

To decide which vectors in the feature files are relevant, speech detection using the ALIZE tool EnergyDetector is performed. Speech segments in the training data are acquired by taking the frames with the highest energy. The tool produces a label file stating which portions of the data are speech. Both before and after the silence removal, the energy coefficients in the feature files are normalized; for the normalization performed afterwards, only the segments labeled as speech are considered. The normalization is done using the ALIZE tool NormFeat, which applies zero-mean and unit-variance normalization.
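The normalization itself is simple to state; the from-scratch sketch below shows the zero-mean, unit-variance operation applied per feature dimension (an illustration of the operation, not ALIZE's implementation).

// Zero-mean, unit-variance normalization of a block of feature frames,
// dimension by dimension, in place.
public final class FeatureNorm {
    public static void normalize(double[][] frames) {   // frames[frame][dim]
        int n = frames.length, d = frames[0].length;
        for (int k = 0; k < d; k++) {
            double mean = 0, var = 0;
            for (double[] f : frames) mean += f[k];
            mean /= n;
            for (double[] f : frames) var += (f[k] - mean) * (f[k] - mean);
            double std = Math.sqrt(var / n);
            for (double[] f : frames) f[k] = (f[k] - mean) / (std > 0 ? std : 1);
        }
    }
}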


5.4.2 Training

Before one can start to recognize speakers, a model for each speaker in the database needs to be trained. First the UBMs are created, and then an individual GMM model for each speaker is derived from the UBMs. Training is done as a separate process, and the UBMs and GMM models are saved as data in Vidispine.

UBM Training

To increase accuracy, two UBMs are created, one for each gender. This is possible because information about the gender of the segments is obtained through the diarization process. Each UBM is built using the ALIZE tool TrainWorld and trained with the EM algorithm on data from the speakers of the corresponding gender.

Model Extraction

For each speaker to be identified, a GMM model is derived using data from the speaker and the UBM matching the speaker's gender. This is done by the ALIZE tool TrainTarget, which uses MAP estimation. To increase performance, the data used for model extraction is different from the data used to train the UBM.
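In the classical GMM-UBM recipe of Reynolds et al. [27], which this step follows, typically only the Gaussian means are adapted. For mixture component i, with x_1, ..., x_T the speaker's training frames, µ_i the UBM mean and r the relevance factor (a configuration parameter, not a value taken from the thesis), the update can be sketched as:

\begin{align*}
  n_i &= \sum_{t=1}^{T} \Pr(i \mid x_t), &
  E_i(x) &= \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t, \\
  \hat{\mu}_i &= \alpha_i\, E_i(x) + (1 - \alpha_i)\, \mu_i, &
  \alpha_i &= \frac{n_i}{n_i + r}.
\end{align*}

Components that see much speaker data (large n_i) move toward the data mean, while rarely observed components stay close to the UBM, which is what makes the adaptation robust to the short training samples used here.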

5.4.3 Testing

When testing a speaker model in the speaker identification module, the segments corresponding to an unknown speaker acquired from the speaker diarization are used. To test an unknown segment, the ALIZE tool ComputeTest is used. As input it takes the feature file of the segment, all the speaker GMMs and the UBM corresponding to the segment speaker's gender. This process outputs a score for each speaker considered. The speaker with the highest score is chosen as the identity of the segment's speaker. Since the thesis focuses on an open-set problem, the module also checks whether the score exceeds a threshold. If it is lower than the threshold, the module labels the speaker as unknown.
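The decision rule therefore reduces to a maximum over the per-speaker scores followed by a threshold test, as in the sketch below (the score map is assumed to have been parsed from ComputeTest's output).

import java.util.Map;

// Open-set identification: pick the enrolled speaker with the highest
// score; report "unknown" if even that score falls below the threshold.
public final class OpenSetDecision {
    public static String identify(Map<String, Double> scores, double threshold) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet())
            if (e.getValue() > bestScore) { bestScore = e.getValue(); best = e.getKey(); }
        return (best == null || bestScore < threshold) ? "unknown" : best;
    }
}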

5.4.4 Postprocessing

The result obtained by performing the previous steps is similar to the one from the speaker diarization module. The difference is that instead of the label file stating a cluster, the file contains the name of the speaker. The information obtained so far is somewhat unrefined, and a cleanup of the information is required to better suit the application of the plug-in. The speaker recognition will occasionally recognize small sound outbursts as someone talking. In most cases these are just random sounds such as laughter, background noise, etc. Therefore, to clean up the information, segments shorter than 1.9 seconds that have no neighboring segments within 5 seconds identified as the same speaker are filtered away.

The second postprocessing task performed on the output from the speaker identification is merging. Considering that the application of the system is adding metadata to a video specifying where different speakers are speaking, and that this information is supposed to be used by, for example, a search application, the information should be as short as possible without losing important data. For these reasons, neighboring segments that have at most 5 seconds between them and are identified as belonging to the same speaker are merged together. As a practical example, one can imagine a person searching for where a specific actor is speaking in a movie. The person searching is not interested in getting a long list of segments when many of the segments belong to the same long monologue; multiple segments may just be a result of the actor taking small thinking pauses. Instead, the user wants a single segment for the whole monologue. In theory, the merged segments are true segments as described in the theory chapter. Ideally these should already have been constructed perfectly by the speaker diarization itself, but in most cases errors and inaccuracies exist. The merging is, in a way, a reassurance that a reasonable result is acquired.
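Both clean-up passes can be sketched as follows, assuming the identified segments arrive sorted by start time; Seg is a hypothetical record of the label-file fields used here.

import java.util.ArrayList;
import java.util.List;

// Pass 1 drops segments shorter than 1.9 s with no same-speaker neighbour
// within 5 s; pass 2 merges same-speaker neighbours at most 5 s apart.
public final class SegmentCleanup {
    public record Seg(double start, double end, String speaker) {}

    public static List<Seg> clean(List<Seg> in) {
        List<Seg> kept = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Seg s = in.get(i);
            boolean tooShort = (s.end() - s.start()) < 1.9;
            if (tooShort && !hasNearSameSpeaker(in, i)) continue;   // likely an outburst
            kept.add(s);
        }
        List<Seg> merged = new ArrayList<>();
        for (Seg s : kept) {
            Seg last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && last.speaker().equals(s.speaker())
                    && s.start() - last.end() <= 5.0)
                merged.set(merged.size() - 1, new Seg(last.start(), s.end(), s.speaker()));
            else
                merged.add(s);
        }
        return merged;
    }

    private static boolean hasNearSameSpeaker(List<Seg> segs, int i) {
        Seg s = segs.get(i);
        for (int j = 0; j < segs.size(); j++) {
            if (j == i || !segs.get(j).speaker().equals(s.speaker())) continue;
            double gap = (j < i) ? s.start() - segs.get(j).end()
                                 : segs.get(j).start() - s.end();
            if (gap <= 5.0) return true;
        }
        return false;
    }
}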


Chapter 6

Results

In this chapter, the evaluation methods for testing the prototype are presented together with a discussion of the test results.

6.1 Evaluation Methods

In order to evaluate the performance of the system, a video is used as test data, and the number of correctly identified segments is counted. This video has been manually labeled so that each label contains a true segment. Each label contains information about the start and length of the segment, along with the correct speaker. A simple evaluation program comparing the manually derived label file with the label file produced by the system is used. The program also notes which types of errors the system has made, to enable evaluation of the factors that contribute to reducing the accuracy of the system.

The manual labeling of the test file was done by the author without the help of any tools. As with all elaborate manual work, the result is not perfect, and the label file contains some inaccuracy in the bounds of each segment. One can argue that for the main purpose of the application, perfect accuracy of the boundaries is not necessary. Furthermore, recognizing a small segment shot into a sequence where someone else is speaking is very difficult to handle. An example of this is when a person is speaking and another person asks something in one single word. The information that a person says one or maybe two words is not that interesting for someone searching for where someone says something valuable. Taking very short segments into account can of course be interesting in some applications, but in this thesis the choice was made to ignore them. To cover for these two possible flaws, a 5-second margin is allowed for each segment's bounds, and small segments (less than 2 seconds) in between segments from the same speaker may be ignored; in that way, the surrounding segments are seen as one large segment.


The evaluation program will be executed using a set of different thresholds. For the threshold producing the best result, an examination will be performed with focus on the diarization part and the identification part.

6.1.1 Testing and Training Set

As stated in the problem description, the test data for this system should be a TV debate. The number of speakers is allowed to be limited, but to ensure that the system is not just lucky, a certain number of speakers was chosen. Two episodes from a Swedish TV-debate series, with a total running time of 3 hours, 27 minutes and 34 seconds, were chosen. The data contains 15 people, of whom 6 are female and 9 are male. Of these people, models are trained for all but 1 female and 2 males.

The training data is taken from an environment similar to that of the test data. Two separate speech samples with a duration of 1 minute from each speaker are used to train the UBMs and the GMM speaker models.

6.1.2 Errors

When the speaker recognition system defines a segment, it can make four different errors. If a segment is recognized as belonging to a speaker in the database and the identity is wrong, the error can either be that the segment does not belong in the database or that the segment is identified as the wrong speaker. In this thesis, the first error is called false accept (FA) and the second speaker error (SpkE). If a segment is identified as belonging to an unknown speaker while the speaker actually is in the database, the error is called false reject (FR). Note that this error is caused by the threshold decision in the identification phase; the segment may originally have been matched to either the correct or the wrong speaker. The last error is when a segment defined by the system has bounds that do not exist. This error derives purely from the speaker diarization task and is called segment error (SegE).

6.1.3 Metrics

The main goal of this thesis is to look at the identification rate of the system. To obtain a better understanding of the performance of the system, some other metrics are also considered.

Detection Error Trade-off (DET)

A common evaluation metric for speaker verification is the detection error trade-off (DET), which is illustrated with a curve where the miss probability is plotted against the FA probability over a set of thresholds. As proposed by [12], SpkE and FR can be grouped together into what is here called a miss error (ME). This can be done because, despite the causes of the errors being different, the result in both cases is that a speaker segment belonging to a known speaker is wrongfully identified. The FA probability is defined as the number of FA divided by the number of segments that truly are unknown. Similarly, the ME probability is defined as the number of ME divided by the number of segments that truly belong to a known speaker.

The point of the DET plot is to show the trade-off between the two types of errors (ME and FA) as the threshold varies. The closer the curve is to the origin, the better. Related to this, there exists a performance measure called the equal error rate (EER), defined as the point where the ME and FA probabilities are equal. Good performance is indicated by a small EER, and the EER determines the optimal threshold. In addition, a curve that is close to a straight line indicates a normal likelihood distribution in the underlying system [20].
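A sketch of the threshold sweep that produces such curves is given below. Trial is a hypothetical record holding, per segment, the best model score, whether the true speaker is enrolled, and whether the top-scoring model is the correct one, so that FR and SpkE can be grouped into ME as described above.

import java.util.List;

// Sweep the decision threshold, compute the FA and ME probabilities at
// each step, and return the (approximate) equal error point.
public final class DetSweep {
    public record Trial(double score, boolean known, boolean topIsCorrect) {}

    // Returns {threshold, rate} where the FA and ME probabilities are closest.
    public static double[] eer(List<Trial> trials, double lo, double hi, int steps) {
        double bestT = lo, bestGap = Double.MAX_VALUE, bestRate = 0.0;
        for (int k = 0; k <= steps; k++) {
            double t = lo + (hi - lo) * k / steps;
            int fa = 0, unknown = 0, me = 0, known = 0;
            for (Trial tr : trials) {
                if (tr.known()) {
                    known++;
                    if (tr.score() < t || !tr.topIsCorrect()) me++;  // FR or SpkE
                } else {
                    unknown++;
                    if (tr.score() >= t) fa++;                       // false accept
                }
            }
            double faP = unknown == 0 ? 0 : (double) fa / unknown;
            double meP = known == 0 ? 0 : (double) me / known;
            if (Math.abs(faP - meP) < bestGap) {
                bestGap = Math.abs(faP - meP);
                bestT = t;
                bestRate = (faP + meP) / 2.0;
            }
        }
        return new double[] { bestT, bestRate };
    }
}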

Identification Rate

The accuracy of the system is measured by an identification rate (IR), defined as the number of correctly found and identified segments divided by the total number of true segments in the video. As explained previously, the system is allowed to ignore small segments within or between another speaker's segments, which means that the total number of true segments is not a constant. The evaluation program goes through the reference label file and searches for the same segment in the label file that the system has computed.

Since the evaluation program counts the number of segments the system got right, the identification rate benefits from identifying many small segments rather than fewer, longer segments. This type of segment-based evaluation was chosen over a duration-based evaluation because the purpose of the system is to index where people are speaking. When searching, the duration is secondary information: one can imagine searching for where a speaker speaks and then watching the video from that point on.

Identification and Segment Accuracy

The system consists of two independently working modules, which together contribute to the IR value for the whole system. To get a better understanding of how well these two parts work, the accuracy of each module is computed. The segment accuracy is defined as the number of segments found divided by the number of segments that truly exist. The system is tested as a black box, and therefore this value only says something about the segmentation part of the diarization and nothing about the clustering. The identification accuracy is the number of correctly identified segments divided by the number of segments found.
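In other words, the three figures reduce to simple ratios over the evaluation counts:

// The three headline metrics as computed by the evaluation program.
public final class Metrics {
    // IR: correctly found and identified segments over all true segments.
    public static double identificationRate(int correct, int trueSegs) {
        return (double) correct / trueSegs;
    }
    // Segment accuracy: segments found over all true segments (diarization).
    public static double segmentAccuracy(int found, int trueSegs) {
        return (double) found / trueSegs;
    }
    // Identification accuracy: correctly identified over segments found.
    public static double identificationAccuracy(int correct, int found) {
        return (double) correct / found;
    }
}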


6.2 Testing Results

In the following subsections the results from the testing are presented.

6.2.1 DET Curve and EER

Figure 6.1: DET curve

In Figure 6.1 the DET curve is shown. As mentioned previously, the curve illustrates the relationship between the FA and ME probabilities in the dataset and relates to the identification task in the system. Because of this, errors related to segment misses (SegE) have not been taken into account when calculating the probabilities. The curve shows that the male identification is significantly better than the female identification, with the total EER becoming 15.4%. The total EER is more affected by the male speakers because the total number of segments spoken by males is higher. Ideally the curves should be linear, but the ME probability rapidly reaches its minimum and maximum, and there are few points lying between these edge values. In the case of the females, the ME probability is either 0 or 50. A reason for the appearance of the curves may be that the number of speakers is relatively low, so that fewer factors affect the probabilities.

Figure 6.2: FA and ME probability in relation to the threshold, together with the number of SegE for the whole dataset

In Figure 6.2, which serves as a complement to the DET curve, the relationship between the FA and ME probabilities and different thresholds is shown. It was produced to help obtain the optimal threshold. The optimal threshold is where the FA and ME probabilities are equal, and for the total set of test data this threshold is 0.346. Though SegE is not taken into account in the probabilities, the number of SegE is shown in Figure 6.2 on a secondary Y-axis. This curve shows that the SegE reaches its minimum a little before the point where the FA and ME probability curves cross, but the match is good enough that the crossing point can be chosen as the threshold for the whole system.

The graph also shows the EER for this threshold, which is the same as found in the DET curve. A summary of the threshold and EER for the two genders and the total dataset is shown in Table 6.1. The results further show that the system clearly has problems identifying the female speakers. For the female speakers, the EER values differ between the DET curve and the graph in Figure 6.2. As mentioned, the female ME probability is only ever 0 or 50, and therefore the curve shown in Figure 6.1 is just a line drawn between these two values. Since the EER in Figure 6.2 takes more values into account, it is the one chosen.

         Threshold   EER (%)
All        0.346     15.4
Male       0.342     6.25
Female     0.445     50

Table 6.1: Threshold and EER for the dataset

6.2.2 Identification Rate

When running the system with the optimal threshold, the identification rate is shown to be 63%, which is considerably lower than the goal of 85%. Looking at the genders in isolation, the IR is 77% for the male speakers and as low as 49% for the female speakers. The distribution of the types of errors is shown in Table 6.2.

The dominating single error type is the SegE, which may point toward the diarization module being the weaker link in the system. However, all the other error types derive from the identification module, and taken together they are comparable in number to the SegE errors; relative to the number of segments each module processes, the identification module actually errs more often.


Error Type   All   Male   Female
SpkE          53    10     43
SegE         102    37     60
FA            13     4      6
FR            22    15      6
Total        190    66    115

Table 6.2: Overview of the distribution of the errors

This is confirmed by the identification and segment accuracies presented in Table 6.3.

Accuracy (%)     All    Male   Female
Identification   79     88.4   66.9
Segment          89.8   96.5   80.9

Table 6.3: Identification and segment accuracy

6.3 Discussion

This section presents the discussion of the results.

6.3.1 Gender Difference

For all parts of the system, finding and identifying female speakers is shown to be much harder than finding and identifying male speakers. One can say that the system finds it difficult to distinguish the female voices from each other: both the diarization and the identification modules have problems separating female voices. The fact that the same problem occurs for the two different methods and libraries tested clearly suggests that this is a general problem related to the basic properties of female voices. This is in fact a known problem, studied in among others [6]. In that article, the mel scale is discussed and shown to be quite ideal for males, but to a lesser extent for females, with one explanation being the influence of pitch in female speech.

6.3.2 Clustering

The results in Table 6.3 show the identification module producing the majority of the errors. However, the truth may not be that simple, and the identification module may be wrongfully blamed. With the help of the metrics used for testing the system it is possible to say something about all the sub-parts of the system except the clustering task. This is unfortunate because the clustering may affect the result of the identification in a negative manner. The reason is that the identification is performed on the clusters and not on the individual segments. This means that if the clustering is done incorrectly, a cluster may consist of multiple speakers. If this is the case, it will confuse the identification; there is no guarantee that the cluster will be identified as belonging to the main speaker in the cluster. Secondly, when the identification of the cluster is done, all segments belonging to the cluster are assigned to that speaker. Per se, it is not the identification of the individual segments that is wrong. Without further testing of the individual parts of the system it is of course hard to say that this is the main problem, but informal testing of the individual parts shows signs that this is in fact the case.

6.3.3 Metrics

The metric values presented in this section are based on counting individual segments. This means that recognizing a short segment counts the same as recognizing a longer segment. In some contexts this can of course be seen as somewhat unfair, and the results might have looked better if the values were measured in a different way. However, for the intended use of the plug-in in Vidispine as a search tool, counting segments is considered the more fitting method.

6.3.4 Test Data and Filtering

To compensate for the somewhat unfair measurement used in the metrics, where the number of segments is counted, both the system and the evaluation program do a lot of filtering. This means that the segment bounds are measured with a considerable margin and that small segments are filtered away. This is introduced to enhance the performance of the system. The test data is taken from a real environment (a TV debate), and because of this it is not always reasonable to expect a computer, or a human, to find perfect boundaries and to identify the slightest changes. For example, in a TV debate, multiple speakers may sometimes talk at once. By performing filtering, the result is of course affected in a positive direction. However, the filtering in many ways correlates with the purpose of the plug-in, in which small segments and noise are of less interest to the intended user. Because of this, the choice to filter feels natural, and not just a trick to make the results in this thesis look better.

6.3.5 Possible Improvements

Below, some suggestions for small changes to the system that may help improve the results are presented.


Separate Thresholds

The relationship between FA and ME is not the dominating factor behind the unsatisfying result; however, all improvements help. An easy way to improve where the threshold is set is to use a gender-dependent threshold. The optimal thresholds for the two genders are so far apart that such a change should give some positive effect.

Extended UBMs

Since the UBMs are trained using only data from the speakers in the test set, an idea would be to train the UBMs using a bigger and more diverse set of speakers to make them more robust. A more robust UBM would also be a nice, if not to say necessary, extension if there is a wish to expand the speaker database. This improvement applies directly to the identification module, and may therefore help improve the identification accuracy and thereby the IR.

Re-diarization

The system in its current state performs diarization on the whole test data. Among the information extracted from this process is the gender of the different segments. The diarization module has few problems separating the genders from each other, but struggles more with separating speakers of the same gender. A possible improvement to the system may be to do a re-diarization, so that diarization is done in two runs. In the first run the diarization is run on the whole dataset. Then diarization is performed again, but this time separately on the parts of the data containing females and males. The differences between the speakers within each gender are hopefully amplified by the re-diarization, so that the system can distinguish the speakers more easily.


Chapter 7

Conclusions

In this chapter, the work performed in this thesis is summarized, together with suggestions for future work.

7.1 Present Work

In this thesis a prototype of a plug-in for Vidispine that performs speaker recognition has been developed and evaluated. The plug-in was developed with the help of two libraries, LIUM and ALIZE, for speaker diarization and speaker identification respectively. The plug-in is a text-independent speaker identification system for an open-set database, with focus on application to videos with content related to TV debates, broadcast news and conference meetings. With the help of this plug-in, the idea is to extend the application and develop a search tool that makes it possible to search through a large media database for a given speaker stored in a database. The software development of this plug-in is to be seen as the main work of this thesis, and because of this, the majority of the time was spent on developing the plug-in.

While the choice of speaker diarization library was based on informal testing, the method used for speaker identification was obtained through a literature study of several available and well-known methods. The choice was made after evaluating which method was most suitable for use in the plug-in. The literature study and evaluation are part of a mandatory in-depth study for this master's thesis.

Evaluation of the plug-in was done by comparing a manually obtained speaker label file with a similar file produced by the plug-in. The results of the testing did unfortunately not reach the goals set beforehand. The identification rate became 63%, which is quite far from the goal of 85%. Further investigation of the results shows a significant difference in how the system handles the two genders: the male speakers are recognized correctly to a much higher extent. The hope is that in the future the system can be further improved and extended. In the work on this thesis the author experienced first-hand just how time-consuming manually labeling a video is. Additionally, doing speaker labeling manually also produced errors; multiple times, errors in the manually labeled file were found when comparing it to the file produced by the speaker recognition system.

It is always frustrating not to reach the goals one sets, but in this case it is easy to see improvements that could be implemented if more time were available. The hope is that this thesis makes a good starting point for a plug-in that can be used in Vidispine.

7.2 Future Work

If the aim is to improve the existing system without changing the libraries, applying the changes proposed in Section 6.3.5 would be a suitable first step. Further, the potential of both libraries has not been fully examined and should be looked into. Both libraries use a large set of parameters, and exploring these further would be a natural continuation of this work. Better results can also be achieved by applying bigger changes to the system, for example replacing the libraries. Since the two modules and the libraries they use work separately from each other, there is no restriction on replacing just one of the libraries if this is desirable. There is even room to use additional libraries to deal with the issues that exist, for example the gender problem.

When changing a library, one may wish to find another library that uses the same methods as the existing system, but also to try other methods. Of most interest would be to try the more state-of-the-art methods. A suggestion is to use the combination of SVM and GMM, so that the work in this thesis regarding identification methods can be extended.

The purpose of the work herein is to integrate this plug-in with similar plug-ins for speech and face recognition into a single powerful search tool for a media database. Further work will hopefully continue into developing this search tool.


Chapter 8

Acknowledgements

I would like to thank my supervisor Henrik Björklund for reading my report and giving his advice and support. Thank you to Codemill for giving me the opportunity to work on this thesis. It has been inspiring to be part of a great working environment, where people are always there to help out. A special thanks to my supervisor Thomas Knutsson for always being there to answer all my questions, having a positive attitude and giving his support. Finally, a big thanks to my family for their unconditional support regardless of where in the world I may be.


References

[1] Xavier Anguera. Robust Speaker Diarization for Meetings. PhD thesis, Universitat Politècnica de Catalunya, 2006.

[2] Jing Bi and Shu-Chang Liu. A speaker identification system for video content analysis. In Intelligent Information Hiding and Multimedia Signal Processing, 2008. IIH-MSP '08. International Conference on, pages 200–203, 2008.

[3] Frédéric Bimbot, Jean-François Bonastre, Corinne Fredouille, Guillaume Gravier, Ivan Magrin-Chagnolleau, Sylvain Meignier, Teva Merlin, Javier Ortega-Garcia, Dijana Petrovska-Delacrétaz, and Douglas A. Reynolds. A tutorial on text-independent speaker verification. EURASIP J. Adv. Sig. Proc., 2004(4):430–451, 2004.

[4] Jean-François Bonastre, Frédéric Wils, and Sylvain Meignier. ALIZE, a free toolkit for speaker recognition. In Proc. ICASSP, volume 5, pages 737–740, 2005.

[5] Scott Shaobing Chen and P. S. Gopalakrishnan. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In Proc. DARPA Broadcast News Transcription and Understanding Workshop, pages 127–132, 1998.

[6] J. S. Mason and J. Thompson. Gender effects in speaker recognition. In Proc. ICSP-93, Beijing, pages 733–736, 1993.

[7] N. Evans, S. Bozonnet, Dong Wang, C. Fredouille, and R. Troncy. A comparative study of bottom-up and top-down approaches to speaker diarization. Audio, Speech, and Language Processing, IEEE Transactions on, 20(2):382–392, February 2012.

[8] Evgeny Karpov, Tomi Kinnunen, and Pasi Fränti. Symmetric distortion measure for speaker recognition. In Proceedings of the 9th International Conference on Speech and Computer, pages 366–370, September 2004.

[9] K. R. Farrell, R. J. Mammone, and K. T. Assaleh. Speaker recognition using neural networks and conventional classifiers. Speech and Audio Processing, IEEE Transactions on, 2(1):194–205, January 1994.


[10] L. Feng. Speaker recognition. Master's thesis, Informatics and Mathematical Modelling, Technical University of Denmark, DTU, Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, 2004. Supervised by Prof. Lars Kai Hansen.

[11] Hou Fenglei and Wang Bingxi. Text-independent speaker recognition using support vector machine. In Info-tech and Info-net, 2001. Proceedings. ICII 2001 - Beijing. 2001 International Conferences on, volume 3, pages 402–407, 2001.

[12] Chao Gao, Guruprasad Saikumar, Amit Srivastava, and Premkumar Natarajan. Open-set speaker identification in broadcast news. In ICASSP, pages 5280–5283. IEEE, 2011.

[13] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multi-class support vector machines. Neural Networks, IEEE Transactions on, 13(2):415–425, March 2002.

[14] S. M. Kamruzzaman, A. N. M. Rezaul Karim, Md. Saiful Islam, and Md. Emdadul Haque. Speaker identification using MFCC-domain support vector machine. CoRR, abs/1009.4972, 2010.

[15] T. Kinnunen, T. Kilpeläinen, and P. Fränti. Comparison of clustering algorithms in speaker identification. In IASTED Internat. Conf. on Signal Processing and Communications (SPC 2000), pages 222–227, Marbella, Spain, September 2000.

[16] Tomi Kinnunen and Haizhou Li. An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 52(1):12–40, 2010.

[17] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. Communications, IEEE Transactions on, 28(1):84–95, January 1980.

[18] Amit Malegaonkar and Aladdin Ariyaeeinia. Performance evaluation in open-set speaker identification. In Proceedings of the COST 2101 European Conference on Biometrics and ID Management, BioID'11, pages 106–112, Berlin, Heidelberg, 2011. Springer-Verlag.

[19] Amruta Anantrao Malode and Shashikant Sahare. Advanced speaker recognition. International Journal on Advances in Internet Technology, 4:443–455, July 2012.

[20] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki. The DET curve in assessment of detection task performance. In Proc. Eurospeech, pages 1895–1898, 1997.


[21] S. Meignier and T. Merlin. LIUM SpkDiarization: an open source toolkit for diarization. In CMU SPUD Workshop, Dallas, Texas, USA, March 2010.

[22] J. Oglesby and J. S. Mason. Optimisation of neural models for speaker identification. In Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference on, pages 261–264 vol. 1, April 1990.

[23] D. O'Shaughnessy. Speech Communications: Human and Machine. Institute of Electrical and Electronics Engineers, 2000.

[24] Ravi P. Ramachandran, Kevin R. Farrell, Roopashri Ramachandran, and Richard J. Mammone. Speaker recognition - general classifier approaches and data fusion methods, 2002.

[25] D. A. Reynolds and R. C. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. Speech and Audio Processing, IEEE Transactions on, 3(1):72–83, January 1995.

[26] Douglas A. Reynolds. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication, 17(1-2):91–108, 1995.

[27] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3):19–41, 2000.

[28] L. Rudasi and Stephen A. Zahorian. Text-independent talker identification with neural networks. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International Conference on, pages 389–392 vol. 1, April 1991.

[29] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.

[30] Stuart J. Russell, Peter Norvig, John F. Candy, Jitendra M. Malik, and Douglas D. Edwards. Artificial Intelligence: A Modern Approach. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1996.

[31] Ubai Sandouk. Speaker recognition, speaker diarization and identification. Master's thesis, The University of Manchester, School of Computer Science, 2012.

[32] F. Soong, A. Rosenberg, L. Rabiner, and B.-H. Juang. A vector quantization approach to speaker recognition. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '85, volume 10, pages 387–390, April 1985.


[33] T. Stadelmann. Voice Modeling Methods: For Automatic Speaker Recognition. Südwestdeutscher Verlag, 2010.

[34] Hao Tang, Zhixiong Chen, and Thomas S. Huang. Comparison of algorithms for speaker identification under adverse far-field recording conditions with extremely short utterances. In Proceedings of the IEEE International Conference on Networking, Sensing and Control, ICNSC 2008, Hainan, China, 6-8 April 2008, pages 796–801. IEEE, 2008.

[35] Y. Zhang, C. J. S. DeSilva, R. Togneri, M. Alder, and Y. Attikiouzel. Speaker-independent isolated word recognition using multiple hidden Markov models. Vision, Image and Signal Processing, IEE Proceedings, 141(3):197–202, June 1994.