Download - VoiceGrid™ VOICE RECOGNITION TECHNOLOGY WHITE PAPER

VoiceGrid™

VOICE RECOGNITION TECHNOLOGY WHITE PAPER

October 2011

Voice recognion technology can be ulized to provide unique and valuable

enhancements to law enforcement invesgaons and idenficaon opera-

ons. When discussing voice recognion as it is applicable to law enforcement,

it is important to define the modality to avoid any confusion with its intent or

purpose. Unlike speech recognion which is intended to understand the spo-

ken word or voice verificaon which is ulized primarily for physical or logical

access in a 1:1 modality, voice recognion is the ability to idenfy who is

speaking, independent of text or language as well as other variables which will

be discussed later in this paper. Addionally the idenficaon of the voice

should be able to be performed in a 1:N method against a few, hundreds,

thousands and even millions of other voice samples. This whitepaper will dis-

cuss how voice recognion can be used in the law enforcement environment,

the challenges that the technology must address to be effecve and the possi-

ble use cases.

In order to understand where we are now and the direcon we are moving

towards with voice recognion in the law enforcement environment it is im-

portant to understand a li*le history of voice and audio technology in general.

Audio evidence has been accepted in US courts since 1958 as determined by

the decision of United States v. McKeever, 169 F. Supp. 426 (SDNY 1958)1. Ad-

dionally audio evidence has passed Frye, Daubert and Federal Rules of Evi-

dence challenges. Historically however and as is sll the case today, one would

need to enlist the experse of a Forensic Audiologist or an “Audio Expert” in

order to produce the data, reports, charts etc. required to tesfy as to the va-

lidity and authencity of audio evidence. The process of extracng and pro-

ducing this informaon in a format that could be comprehended by a layper-

son (i.e. jury) was a manual and somemes tedious pracce. SpeechPro was

and connues to be a world leader in developing technology for the manual

processing and analysis of audio data and it was largely upon this technology

that the company was founded in 1990. However during the evoluon of

SpeechPro in the last 21 years, it was recognized that if some of these manual

processes could be automated and performed by a non-expert or praconer,

and further if said processes could be delivered to perform in real-me or near

INTRODUCTION SCOPE

SpeechPro offers

leading edge voice

recognion tech-

nologies under the

VoiceGrid™ brand-

ing. Automated

voice recognion is

now being ulized

by law enforcement

agencies and can

leverage both ex-

isng and new au-

dio data to provide

value in invesga-

ve and operaon-

al efforts. This ar-

cle describes the

principles and ap-

plicaons of the

VoiceGrid™ tech-

nology.

© 2011 SpeechPro, Inc.—All rights reserved 1

real-me, the resulng technology could have dramac posive ramificaons

to law enforcement endeavors. To that end SpeechPro assembled a team of

over 100 developers, technical specialists and research sciensts whose work

has led to the current set of VoiceGrid™ voice recognion deliverables availa-

ble today.

Voice recognion draws similar parallels to fingerprint and face recognion

biometrics in that a voiceprint also has available parameters that provide

unique informaon about a person. These parameters are individual to each

person, are created through the specific anatomical makeup of a person and

therefore can provide disncve informaon about the person’s identy.

However while there are similaries to other biometrics, voice recognion has

some unique challenges which must be overcome in order to employ it in a

praccal and effecve manner. Voice recognion requires not only the ability

to do the core voice matching but also a significant amount of pre-processing

in order to normalize the samples and account for variables which include,

background noise, differences in the microphones used to obtain the record-

ings and differences in the devices from which the recording was obtained.

Collecvely these are known as Channel Effects. Addionally, once the voice

sample has been collected, the technology must compensate for voice variabil-

ity such as emoonal state, illness and physical condion such as slurring of

THE VoiceGrid™ PROCESS

FINGERPRINT FACEPRINT

VOICEPRINT

SpeechPro VoiceGrid™ Voice Recognion White Paper |


Voice recognion

shares similaries

with other biomet-

rics for unique hu-

man idenfiers but

is subject to other

challenges beyond

matching capabili-

es which must be

addressed in order

for praccal imple-

mentaon. Voice-

Grid™ technology

accounts for these

obstacles.

words if the subject is under the influence of drugs or alcohol. SpeechPro

VoiceGrid™ technology uses an automac mul-layer pre-process which in-

cludes the following steps:

• Voice data is separated from other sounds – music, clicks, buzzing, horns,

crackling, etc.

• Sound to noise rao and reverb levels are calculated

• Type of channel is detected

• Data is re-sampled to 11.025 kHz

• Noise suppression, equalizaon & distoron compensaon algorithms are

applied

Once the audio sample is pre-processed the system then performs the Feature

Extracon, Voice Model Building & Idenficaon steps which include:

• Automac extracon of voice data

• The extracted biometric traits are saved as individual voice models

The voice models are then compared to voice models in the database. The sys-

tem uses three independent methods to achieve this which are:

⇒ Pitch Sta�s�cs

⇒ Spectral-Formant

⇒ Gaussian Mixture Models & Support Vector Machine Classifica�ons

(GMM & SVM)

Pitch Sta�s�cs

The Pitch Stascs Method (PSM) contains 16 different pitch parameters, in-

cluding average pitch value, maximum, minimum, median, percent of areas

with rising pitch, pitch logarithm variaon, pitch logarithm asymmetry, pitch

logarithm excess and 8 addional parameters.

The image above is an example of automated pitch extracon in the phrase

“forensic audio” pronounced by two different persons.



The VoiceGrid™

technology uses

three unique and

disnct methods to

achieve a high de-

gree of matching

accuracy.

Spectral-Formant

An example of formant representaon of the phrase “Forensic audio” pro-

nounced by two different persons is shown in the picture (The horizontal axis

is me in seconds. The vercal axis is frequency in Hertz. Energy level is depict-

ed by the darkness of the trace).

GVM & SVM

Voice data is converted into stascal data and compared in a proprietary

method.

Decision Making

In the final step of voice recognion, decision making is performed and results

are returned not only for each matching method but also in a fusion decision

where all results are combined and averaged. The fusion decision results are

returned in a ranked order with a % score of 1-100. The reason for the fusion

decision is not only to provide a score that is easy to assess but also because

the various algorithms used have different abilies to recognize voices with

different variability such are shown in the table on the following page:



The Fusion Decision Example:

It should be noted here that the results returned by the technology should be

interpreted for the purpose of inclusion or exclusion of subjects and that if

posive idenficaon is required for court evidenary purposes, the services

of a forensic audio expert should be enlisted.

Accuracy

As with all 1:N biometrics, overall accuracy will be a product of the quality of

the data but with voice recognion it is also dependent on the quanty or

length of the voice data collected. In the following graph a sharp increase in

accuracy is noted within the first 15 seconds of voice data aOer which the ac-

curacy levels off .

VoiceGrid™ PERFORMANCE



The fusion decision

combines the re-

sults from each al-

gorithmic match

into a single score

which provides for

greater overall

compensaon of

speech signal char-

acterisc strengths.

Matching accuracy

is a product not just

of quality but also

of quanty.

System Scaling

The VoiceGrid™ technology can be scaled to address most any performance

needs. System performance will be a product of three factors which are data-

base size, enrollment quanty and response me. The average performance

based on a single core 2 Duo CPU @ 2GHz with 2GB RAM would be enrollment

of 6-12 records per minute and 3,000 to 40,000 comparisons per minute how-

ever the system can be opmized to perform over 1 Million matches per mi-

nute.

A key element in applying voice recognion technology is the creaon of the

voice databases. One reason that voice recognion is ideal for law enforce-

ment is the availability of data already being collected. 911 calls, suspect inter-

views, court recordings, voice mail, recordings made while speaking to com-

mercial service providers such as banks, cell phone companies, cable TV com-

panies can all provide a passive source for collecng data. Acve recording

such as during the booking process can provide an ideal known voice sample.

SpeechPro has developed a “best pracces” document for consideraon when

collecng voice data during the booking process. This includes such elements

as prompng the subject to say the alphabet, counng to 30 and stang their

name and address. However because of the text independent nature of the

technology it is not so important what the subject says as how much they say

or in other words as long as they are talking, the necessary data can be collect-

ed. In all cases the SpeechPro VoiceGrid™ technology has the ability to analyze

each recording, collect the necessary amount of audio informaon and pro-

vide a prompt back to the operator as to whether the sample is adequate for

matching. In the most recent VoiceGrid™ update the ability to automacally

separate the voices within a 2-person dialog and send each voice individually

for matching has been added.

Use cases for voice recognion are many and varied but overall voice recogni-

on is applicable in any situaon where audio may be the only lead or evi-

dence. Such cases include domesc abuse or calls in violaon of protecon

orders, prank or false emergency calls to 911, terrorisc threat calls, agency

radio communicaon abuse, inmate call monitoring, kidnapping, extoron,

corrupon, gang & organized crime communicaons and audio from video

VOICE DATA COLLECTION

USE CASES



recordings where the speakers are not visually idenfiable. In the following

scenarios the voice recordings of interest (VROI) could be obtained from any

number of different sources from both crimes and invesgaons including ter-

rorisc threats, kidnapping ransom calls, messages leO in violaon of protec-

on orders, gang or organized crime acvity etc.

Voice Recogni�on without a known database – 1). A VROI is obtained . 2.)

Invesgators interview a suspect pool and record the sessions. 3.) VoiceGrid™

Voice recognion is employed to match the original (latent) recording against

the suspect pool. 4.) Results are returned with confidence ranking and inves-

gators can include or exclude suspects.

Voice Recogni�on with a known database – 1.) A VROI is obtained . 2.) Inves-

gators employ VoiceGrid™ to send the unknown recording for possible match

against the previously obtained known database. 3.) A possible candidate list is

returned with confidence ranking. 4.) Invesgators can include or exclude sus-

pects.

Real Time Voice Recogni�on with a known watch list database – 1.) Invesga-

tors set up the lawful intercepon of mulple phone lines looking for a specific

individual whose voice is in a known VoiceGrid™ database. 2.) VoiceGrid™

technology monitors each phone line connually sampling the voices. 3.)

VoiceGrid™ technology idenfies a match and alerts the invesgator to which

phone line contains the voice of the suspect.

Audio Recording Indexing with a known watch list database – 1.) Invesga-

tors collect a large amount of recorded dialog from a lawful phone or eaves-

dropping intercepon. 2.) Invesgators are only interested in specific suspect

voices. Recordings are played into the VoiceGrid™ technology. 3.) VoiceGrid™

technology returns a report on where the specific voices have been idenfied

within the recordings. 4.) Invesgators can review the recordings which con-

tain voices only germane to their invesgaon.

In the preceding examples, voice recognion is employed by non-expert oper-

ators for inclusionary or exclusionary purposes. In all cases a decision could be

made to send the collected voice data on to a forensic audio expert for further

evidenary development but the inial assessment has now potenally pro-

vided value that would not otherwise be appreciated in the absence of the

automated capabilies.



The VoiceGrid™ technology has been developed to opmize voice recognion

matching performance and to compensate for typical variability of audio sam-

ples. It however is not fool-proof nor can it correct all issues with poor sam-

ples. As with any biometric technology the results will only be as good as the

quality of the data a*empng to be matched. In the early days of facial recog-

nion, inial a*empts to use exisng mug shot photos would in many cases

produce disappoinng results as the photographs had not been inially ac-

quired with facial recognion in mind and therefore lacked the quality re-

quired due to challenges with resoluon, pose or lighng. In much the same

way, some audio recordings have not been obtained and/or saved in the best

condions or formats to provide opmal voice recognion performance. While

one of the major benefits of voice recognion is that no “special” equipment

needs to be employed other than a good quality microphone, one should be

cognizant of the recording environment and also understand the difference

between high quality recording devices and their inferior low quality counter-

parts. As an example, a high quality microphone poorly placed within a room

where it is suscepble to echoes, ambient noise from fluorescent lighng and

too far away from the subject will most likely not produce a quality recording.

Conversely, a low quality recorder with a low sampling rate and no built-in

noise suppression will also most likely provide poor recording quality. This

does not mean that recordings for voice recognion need to take place within

a “sound studio” environment but rather it is an example to illustrate that re-

cordings of acceptable quality can be obtained at a negligible cost and equip-

ment difference if forethought is applied to the fact that they will be used at

some point for voice recognion and consideraons to that end are made in

advance.

Implementaon of the VoiceGrid™ technology can be achieved in several

ways. SpeechPro offers turn-key applicaons in various configuraons to ac-

commodate different size databases and system architectures from single

stand-alone applicaons which can run on a laptop or PC to LAN or WAN sys-

tems which support mulple workstaons, enrollment staons and databases.

Addionally SpeechPro can offer development support to interface Voice-

Grid™ technology to an exisng system. SpeechPro also offers the VoiceGrid™

SDK which allows an exisng soluons provider who has development re-

sources to integrate the VoiceGrid™ technology as an addional or new mo-

dality to an exisng idenficaon plaTorm.

ADDITIONAL CONSIDERATIONS

VoiceGrid™ IMPLEMENTATION



Consideraon

should now be giv-

en to the fact that

voices are now be-

ing recorded for

eventual voice

matching and not

just for playback

and review at a lat-

er date.

The VoiceGrid™

product line in-

cludes several ap-

plicaons and tools

to accommodate a

wide-range of cus-

tomer needs.

Voice recognion for law enforcement has been implemented primarily out-

side of the United States at this me. The best example is the naonwide sys-

tem put into place by the Mexican Federal Police. First implemented in 2010,

this system ulizes VoiceGrid™ technology to provide voice recognion against

a database of over 600,000 records accessed by over 250 jurisdicons across

the country. The system is projected to grow to over 1 million records during

2012 and has been instrumental in assisng to solve both kidnapping and cor-

rupon cases. The database is comprised primarily of voice samples obtained

during the criminal booking process as well as voice records from public work-

ers.

In the United States, SpeechPro has obtained commitment from a State Jusce

Agency to deploy a pilot system for the purposes of performance studies and

further development of best pracces for obtaining voice samples in a booking

environment and is currently working with a systems integrator to deliver the

beta version of the system by the end of 2011.

Addionally, voice recognion is a recognized emerging technology at the fed-

eral level in the United States as demonstrated by the endeavors and com-

ments of the following agencies:

NIST

• Developing Standards for Type 11 Record for Voice Biometrics

• Yearly Evaluaons Since 1996

FBI NGI

“The future of iden�fica�on systems is currently progressing beyond the de-

pendency of a unimodal (e.g., fingerprint) biometric iden�fier towards mul�-

modal biometrics (i.e., voice, iris, facial, etc.).”

FBI Biometric Center of Excellence

“A popular choice for remote authen�ca�on due to the availability of devices

for collec�ng speech samples (e.g., telephone network and computer micro-

phones) and its ease of integra�on.”

CURRENT USE



Voice recognion technology can add new and useful funconality to law en-

forcement invesgave and idenficaon operaons. The technology is cur-

rently viable for implementaon and has already been proven successful in

large naonal installaons.

Voice recognion certainly can be used in an individual modality but should

also be strongly considered as part of a mul-modal biometric program and as

an addional crime solving tool for law enforcement agencies.

In addion to voice matching, other voice and audio soluons & applicaons

should be considered as part of a comprehensive plan to best leverage exisng

and new audio data. Audio acquision, recording, noise filtering, audio au-

thencity and forensic audio processing are all integral elements to an effi-

cient, praccal and producve law enforcement audio program.

SpeechPro offers soluons to address all these aspects and is proud to be a

world leader in audio and voice technology. Our VoiceGrid™ technology repre-

sents the “best of breed” in commercially deployable voice recognion solu-

ons, but as with any biometric it is the sum of the performance capabilies,

operaonal procedures, quality of data and management of expectaons that

will define its effecveness in the long term.

For more informaon please contact:

SpeechPro

369 Lexington Avenue, Suite 316

New York, NY 10017

(646) 237-7895

www.speechpro.com

[email protected]

For immediate inquiries please contact:

Bill Michael

Na�onal Sales Director

(610) 867-8165

[email protected]

References—1 LAW AND THE EXPERT WITNESS- THE ADMISSIBILITY OF RECORDED EVI-

DENCE - Tom Owen, Jennifer Owen, Jill Lindsay, Michael McDermo*, Owl Invesga-

ons, Inc., Frank M. McDermo* Ltd. 2005



CONCLUSION The VoiceGrid™

Product Line

VoiceGrid™ Lab – stand-

alone applica�on for 1:N

searches of up to 100

records per individual

inves�ga�on.

VoiceGrid™ LN – single

server network solu�on

for 1:N searches, supports

a database of up to

10,000 records and up to

10 worksta�ons.

VoiceGrid™ ID – net-

worked 1:N search and

voice data management

solu�on with unlimited

database size and web

client worksta�on con-

nec�vity.

VoiceGrid™ X – a stand-

alone applica�on for

speaker iden�fica�on in

mul�ple files. Performs a

N:N search of up to 100

target speakers in up to

10,000 records per day.

VoiceGrid™ RT – distribut-

ed solu�on for real-�me

speaker iden�fica�on in

communica�on channels.

Integrates with voice

database up to 10,000

target speakers.

VoiceGrid™ SDK provides

the same tools and algo-

rithms available within

the VoiceGrid™ line but

with the ability for busi-

ness partners with devel-

opment resources to inte-

grate the technology into

their own plaForms.