
All Your Voices Are Belong to Us: Stealing Voices to Fool Humans and Machines

Dibya Mukhopadhyay, Maliheh Shirvanian, Nitesh Saxena
University of Alabama at Birmingham, USA

Premise

• We leave voice traces behind
• How difficult is it to make a machine talk like you?
• What are the consequences?
• Voice is used as a biometric -> attacking voice-based user authentication systems
• Voice makes us known to people -> attacking arbitrary speech contexts


Voice Morphing

• TTS Voice Synthesis (e.g., [AT&T voice synthesizer])

• Voice Conversion (e.g., Festvox)

Trained Voice Conversion System

• Training: source (attacker) speaker samples and target (victim) speaker samples are used to learn a mapping from the source voice to the target voice (sketched below)
• Testing
  • Input: samples in the attacker's voice
  • Output: samples spoken in the victim's voice. Voila!
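To make the training/testing split concrete, here is a minimal, hypothetical sketch of a trained conversion system. It is not Festvox's actual algorithm: it aligns parallel source/target utterances with DTW over MFCCs and fits a ridge-regression frame mapping, whereas real systems use joint-GMM or neural mappings plus a vocoder to resynthesize a waveform. All paths and function names are illustrative.

```python
# Toy voice-conversion sketch (NOT the Festvox pipeline): align parallel
# source/target utterances with DTW over MFCCs, then learn a per-frame
# linear map from source features to target features.
import librosa
import numpy as np
from sklearn.linear_model import Ridge

def mfcc(path, sr=16000, n_mfcc=24):
    """Load a wav file and return its MFCC frames, shape (n_mfcc, T)."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def train_conversion(src_paths, tgt_paths):
    """Fit a frame-wise source->target feature map from parallel utterances."""
    X, Y = [], []
    for s_path, t_path in zip(src_paths, tgt_paths):
        S, T = mfcc(s_path), mfcc(t_path)
        # DTW aligns the two utterances; wp holds matched frame-index pairs.
        _, wp = librosa.sequence.dtw(X=S, Y=T)
        for i, j in wp[::-1]:
            X.append(S[:, i])
            Y.append(T[:, j])
    model = Ridge(alpha=1.0)
    model.fit(np.array(X), np.array(Y))   # the trained conversion model
    return model

def convert(model, src_path):
    """Map an arbitrary attacker utterance into the victim's feature space."""
    A = mfcc(src_path)
    return model.predict(A.T).T
```

A real attack would additionally invert the converted features back to audio with a vocoder; the sketch stops at the feature mapping, which is where the speaker identity lives.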


Speaker Verification

• Machine-based Speaker Verification (e.g., [Douglas et al., DSP, 2006])
  • A 2-class problem: accept or reject the claimant
  • The system creates a model of a speaker in the training phase, against which the claimant is verified in the testing phase (decision rule below)
• Human-based Speaker Verification
  • A human user serves as the verifier
  • Implicit in arbitrary communication
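Machine-based verification ultimately reduces to a likelihood-ratio threshold test. As a reference point (this is the textbook UBM-GMM decision rule, not a detail taken from this deck): given a claimed speaker's model λ_speaker, a universal background model λ_UBM, and test frames X = (x_1, …, x_T), accept iff

```latex
% Accept the claimant iff the average per-frame log-likelihood ratio
% of the test utterance clears a tuned threshold \theta:
\Lambda(X) \;=\; \frac{1}{T}\sum_{t=1}^{T}
  \Big[\log p(x_t \mid \lambda_{\mathrm{speaker}})
     - \log p(x_t \mid \lambda_{\mathrm{UBM}})\Big]
\;\ge\; \theta
```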


Our Contributions

• We study voice impersonation attacks
• We evaluate attack feasibility against state-of-the-art automated speaker verification algorithms as well as manual verification
• Our attacks represent realistic settings and are practical:
  • We use an off-the-shelf voice morphing engine
  • We use a very small amount of training speech for voice conversion: approx. 6-8 minutes
  • Most of the training samples are recorded using low-end devices such as smartphones and laptops


System and Threat Model

Phase I: Collecting Audio Samples

• The attacker gathers the target's (T) audio samples OT = (t1…tn) via audio recording, wiretapping, or social media

Phase II: Building Voice Morphing Model

• Attacker's (source S) voice: OS = (s1…sn), the same utterances as OT
• Training: a conversion model M = µ(OS, OT) is learned, mapping the source voice to the target voice
• Conversion: any utterance A = (a1…am) spoken by the attacker yields fT = M(A) = (f1…fm), the fake utterance A in Bob's voice

Phase III: Attacking Applications with Morphed Voices

• The morphed samples are played against machine-based and human-based speaker verification ("I am Bob" -> access granted?)


Experiments and Measures

• Benign Setting: test samples spoken by the original speaker
• Attack Setting
  • Different Speaker Attack
  • Conversion Attack
• Metrics Used (computed as in the sketch below)
  • False Rejection Rate (FRR): fraction of genuine samples rejected in the benign setting
  • False Acceptance Rate (FAR): fraction of attack samples accepted in the attack setting
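A minimal sketch of how these two rates fall out of verifier scores, assuming a verifier that accepts any score at or above a threshold (function and variable names are illustrative, not the paper's evaluation code):

```python
# Compute FRR over genuine trials and FAR over attack trials
# at a fixed acceptance threshold.
import numpy as np

def frr_far(genuine_scores, attack_scores, threshold):
    """Scores >= threshold are accepted by the verifier."""
    genuine = np.asarray(genuine_scores)
    attack = np.asarray(attack_scores)
    frr = np.mean(genuine < threshold)   # genuine samples wrongly rejected
    far = np.mean(attack >= threshold)   # attack samples wrongly accepted
    return frr, far

# Example: frr, far = frr_far([2.1, 1.8, 0.4], [0.3, 1.9], threshold=1.0)
```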


Attacking Machine-based Speaker Verification

Tools and Algorithms

• Festvox Voice Conversion System
• Bob Spear Speaker Verification System [E. Khoury; ICASSP, 2014]
  • UBM-GMM: a modeling technique over spectral features; verification scores the log-likelihood of the claimant's Gaussian mixture model against a universal background model (sketched below)
  • ISV (Inter-Session Variability): an improvement to UBM-GMM in which a speaker's variability due to age, surroundings, etc. is compensated for, giving better performance for the same user across scenarios
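A minimal sketch of the UBM-GMM scoring idea using scikit-learn (illustrative only: Bob Spear's real pipeline derives the speaker model by MAP-adapting the UBM and applies feature and score normalization; here the speaker GMM is simply fit directly):

```python
# Toy UBM-GMM verification (illustrative; not Bob Spear's implementation).
from sklearn.mixture import GaussianMixture

def train_ubm(background_frames, n_components=64):
    """Universal background model: a GMM fit on pooled feature frames
    (rows = frames, columns = spectral features) from many speakers."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(background_frames)

def train_speaker(speaker_frames, n_components=64):
    """Toy stand-in for MAP adaptation: fit the claimant's GMM directly."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(speaker_frames)

def verify(speaker_gmm, ubm, test_frames, threshold=0.0):
    """Accept iff the average per-frame log-likelihood ratio clears the threshold."""
    llr = speaker_gmm.score(test_frames) - ubm.score(test_frames)
    return llr >= threshold
```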

Datasets

• Voxforge
  • Recorded using standard recording devices; sample length: 5 secs
  • 28 speakers chosen (all male)
• MOBIO
  • Recorded using laptop microphones; sample length: 7-30 secs
  • 152 speakers (99 male, 53 female)


Conversion Attack Setup

• Voxforge
  • Attacker: 1 male speaker (CMU Arctic)
  • Victims: 8 speakers
  • Training: 100 samples of 5 secs each (≈ 8 mins of speech)
• MOBIO
  • Attackers: 6 male and 3 female speakers
  • Victims: 32 male and 17 female speakers
  • Training: 12 samples of 30 secs each (≈ 6 mins of speech)

CMU Arctic Databases: http://festvox.org/cmu_arctic/index.html


Different Speaker Attack Setup

• Testing Voxforge: original samples were swapped with samples spoken by each of the chosen CMU Arctic speakers

• Testing MOBIO: original samples were swapped with other speakers' samples

Results

[Attack-success table lost in extraction: Yes/No acceptance entries for the different speaker and conversion attacks against each verifier and dataset.]


Attacking Human-based Speaker Verification

User Studies

• Famous Speaker Study: attackers mimic celebrities; users must identify the celebrities' genuine samples

• Briefly Familiar Speaker Study: attackers mimic (briefly familiarized) speakers; users must identify those speakers' genuine samples

• Study Platform: Amazon Mechanical Turk (M-Turk)
• # of Participants: 65 and 32 online M-Turk users (for the two studies, respectively)
• Related work: prior work [Shirvanian-Saxena; CCS'14] studied "Short Authenticated Strings"; we look at arbitrary speech


Famous Speaker Study Setup

• Samples collected using an application published on M-Turk
  • 5 female speakers mimicked Oprah Winfrey (100 samples)
  • 5 male speakers mimicked Morgan Freeman (100 samples)
• Users listen to a 2-min speech by Oprah and Morgan, followed by several benign and attack challenges
• Speaker Verification Test: identify the original speaker
• Voice Similarity Test: rate the similarity of each voice to the original speaker

Attack Setup

• Different Speaker Attack
  • Female M-Turk speakers for Oprah
  • Male M-Turk speakers for Morgan
• Conversion Attack
  • # of training samples: 100 sentences of 4 secs each
  • Source: male/female M-Turk speakers
  • Target: Oprah/Morgan


Tests

• Speaker Verification Test
  • Question: Is the speaker Oprah/Morgan?
  • Answer options: Yes, No, Not Sure
• Voice Similarity Test
  • Question: How similar is each sample to Oprah/Morgan?
  • Answer options: exactly similar, very similar, somehow similar, not very similar, different


Briefly Familiar Study Setup

• Male and female M-Turk speakers as victims (from the previous dataset)
• 90-sec clips of each victim's voice played for familiarization
• Speaker Verification Test (as before)
• Voice Similarity Test (as before)


Attack Setup

• Different Speaker Attack
  • Female M-Turk speakers for female victims
  • Male M-Turk speakers for male victims
• Conversion Attack
  • Source: female/male M-Turk speakers
  • Target: female/male M-Turk speakers


Results: Speaker Verification Test

[Verification-accuracy figures not recoverable from the transcript; see Conclusions for summary numbers.]

Results: Voice Similarity Test (Famous Speaker Study)

Oprah:

• Original Speaker: 88.08% found “exactly similar” or “very similar”

• Different Speaker: 86.81% found “different” or “not very similar”

• Conversion Attack: 74.10% rated “somehow similar” or “very similar”

Morgan:

• Original Speaker: 95.77% found “exactly similar” or “very similar”

• Different Speaker: 94.36% found “different” or “not very similar”

• Conversion Attack: 59.74% rated “somehow similar” or “very similar”

Results: Voice Similarity Test (Briefly Familiar Speaker Study)

• Original Speaker: 88.08% found “exactly similar” or “very similar”

• Different Speaker: 86.81% found “different” or “not very similar”

• Conversion Attack: 74.10% rated “somehow similar” or “very similar”

Conclusions

• The conversion attack succeeds roughly 80-90% of the time against state-of-the-art speaker verification algorithms

• In about 50% of cases, human verifiers were fooled by morphed samples

• Attacks against human verifiers will improve as voice conversion/synthesis techniques continue to improve

Limitations and Future Work

• We used only one known state-of-the-art biometric speaker verification system and an off-the-shelf voice conversion tool

• The likelihood of accepting an attack sample may be even higher in real life, as people may not pay due attention

• Attacks might fare better against human subjects with hearing impairments

• The current study does not tell us how the attacks would work in other scenarios, such as faking real-time communication or faking court evidence

Thank You!