
Motivation Method Experiment

Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers

Ji Gao, Jack Lanchantin, Mary Lou Soffa, Yanjun Qi

University of Virginia
http://trustworthymachinelearning.org/

1st Deep Learning and Security Workshop, 2018


Outline

1 Motivation
    White box vs. black box

2 Method
    Word scorer
    Word transformer

3 Experiment

4 Conclusions


Example of black-box classification systems

Google Perspective API


Target scenario


An example of DeepWordBug

Goal: Flip the prediction of a sentiment analyzer


Algorithm

Challenges of language tasks

Adversarial examples

Suppose a deep learning classifier F(·) : X → Y and an original sample x. An adversarial example x′ in an untargeted attack satisfies:

x′ = x + ∆x,  ‖∆x‖p < ε,  x′ ∈ X,  F(x) ≠ F(x′)

When X is symbolic:

How to perturb x?

No metric for measuring ∆x


Our setting

∆x = Edit distance(x, x′)
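Since ∆x is measured by edit distance, here is a minimal sketch of the standard Levenshtein distance (the unit-cost model, where each insertion, deletion, and substitution costs 1, is an assumption; the slides do not fix a cost model):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]
```

For example, the substitution "Team" → "Texm" used later in the deck has edit distance 1, while the character swap "Comptuer" counts as 2 under this cost model.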


DeepWordBug

1. Scoring - Find important words to change

2. Transformation - Generate modifications on the words of top importance.

∆x = Edit distance(x, x′) = Σ_{i ∈ selected words} Edit distance(x_i, x′_i)
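The two-stage attack can be sketched as below. `classifier`, `score_fn`, and `transform` are hypothetical placeholders for the black-box model, the scoring function of the next step, and the character transformer described later:

```python
def deepwordbug(x, classifier, score_fn, transform, max_edits=5):
    """Sketch of the two-stage black-box attack:
    1. score every word with a black-box scoring function,
    2. transform the highest-scoring words until the label flips
       or the edit budget is exhausted."""
    words = x.split()
    original_label = classifier(words)
    # Stage 1: score each word once, then rank by importance.
    ranked = sorted(range(len(words)),
                    key=lambda i: score_fn(words, i), reverse=True)
    # Stage 2: perturb the top-ranked words one at a time.
    for i in ranked[:max_edits]:
        words[i] = transform(words[i])
        if classifier(words) != original_label:
            break  # attack succeeded
    return " ".join(words)
```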


Step 1: Scoring function

Goal: Select important words

The proposed scoring functions have the following properties:

1 Correctly reflect the importance of words
2 Black-box
3 Efficient to calculate

Temporal Head Score

Temporal Tail score


Combined score
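The three scores can be sketched as below, assuming `prob` queries the black-box model and returns the predicted probability of the current class for a word sequence; `gamma` is a weighting hyperparameter whose default here is an assumption:

```python
def temporal_head(prob, words, i):
    """Change in the predicted probability when word i is appended
    to the prefix: F(w_1..w_i) - F(w_1..w_{i-1})."""
    return prob(words[: i + 1]) - prob(words[:i])

def temporal_tail(prob, words, i):
    """Symmetric score on the suffix: F(w_i..w_n) - F(w_{i+1}..w_n)."""
    return prob(words[i:]) - prob(words[i + 1:])

def combined_score(prob, words, i, gamma=1.0):
    """Weighted sum of the head and tail scores."""
    return temporal_head(prob, words, i) + gamma * temporal_tail(prob, words, i)
```

All three need only prediction probabilities on partial inputs, which keeps the attack black-box and costs a fixed number of queries per word.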


Step 2: Ranking and transformation

Calculate the scoring function for all words in the input once.

Rank all the words according to the scores.


Step 3: Word Transformer

Original    Substitution  Swapping   Deletion  Insertion
Team        Texm          Taem       Tem       Tezam
Artist      Arxist        Artsit     Artst     Articst
Computer    Computnr      Comptuer   Compter   Comnputer

Aim I: Machine-learning based classifier views generated words as “unknown”.

Aim II: Control the edit distance of the modification
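The four transformations can be sketched as below. Restricting edits to interior positions (so the first and last letters stay fixed, as in the table's examples) is an implementation choice in this sketch, not something the slides prescribe:

```python
import random

def transform(word, mode, rng=random.Random(0)):
    """Apply one character-level perturbation at a random interior
    position, leaving the first and last letters unchanged."""
    if len(word) <= 3:
        return word  # too short to perturb safely
    i = rng.randrange(1, len(word) - 2)  # interior index
    if mode == "swap":        # swap two adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if mode == "substitute":  # replace one character with a random letter
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]
    if mode == "delete":      # remove one character
        return word[:i] + word[i + 1:]
    if mode == "insert":      # insert a random letter
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]
    raise ValueError(mode)
```

Each mode changes the edit distance by exactly 1 (or 2 for a swap under unit-cost Levenshtein), which is what lets the attack control ∆x per selected word.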


Summary


Dataset

Dataset                 #Training   #Testing  #Classes  Task
AG's News                 120,000      7,600         4  News Categorization
Amazon Review Full      3,000,000    650,000         5  Sentiment Analysis
Amazon Review Polarity  3,600,000    400,000         2  Sentiment Analysis
DBPedia                   560,000     70,000        14  Ontology Classification
Yahoo! Answers          1,400,000     60,000        10  Topic Classification
Yelp Review Full          650,000     50,000         5  Sentiment Analysis
Yelp Review Polarity      560,000     38,000         2  Sentiment Analysis
Enron Spam Email           26,972      6,744         2  Spam E-mail Detection


Methods in comparison

Random (baseline): Random selection of words. Similar to (Papernot et al. 2016).

Gradient (baseline): White-box method. Judges the importance of a word by the magnitude of the gradient (Samanta & Mehta 2017).

DeepWordBug (our method): Uses three different scoring functions: Temporal Head, Temporal Tail, and Combined.


Main result: Effectiveness of adversarial samples (average)

Relative performance decrease (%) by word-selection method; the last four are DeepWordBug scoring functions:

Random           6.82%
Gradient        16.36%
Replace-1       63.02%
Temporal Head   44.40%
Temporal Tail   68.05%
Combined        64.38%

Question: Are the generated adversarial samples transferable to other models?

Adversarial samples generated on one model can be successfully transferred between models, reducing the model accuracy from around 90% to 20-50%.


Question: How do different transformer functions perform?

Varying the transformation function has only a small effect on the attack performance.


Question: How strong are the adversarial samples generated?

The generated adversarial samples successfully make the machine learning model believe a wrong answer with probability 0.9.


Defense: by Adversarial training

Accuracy and adversarial accuracy over adversarial-training epochs 0-10:

Epoch              0      1      2      3      4      5      6      7      8      9      10
Accuracy           88.5%  85.9%  87.3%  87.6%  87.5%  87.4%  87.4%  87.6%  87.5%  86.8%  87.0%
Adversarial acc.   11.9%  30.2%  45.0%  52.4%  57.1%  58.8%  58.8%  59.9%  60.5%  61.6%  62.7%

Retrain the model with adversarial samples.

Accuracy on raw inputs slightly decreases;

Accuracy on the adversarial samples rapidly increases from around 12% (before the training) to 62% (after the training).
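The retraining loop behind these numbers can be sketched generically; `model.fit` and `attack` are assumed interfaces for illustration, not APIs from the paper:

```python
def adversarial_training(model, train_data, attack, epochs=10):
    """Each epoch: generate adversarial samples against the current
    model, then retrain on the original data plus the adversarial
    copies (labeled with their original, correct labels)."""
    for _ in range(epochs):
        adv = [(attack(model, x), y) for x, y in train_data]
        model.fit(train_data + adv)
    return model
```

Because the adversarial samples are regenerated against the current model each epoch, the defense tracks the attack as the model changes, which matches the steady rise in adversarial accuracy across epochs.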


Defense: by an autocorrector?

Transformation   Original   Attack   Defended with Autocorrector
Swap             88.45%     14.77%   77.34%
Substitute       88.45%     12.28%   74.85%
Remove           88.45%     14.06%   62.43%
Insert           88.45%     12.28%   82.07%
Substitute-2     88.45%     11.90%   54.54%
Remove-2         88.45%     14.25%   33.67%

While the spellchecker reduces the effectiveness of the adversarial samples, stronger attacks such as removing 2 characters in every selected word can still successfully reduce the model accuracy to 34%.


Related Works


Papernot et al. 2016. Iteratively:

    Pick words randomly
    Apply a gradient-based algorithm directly on the word embedding
    Project to the nearest word

Samanta & Mehta 2017. Iteratively:

    Pick important words using the gradient
    Generate linguistic-based modifications on the words

Summary: White-box and costly


Conclusion

Black-box: DeepWordBug generates adversarial samples in a pure black-box manner.

Performance: Reduces the performance of state-of-the-art deep learning models by up to 80%.

Transferability: Adversarial samples generated on one model can be successfully transferred to other models, reducing the target model accuracy from around 90% to 20-50%.


References

Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).

Papernot, Nicolas, et al. "Crafting adversarial input sequences for recurrent neural networks." Military Communications Conference (MILCOM 2016). IEEE, 2016.

Samanta, Suranjana, and Sameep Mehta. "Towards Crafting Text Adversarial Samples." arXiv preprint arXiv:1707.02812 (2017).

Zhang, Xiang, Junbo Zhao, and Yann LeCun. "Character-level convolutional networks for text classification." Advances in Neural Information Processing Systems. 2015.

Rayner, Keith, Sarah J. White, and S. P. Liversedge. "Raeding wrods with jubmled lettres: There is a cost." (2006).


Why is the Word Transformer Effective?

Does not guarantee the original word will be changed to “unknown”, but the failure chance is very slight.

Suppose the longest word in the dictionary has length l; there are roughly 27^l possible letter sequences of length ≤ l.

Let l = 8 and |D| = 20000. The chance that a changed word is not “unknown” is roughly 20000 / 27^8 ≈ 0.00000007.
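A quick sanity computation of that bound (the 27 symbols are assumed to be the 26 letters plus one extra character such as a blank):

```python
# Collision probability: a random letter sequence of length <= l = 8
# lands on one of the |D| = 20000 dictionary words.
l, D = 8, 20000
p = D / 27 ** l
print(p)  # roughly 7.1e-08, i.e. about 0.00000007
```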


Why current scoring functions?

For a single step, the Replace-1 score gives the best approximation.

However, globally it’s not optimal.

Example:

Here, the Temporal Tail score gives a better result than Replace-1.