
BAE: BERT-based Adversarial Examples for Text Classification

Siddhant Garg∗  Goutham Ramakrishnan∗
University of Wisconsin-Madison

{sidgarg,gouthamr}@cs.wisc.edu

Abstract

Modern text classification models are susceptible to adversarial examples, perturbed versions of the original text indiscernible by humans but which get misclassified by the model. We present BAE, a powerful black-box attack for generating grammatically correct and semantically coherent adversarial examples. BAE replaces and inserts tokens in the original text by masking a portion of the text and leveraging a language model to generate alternatives for the masked tokens. Compared to prior work, we show that BAE performs a stronger attack on three widely used models for seven text classification datasets.

1 Introduction

Recent studies have shown the vulnerability of ML models to adversarial attacks, small perturbations which lead to misclassification of inputs. Adversarial example generation in NLP (Zhang et al., 2019) is more challenging than in common computer vision tasks (Szegedy et al., 2014; Kurakin et al., 2017; Papernot et al., 2017) for two main reasons: the discrete nature of the input space and the need to ensure semantic coherence with the original sentence. A major bottleneck in applying gradient-based (Goodfellow et al., 2015) or generator-model-based (Zhao et al., 2018) approaches to generate adversarial examples in NLP is the backward propagation of the perturbations from the continuous embedding space to the discrete token space.

Recent works for attacking text models rely on introducing errors at the character level in words (Ebrahimi et al., 2018; Gao et al., 2018) or on adding and deleting words (Li et al., 2016; Liang et al., 2017; Feng et al., 2018) to create adversarial examples. These techniques often result in adversarial examples which are unnatural looking and lack grammatical correctness, and can thus be easily identified by humans.

∗ Equal contributions by both authors

Figure 1: We use BERT-MLM to predict masked tokens in a given text for generating adversarial examples. The MASK token replaces a given word (BAE-R attack) or is inserted to the left/right of the word (BAE-I attack). Actual predictions made by the BERT-MLM are shown (they fit well into the sentence context).

TextFooler (Jin et al., 2019) is a black-box attack that uses rule-based synonym replacement from a fixed word embedding space to generate adversarial examples. These adversarial examples do not account for the overall semantics of the sentence, and consider only token-level similarity using word embeddings. This can lead to out-of-context and unnaturally complex replacements (see Table 3), which can be easily identified by humans.

The recent advent of powerful language models (Devlin et al., 2018; Radford et al., 2019) in NLP has paved the way for using them in various downstream applications. In this paper, we present a simple yet novel technique: BAE (BERT-based Adversarial Examples), which uses a language model (LM) for token replacement to best fit the overall context. We perturb an input sentence by either replacing a token or inserting a new token in the sentence, by means of masking a part of the input and using a LM to fill in the mask (see Figure 1). BAE relies on the powerful BERT masked LM for ensuring grammatical correctness of the adversarial examples. Our attack beats the previous baselines by a large margin and confirms the inherent vulnerability of modern text classification models to adversarial attacks. Moreover, BAE produces richer and more natural looking adversarial examples as it uses the semantics learned by a LM.

To the best of our knowledge, we are the first to use a LM for adversarial example generation. We summarize our major contributions as follows:

• We propose BAE, a novel strategy for generating natural looking adversarial examples using a masked language model.

• We introduce 4 BAE attack modes, all of which are almost always stronger than previous baselines on 7 text classification datasets.

• We show that, surprisingly, just a few replace/insert operations can reduce the accuracy of even a powerful BERT-based classifier by over 80% on some datasets.

2 Methodology

Problem Definition We are given a dataset (S, Y) = {(S_1, y_1), (S_2, y_2), ..., (S_m, y_m)} and a trained classification model C : S → Y. We assume the soft-label black-box setting, where the attacker can only query the classifier for output probabilities on a given input, and does not have access to the model parameters, gradients or training data. For an input pair (S, y), we want to generate an adversarial example S_adv such that C(S_adv) ≠ y, where S_adv is natural looking, grammatically correct and semantically similar to S (by some pre-defined definition of similarity).
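To make the threat model concrete, here is a minimal sketch of the attacker-facing interface assumed in the snippets below; the wrapper class and its method names are our own and purely illustrative, not part of the paper.

```python
from typing import Callable, Dict, List, Sequence

class BlackBoxClassifier:
    """Soft-label black-box view of the victim model C: the attacker sees
    output probabilities only, never parameters, gradients or training data."""

    def __init__(self, predict_proba: Callable[[str], Sequence[float]], labels: List[str]):
        self._predict_proba = predict_proba  # e.g. a wrapper around a remote prediction endpoint
        self.labels = labels

    def query(self, sentence: str) -> Dict[str, float]:
        # The only signal available to the attacker: P(label | sentence).
        return dict(zip(self.labels, self._predict_proba(sentence)))

    def predict(self, sentence: str) -> str:
        scores = self.query(sentence)
        return max(scores, key=scores.get)
```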

BAE For generating an adversarial example S_adv, we define two perturbations on the input S:

• Replace a token t ∈ S with another
• Insert a new token t′ in S

Some tokens in the input are more attended to by C than others, and therefore contribute more towards the final prediction. Replacing these tokens or inserting a new token adjacent to them can thus have a stronger effect on altering the classifier prediction. We estimate the token importance I_i of each token t_i ∈ S = [t_1, ..., t_n] by deleting t_i from S and computing the decrease in probability of predicting the correct label y, similar to (Jin et al., 2019).
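A minimal sketch of this leave-one-out importance estimate, reusing the hypothetical BlackBoxClassifier interface above (whitespace tokenization is a simplification):

```python
from typing import List

def token_importance(tokens: List[str], label: str, clf: "BlackBoxClassifier") -> List[float]:
    """I_i = P(y | S) - P(y | S with t_i deleted): how much the correct-label
    probability drops when token t_i is removed."""
    base = clf.query(" ".join(tokens))[label]
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(base - clf.query(" ".join(reduced))[label])
    return scores

# Tokens are then perturbed in decreasing order of importance, e.g.:
# order = sorted(range(len(tokens)), key=lambda i: -scores[i])
```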

While the motivation for replacing tokens in decreasing order of importance is clear, we conjecture that adjacent insertions in this same order can lead to a powerful attack. This intuition stems from the fact that the inserted token changes the local context around the original token.

The Replace (R) and Insert (I) operations are performed on a token t by masking it, or by inserting a mask token adjacent to it in S, respectively. The pre-trained BERT masked language model (MLM) is used to predict the mask tokens (see Figure 1).

BERT is a powerful LM trained on a large training corpus (~2 billion words), and hence the predicted mask tokens fit well grammatically in S. The BERT-MLM does not, however, guarantee semantic coherence with the original text S, as demonstrated by the following simple example. Consider the sentence: ‘the food was good’. For replacing the token ‘good’, BERT-MLM may predict the tokens ‘nice’ and ‘bad’, both of which fit well into the context of the sentence. However, replacing ‘good’ with ‘bad’ changes the original sentiment of the sentence.
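To illustrate the masking step on this example, here is a sketch using the HuggingFace transformers fill-mask pipeline; this is our choice of tooling for illustration, not necessarily the authors' implementation.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Replace (R): mask the target token 'good' itself.
replace_candidates = unmasker("the food was [MASK]", top_k=5)
# Insert (I): insert a mask to the left of 'good' instead.
insert_candidates = unmasker("the food was [MASK] good", top_k=5)

for c in replace_candidates:
    # Each candidate carries a token string and an MLM score; as in the example
    # above, both sentiment-preserving and sentiment-flipping words can appear.
    print(c["token_str"], round(c["score"], 3))
```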

To ensure semantic similarity on introducing perturbations in the input text, we filter the set of top-K masked tokens (K is a pre-defined constant) predicted by BERT-MLM using a Universal Sentence Encoder (USE) (Cer et al., 2018) based sentence similarity scorer. For the R operations, we add an additional check for grammatical correctness of the generated adversarial example by filtering out predicted tokens that do not form the same part of speech (POS) as the original token t_i in the sentence.
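A sketch of this filtering step under some assumptions of ours: we load USE from TensorFlow Hub and use NLTK's tagger for the POS check (the paper specifies USE and a POS match but not the tooling), and tagging the candidate token in isolation rather than in its sentence context is a simplification.

```python
import numpy as np
import tensorflow_hub as hub
import nltk  # requires nltk.download("averaged_perceptron_tagger") once

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def use_similarity(a: str, b: str) -> float:
    """Cosine similarity between USE embeddings of two sentences."""
    va, vb = use([a, b]).numpy()
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def filter_candidates(original, perturbed_texts, orig_token, candidate_tokens,
                      sim_threshold=0.8, check_pos=True):
    """Keep candidates whose perturbed sentence stays close to the original
    (USE similarity) and, for R operations, whose POS matches the original token's."""
    orig_pos = nltk.pos_tag([orig_token])[0][1]  # simplification: token tagged out of context
    kept = []
    for tok, text in zip(candidate_tokens, perturbed_texts):
        if use_similarity(original, text) < sim_threshold:
            continue
        if check_pos and nltk.pos_tag([tok])[0][1] != orig_pos:
            continue
        kept.append((tok, text))
    return kept
```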

To choose the token for a perturbation (R/I) that best attacks the model from the filtered set of predicted tokens:

• If multiple tokens cause C to misclassify S when they replace the mask, we choose the token which makes S_adv most similar to the original S based on the USE score.

• If no token causes misclassification, we choose the perturbation that decreases the prediction probability P(C(S_adv) = y) the most.

The perturbations are applied iteratively to the input tokens in decreasing order of importance, until either C(S_adv) ≠ y (successful attack) or all the tokens of S have been perturbed (failed attack).
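A sketch of this selection rule, again in terms of the hypothetical helpers from the earlier snippets (use_similarity and the BlackBoxClassifier wrapper):

```python
def choose_perturbation(original, label, filtered, clf):
    """filtered: list of (token, perturbed_text) pairs surviving the USE/POS
    filters. Returns (chosen_text, attack_succeeded)."""
    if not filtered:
        return original, False
    flips = [(tok, text) for tok, text in filtered if clf.predict(text) != label]
    if flips:
        # Several candidates flip the prediction: keep the one most similar to S.
        _, best_text = max(flips, key=lambda p: use_similarity(original, p[1]))
        return best_text, True
    # Nothing flips the prediction yet: keep the candidate that lowers P(y) the most.
    _, best_text = min(filtered, key=lambda p: clf.query(p[1])[label])
    return best_text, False
```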

We present 4 attack modes for BAE based on the R and I operations, where for each token t in S:

• BAE-R: Replace token t (see Algorithm 1)
• BAE-I: Insert a token to the left or right of t
• BAE-R/I: Either replace token t or insert a token to the left or right of t
• BAE-R+I: First replace token t, then insert a token to the left or right of t


Model      Attack        Amazon        Yelp          IMDB          MR             MPQA           Subj          TREC

Word-LSTM  Original      88.0          85.0          82.0          81.16          89.43          91.9          90.2
           TextFooler    31.0 (0.747)  28.0 (0.829)  20.0 (0.828)  25.49 (0.906)  48.49 (0.745)  58.5 (0.882)  42.4 (0.834)
           BAE-R         21.0          20.0          22.0          24.17          45.66          50.2          32.4
           BAE-I         17.0          22.0          23.0          19.11          40.94          49.8          18.0
           BAE-R/I       16.0          19.0          8.0           15.08          31.60          43.1          20.4
           BAE-R+I       4.0 (0.848)   9.0 (0.902)   5.0 (0.871)   7.50 (0.935)   25.57 (0.766)  29.0 (0.929)  11.8 (0.874)

Word-CNN   Original      82.0          85.0          81.0          76.66          89.06          91.3          93.2
           TextFooler    42.0 (0.776)  36.0 (0.827)  31.0 (0.854)  21.18 (0.910)  48.77 (0.733)  58.9 (0.889)  47.6 (0.812)
           BAE-R         16.0          23.0          23.0          20.81          44.43          51.0          29.6
           BAE-I         18.0          26.0          29.0          19.49          44.43          49.8          15.4
           BAE-R/I       13.0          17.0          20.0          15.56          32.17          41.5          13.0
           BAE-R+I       2.0 (0.859)   9.0 (0.891)   14.0 (0.861)  7.87 (0.938)   27.83 (0.764)  31.1 (0.922)  8.4 (0.858)

BERT       Original      96.0          95.0          85.0          85.28          90.66          97.0          97.6
           TextFooler    30.0 (0.787)  27.0 (0.833)  32.0 (0.877)  30.74 (0.902)  36.23 (0.761)  69.5 (0.858)  42.8 (0.866)
           BAE-R         36.0          31.0          46.0          44.05          43.87          77.2          37.2
           BAE-I         20.0          25.0          31.0          32.05          33.49          74.6          32.2
           BAE-R/I       11.0 (0.899)  16.0          22.0          20.34          24.53          64.0          23.6
           BAE-R+I       14.0          12.0 (0.871)  16.0 (0.856)  19.21 (0.917)  24.34 (0.766)  58.5 (0.875)  20.2 (0.825)

Table 2: Test accuracies under adversarial attack across models and datasets. The average semantic similarity between the original and adversarial examples is shown in parentheses for the baseline and the strongest BAE attack.

Algorithm 1: BAE-R Pseudocode

Input: Sentence S = [t_1, ..., t_n], ground truth label y, classifier model C
Output: Adversarial example S_adv

Initialization: S_adv ← S
Compute token importance I_i for all t_i ∈ S
for i in descending order of I_i do
    S_M ← S_adv[1:i-1] [M] S_adv[i+1:n]
    Predict top-K tokens T for the mask M in S_M
    T ← FILTER(T)
    L ← {}
    for t ∈ T do
        L[t] ← S_adv[1:i-1] [t] S_adv[i+1:n]
    end
    if ∃ t ∈ T such that C(L[t]) ≠ y then
        return S_adv ← L[t′], where C(L[t′]) ≠ y and L[t′] has maximum similarity with S
    else
        S_adv ← L[t′], where L[t′] causes the maximum reduction in the probability of label y under C
    end
end
return S_adv ← None

3 Experiments

Datasets and Models We evaluate our adversarial attacks on different text classification datasets from tasks such as sentiment classification, subjectivity detection and question type classification. Amazon, Yelp and IMDB are sentence-level sentiment classification datasets which have been used in recent work (Sarma et al., 2018), while MR (Pang and Lee, 2005) contains movie reviews based on sentiment polarity. MPQA (Wiebe and Wilson, 2005) is a dataset for opinion polarity detection, Subj (Pang and Lee, 2004) is for classifying a sentence as subjective or objective, and TREC (Li and Roth, 2002) is a dataset for question type classification.

We use 3 popular text classification models: word-LSTM (Hochreiter and Schmidhuber, 1997), word-CNN (Kim, 2014) and a fine-tuned BERT (Devlin et al., 2018) base-uncased classifier. For each dataset, we train the model on the training data and perform the adversarial attack on the test data. For complete model details, refer to the Appendix.

As a baseline, we consider TextFooler (Jin et al., 2019), which performs synonym replacement using a fixed word embedding space (Mrksic et al., 2016). We only consider the top K=50 synonyms from the MLM predictions and set a threshold of 0.8 for the cosine similarity between USE-based embeddings of the adversarial and input text.

Dataset   # Classes   Train   Test   Avg Length
Amazon    2           900     100    10.29
Yelp      2           900     100    11.66
IMDB      2           900     100    17.56
MR        2           9595    1067   20.04
MPQA      2           9543    1060   3.24
Subj      2           9000    1000   23.46
TREC      6           5951    500    7.57

Table 1: Summary statistics for the datasets.

Results We perform the 4 modes of our attack and summarize the results in Table 2. Across datasets and models, our BAE attacks are almost always more effective than the baseline attack, achieving significant drops of 40-80% in test accuracies, with higher average semantic similarities, as shown in parentheses. BAE-R+I is the strongest attack since it allows both replacement and insertion at the same token position, with just one exception. We observe a general trend that the BAE-R and BAE-I attacks often perform comparably, while the BAE-R/I and BAE-R+I attacks are much stronger. We also observe that the BERT-based classifier is more robust to the BAE and TextFooler attacks than the word-LSTM and word-CNN models, which can be attributed to its large size and pre-training on a large corpus.

Figure 2: Performance of the attacks on the TREC question classification dataset as a function of maximum % perturbation to the input, for (a) Word-CNN and (b) BERT. All BAE attacks outperform the baseline TextFooler.

The baseline attack is often stronger than the BAE-R and BAE-I attacks for the BERT-based classifier. We attribute this to the shared parameter space between the BERT-MLM and the BERT classifier before fine-tuning. The predicted tokens from BERT-MLM may not drastically change the internal representations learned by the BERT classifier, hindering their ability to adversarially affect the classifier prediction.

Effectiveness We study the effectiveness of BAE when limiting the number of R/I operations permitted on the original text. We plot the attack performance as a function of maximum % perturbation (ratio of the number of word replacements and insertions to the length of the original text) for the TREC dataset. From Figure 2, we clearly observe that the BAE attacks are consistently stronger than TextFooler. The classifier models are relatively robust to perturbations up to 20%, while the effectiveness saturates at 40-50%. Surprisingly, a 50% perturbation for the TREC dataset translates to replacing or inserting just 3-4 words, due to the short text lengths.

Qualitative Examples We present adversarial examples generated by the attacks on a sentence from the IMDB and Yelp datasets in Table 3. BAE produces more natural looking examples than TextFooler, as tokens predicted by the BERT-MLM fit well in the sentence context. TextFooler tends to replace words with complex synonyms, which can be easily detected. Moreover, BAE's additional degree of freedom to insert tokens allows for a successful attack with fewer perturbations.

Original [Positive Sentiment]: This film offers many delights and surprises.
TEXTFOOLER: This flick citations disparate revel and surprises.
BAE-R: This movie offers enough delights and surprises
BAE-I: This lovely film platform offers many pleasant delights and surprises
BAE-R/I: This lovely film serves several pleasure and surprises.
BAE-R+I: This beautiful movie offers many pleasant delights and surprises.

Original [Positive Sentiment]: Our server was great and we had perfect service.
TEXTFOOLER: Our server was tremendous and we assumed faultless services.
BAE-R: Our server was decent and we had outstanding service.
BAE-I: Our server was great enough and we had perfect service but.
BAE-R/I: Our server was great enough and we needed perfect service but.
BAE-R+I: Our server was decent company and we had adequate service.

Table 3: Qualitative examples of each attack on the BERT classifier (Replacements: Red, Inserts: Blue).

Human Evaluation We consider successful adversarial examples generated from the Amazon and IMDB datasets and verify their sentiment and grammatical correctness. Human evaluators annotated the sentiment and the grammar (Likert scale of 1-5) of randomly shuffled adversarial examples and original texts. From Table 4, BAE and TextFooler have inferior accuracies compared to the Original, showing they are not always perfect. However, BAE has much better grammar scores, suggesting more natural looking adversarial examples.

            Sentiment Accuracy (%)         Avg. Grammar Score (1-5)
Dataset     Original   TF     BAE-R+I      Original   TF     BAE-R+I
Amazon      96.5       78.8   82.9         4.38       3.11   3.73
IMDB        90.5       83.6   78.9         4.42       3.36   3.69

Table 4: Human evaluation of accuracies and grammaticality scores. TF refers to TextFooler.

Ablation Study We analyze the benefits of the R/I operations in BAE in Table 5. The splits A and B are the % of test points which compulsorily need I and R operations, respectively, for a successful attack. We observe that split A is larger than split B, thereby indicating the importance of the I operation over R. Test points in split C require both R and I operations for a successful attack. Interestingly, split C is largest for Subj, which is the most robust to attack (Table 2) and hence needs both R and I operations. Thus, this study gives positive insights towards the importance of having the flexibility to both replace and insert words.

            Word-LSTM            Word-CNN             BERT
Dataset     A      B      C      A      B      C      A      B      C
MR          15.1   10.1   3.1    12.4   9.6    2.8    24.3   12.9   5.7
Subj        14.4   12.3   5.1    16.2   13.8   7.4    13.9   11.4   7.5
TREC        16.6   1.6    0.2    20.0   5.0    1.4    14.0   8.6    2.4

Table 5: A denotes the % of test instances which are successfully attacked by BAE-R/I but not by BAE-R, i.e., (R/I) ∩ ¬R; B: (R/I) ∩ ¬I; C: (R/I) ∩ ¬R ∩ ¬I.
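The splits of Table 5 are plain set operations over the indices of successfully attacked test points; a toy illustration follows (the index sets below are made up for the example):

```python
# Hypothetical sets of test-point indices broken by each attack mode.
attacked_by_R  = {0, 1, 2, 5}
attacked_by_I  = {1, 2, 3, 6}
attacked_by_RI = {0, 1, 2, 3, 4, 5, 6}
num_test = 10

A = attacked_by_RI - attacked_by_R                   # compulsorily need an Insert
B = attacked_by_RI - attacked_by_I                   # compulsorily need a Replace
C = attacked_by_RI - attacked_by_R - attacked_by_I   # need both operations

for name, split in [("A", A), ("B", B), ("C", C)]:
    print(name, 100 * len(split) / num_test, "% of test points")
```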

Refer to the Appendix for additional results, effectiveness graphs and details of the human evaluation.


4 Conclusion

In this paper, we have presented a novel technique for generating adversarial examples (BAE) based on a language model. The results obtained on several text classification datasets demonstrate the strength and effectiveness of our attack.

References

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.
Shi Feng, Eric Wallace, Mohit Iyyer, Pedro Rodriguez, Alvin Grissom II, and Jordan L. Boyd-Graber. 2018. Right answer for the wrong reason: Discovery and mitigation. CoRR, abs/1804.07781.
Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. CoRR, abs/1801.04354.
Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is BERT really robust? Natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1746–1751, Doha, Qatar.
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2017. Adversarial examples in the physical world. ICLR Workshop.
Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. CoRR, abs/1612.08220.
Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING '02), Volume 1, pages 1–7, Stroudsburg, PA, USA. Association for Computational Linguistics.
Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep text classification can be fooled. CoRR, abs/1704.08006.
Nikola Mrksic, Diarmuid O Seaghdha, Blaise Thomson, Milica Gasic, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of HLT-NAACL.
Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity. In Proceedings of ACL, pages 271–278.
Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL, pages 115–124.
Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (ASIA CCS '17), pages 506–519, New York, NY, USA. ACM.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Prathusha K Sarma, Yingyu Liang, and Bill Sethares. 2018. Domain adapted word embeddings for improved sentiment classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 37–42.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations.
Janyce Wiebe and Theresa Wilson. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.
Wei Emma Zhang, Quan Z. Sheng, and Ahoud Abdulrahmn F. Alhazmi. 2019. Generating textual adversarial examples for deep learning models: A survey. CoRR, abs/1901.06796.
Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In International Conference on Learning Representations.


A Appendix

A.1 Datasets and Models

The dataset statistics are reported in the main paper; here we give a brief overview of each dataset and the task for which it is used.

• Amazon: An Amazon product reviews dataset with ‘Positive’ / ‘Negative’ reviews [1].

• Yelp: A restaurant reviews dataset from Yelp with ‘Positive’ or ‘Negative’ reviews [1].

• IMDB: An IMDB movie reviews dataset with ‘Positive’ or ‘Negative’ reviews [1].

• MR: A movie reviews dataset based on subjective rating and sentiment polarity [2].

• MPQA: An unbalanced dataset for polarity detection of opinions [3].

• TREC: A dataset for classifying types of questions, with 6 classes for questions about a person, location, numeric information, etc. [4].

• Subj: A dataset for classifying a sentence as objective or subjective.

A.2 Training Details

On the sentence classification task, we target three models: a word-based convolutional neural network (Word-CNN), a word-based LSTM, and the state-of-the-art BERT. We use 100 filters of sizes 3, 4 and 5 for the Word-CNN model. Similar to Jin et al. (2019), we use a 1-layer bi-directional LSTM with 150 hidden units and a dropout of 0.3. For both models, we use the 300-dimensional pre-trained counter-fitted word embeddings. We use the BERT base-uncased model, which has 12 layers, 12 attention heads and a hidden dimension of 768.
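For concreteness, here is a PyTorch sketch of a Word-CNN classifier matching these hyperparameters; this is our own rendition under stated assumptions (the dropout rate is only specified for the LSTM above, so reusing it here is an assumption, and the vocabulary size is a placeholder).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordCNN(nn.Module):
    """Word-CNN text classifier: 300-d word embeddings, 100 filters each of
    widths 3/4/5, max-over-time pooling, then a linear classifier."""

    def __init__(self, vocab_size: int, num_classes: int, emb_dim: int = 300,
                 num_filters: int = 100, filter_sizes=(3, 4, 5), dropout: float = 0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # initialized from counter-fitted vectors in practice
        self.convs = nn.ModuleList(nn.Conv1d(emb_dim, num_filters, k) for k in filter_sizes)
        self.dropout = nn.Dropout(dropout)  # assumption: rate borrowed from the LSTM setup
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:   # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)             # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))    # (batch, num_classes)
```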

A.3 Results

Table 1 is the full table of results, with all the semantic similarity scores in parentheses. Figures 3-9 are the complete set of graphs showing the attack effectiveness for all datasets.

[1] https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
[2] https://www.cs.cornell.edu/people/pabo/movie-review-data/
[3] http://mpqa.cs.pitt.edu/
[4] http://cogcomp.org/Data/QA/QC/

A.4 Human Evaluation

We consider the input examples for which the attack can successfully flip the label predicted by the classifier. We shuffle the original texts and the adversarial examples and give them to human annotators. We ask the human annotators to label each sentence on the following:

• Predicting the label of the sentence (positive or negative sentiment) for the dataset and task

• Grammatical correctness of the sentence on a Likert scale of 1-5: 1) Very poor grammatical coherence 2) Poor grammatical coherence 3) Average grammatical coherence 4) Good grammatical coherence 5) Very good grammatical coherence (natural looking sentence)

We average the scores from two independent human evaluators and present them in the table in the main paper.


Model      Attack        Amazon        Yelp          IMDB          MR             MPQA           Subj          TREC

Word-LSTM  Original      88.0          85.0          82.0          81.16          89.43          91.9          90.2
           TextFooler    31.0 (0.747)  28.0 (0.829)  20.0 (0.828)  25.49 (0.906)  48.49 (0.745)  58.5 (0.882)  42.4 (0.834)
           BAE-R         21.0 (0.827)  20.0 (0.885)  22.0 (0.852)  24.17 (0.914)  45.66 (0.748)  50.2 (0.899)  32.4 (0.870)
           BAE-I         17.0 (0.924)  22.0 (0.928)  23.0 (0.933)  19.11 (0.966)  40.94 (0.871)  49.8 (0.958)  18.0 (0.964)
           BAE-R/I       16.0 (0.902)  19.0 (0.924)  8.0 (0.896)   15.08 (0.949)  31.60 (0.820)  43.1 (0.946)  20.4 (0.954)
           BAE-R+I       4.0 (0.848)   9.0 (0.902)   5.0 (0.871)   7.50 (0.935)   25.57 (0.766)  29.0 (0.929)  11.8 (0.874)

Word-CNN   Original      82.0          85.0          81.0          76.66          89.06          91.3          93.2
           TextFooler    42.0 (0.776)  36.0 (0.827)  31.0 (0.854)  21.18 (0.910)  48.77 (0.733)  58.9 (0.889)  47.6 (0.812)
           BAE-R         16.0 (0.821)  23.0 (0.846)  23.0 (0.856)  20.81 (0.920)  44.43 (0.735)  51.0 (0.899)  29.6 (0.843)
           BAE-I         18.0 (0.934)  26.0 (0.941)  29.0 (0.924)  19.49 (0.971)  44.43 (0.876)  49.8 (0.958)  15.4 (0.953)
           BAE-R/I       13.0 (0.904)  17.0 (0.916)  20.0 (0.892)  15.56 (0.956)  32.17 (0.818)  41.5 (0.940)  13.0 (0.936)
           BAE-R+I       2.0 (0.859)   9.0 (0.891)   14.0 (0.861)  7.87 (0.938)   27.83 (0.764)  31.1 (0.922)  8.4 (0.858)

BERT       Original      96.0          95.0          85.0          85.28          90.66          97.0          97.6
           TextFooler    30.0 (0.787)  27.0 (0.833)  32.0 (0.877)  30.74 (0.902)  36.23 (0.761)  69.5 (0.858)  42.8 (0.866)
           BAE-R         36.0 (0.772)  31.0 (0.856)  46.0 (0.835)  44.05 (0.871)  43.87 (0.764)  77.2 (0.828)  37.2 (0.824)
           BAE-I         20.0 (0.922)  25.0 (0.936)  31.0 (0.929)  32.05 (0.958)  33.49 (0.862)  74.6 (0.918)  32.2 (0.931)
           BAE-R/I       11.0 (0.899)  16.0 (0.916)  22.0 (0.909)  20.34 (0.941)  24.53 (0.826)  64.0 (0.903)  23.6 (0.908)
           BAE-R+I       14.0 (0.830)  12.0 (0.871)  16.0 (0.856)  19.21 (0.917)  24.34 (0.766)  58.5 (0.875)  20.2 (0.825)

Table 1: Test accuracies under adversarial attack across models and datasets. The average semantic similarity between the original and adversarial examples is shown in parentheses.

Figure 3: Attack effectiveness graphs for Amazon: (a) Word-LSTM, (b) Word-CNN, (c) BERT.

Figure 4: Attack effectiveness graphs for Yelp: (a) Word-LSTM, (b) Word-CNN, (c) BERT.

Figure 5: Attack effectiveness graphs for IMDB: (a) Word-LSTM, (b) Word-CNN, (c) BERT.


Figure 6: Attack effectiveness graphs for MR: (a) Word-LSTM, (b) Word-CNN, (c) BERT.

Figure 7: Attack effectiveness graphs for MPQA: (a) Word-LSTM, (b) Word-CNN, (c) BERT.

Figure 8: Attack effectiveness graphs for Subj: (a) Word-LSTM, (b) Word-CNN, (c) BERT.

Figure 9: Attack effectiveness graphs for TREC: (a) Word-LSTM, (b) Word-CNN, (c) BERT.