
When Does Machine Learning FAIL? Generalized Transferability for Evasion and Poisoning Attacks

Octavian Suciu, Radu Marginean, Yigitcan Kaya, Hal Daume III, and Tudor Dumitras, University of Maryland

https://www.usenix.org/conference/usenixsecurity18/presentation/suciu

This paper is included in the Proceedings of the 27th USENIX Security Symposium.

August 15–17, 2018 • Baltimore, MD, USA

978-1-939133-04-5

When Does Machine Learning FAIL? Generalized Transferability for Evasion and Poisoning Attacks

Octavian Suciu, Radu Marginean, Yigitcan Kaya, Hal Daume III, Tudor Dumitras

University of Maryland, College Park

Abstract

Recent results suggest that attacks against supervised machine learning systems are quite effective, while defenses are easily bypassed by new attacks. However, the specifications for machine learning systems currently lack precise adversary definitions, and the existing attacks make diverse, potentially unrealistic assumptions about the strength of the adversary who launches them. We propose the FAIL attacker model, which describes the adversary's knowledge and control along four dimensions. Our model allows us to consider a wide range of weaker adversaries who have limited control and incomplete knowledge of the features, learning algorithms and training instances utilized.

To evaluate the utility of the FAIL model, we consider the problem of conducting targeted poisoning attacks in a realistic setting: the crafted poison samples must have clean labels, must be individually and collectively inconspicuous, and must exhibit a generalized form of transferability, defined by the FAIL model. By taking these constraints into account, we design StingRay, a targeted poisoning attack that is practical against 4 machine learning applications, which use 3 different learning algorithms, and can bypass 2 existing defenses. Conversely, we show that a prior evasion attack is less effective under generalized transferability. Such attack evaluations, under the FAIL adversary model, may also suggest promising directions for future defenses.

1 Introduction

Machine learning (ML) systems are widely deployed in safety-critical domains that carry incentives for potential adversaries, such as finance [14], medicine [18], the justice system [31], cybersecurity [1], or self-driving cars [6]. An ML classifier automatically learns classification models using labeled observations (samples) from a training set, without requiring predetermined rules for mapping inputs to labels. It can then apply these models to predict labels for new samples in a testing set. An adversary knows some or all of the ML system's parameters and uses this knowledge to craft training or testing samples that manipulate the decisions of the ML system according to the adversary's goal—for example, to avoid being sentenced by an ML-enhanced judge.

Recent work has focused primarily on evasion attacks [4, 44, 17, 50, 35, 9], which can induce a targeted misclassification on a specific sample. As illustrated in Figures 1a and 1b, these test time attacks work by mutating the target sample to push it across the model's decision boundary, without altering the training process or the decision boundary itself. They are not applicable in situations where the adversary does not control the target sample—for example, when she aims to influence a malware detector to block a benign app developed by a competitor. Prior research has also shown the feasibility of targeted poisoning attacks [34, 32]. As illustrated in Figure 1c, these attacks usually blend crafted instances into the training set to push the model's boundary toward the target. In consequence, they enable misclassifications for instances that the adversary cannot modify.

These attacks appear to be very effective, and the defenses proposed against them are often bypassed in follow-on work [8]. However, to understand the actual security threat introduced by them, we must model the capabilities and limitations of realistic adversaries. Evaluating poisoning and evasion attacks under assumptions that overestimate the capabilities of the adversary would lead to an inaccurate picture of the security threat posed to real-world applications. For example, test time attacks often assume white-box access to the victim classifier [9]. As most security-critical ML systems use proprietary models [1], these attacks might not reflect the actual capabilities of a potential adversary. Black-box attacks consider weaker adversaries, but they often make rigid assumptions about the adversary's knowledge when investigating the transferability of an attack.


[Figure 1 (panels a–d, not reproduced): training and testing instances, the target, an adversarial example, and the pristine vs. poisoned decision boundaries.]

Figure 1: Targeted attacks against machine learning classifiers. (a) The pristine classifier would correctly classify the target. (b) An evasion attack would modify the target to cross the decision boundary. (c) Correctly labeled poisoning instances change the learned decision boundary. (d) At testing time, the target is misclassified but other instances are correctly classified.

Transferability is a property of attack samples crafted locally, on a surrogate model that reflects the adversary's limited knowledge, allowing them to remain successful against the target model. Specifically, black-box attacks often investigate transferability in the case where the local and target models use different training algorithms [36]. In contrast, ML systems used in the security industry often resort to feature secrecy (rather than algorithmic secrecy) to protect themselves against attacks, e.g. by incorporating undisclosed features for malware detection [10].

In this paper, we make a first step towards modeling realistic adversaries who aim to conduct attacks against ML systems. To this end, we propose the FAIL model, a general framework for the analysis of ML attacks in settings with a variable amount of adversarial knowledge and control over the victim, along four tunable dimensions: Features, Algorithms, Instances, and Leverage. By preventing any implicit assumptions about the adversarial capabilities, the model is able to accurately highlight the success rate of a wide range of attacks in realistic scenarios and forms a common ground for modeling adversaries. Furthermore, the FAIL framework generalizes the transferability of attacks by providing a multidimensional basis for surrogate models. This provides insights into the constraints of realistic adversaries, which could be explored in future research on defenses against these attacks. For example, our evaluation suggests that crafting transferable samples with an existing evasion attack is more challenging than previously believed.

To evaluate the utility of the FAIL model, we consider the problem of conducting targeted poisoning attacks in a realistic setting. Specifically, we impose four constraints on the adversary. First, the poison samples must have clean labels, as the adversary can inject them into the training set of the model under attack but cannot determine how they are labeled. Second, the samples must be individually inconspicuous, i.e. very similar to the existing training instances, in order to prevent easy detection, while collectively pushing the model's boundary toward a target instance. Third, the samples must be collectively inconspicuous, bounding the collateral damage on the victim (Figure 1d). Finally, the poison samples must exhibit a generalized form of transferability, as the adversary tests the samples on a surrogate model, trained with partial knowledge along multiple dimensions, defined by the FAIL model.

By taking into account the goals, capabilities, and limitations of realistic adversaries, we also design StingRay, a targeted poisoning attack that can be applied in a broad range of settings.¹ Moreover, the StingRay attack is model agnostic: we describe concrete implementations against 4 ML systems, which use 3 different classification algorithms (convolutional neural network, linear SVM, and random forest). The instances crafted are able to bypass three anti-poisoning defenses, including one that we adapted to account for targeted attacks. By subjecting StingRay to the FAIL analysis, we obtain insights into the transferability of targeted poison samples, and we highlight promising directions for investigating defenses against this threat.

In summary, this paper makes three contributions:

• We introduce the FAIL model, a general framework for modeling realistic adversaries and evaluating their impact. The model generalizes the transferability of attacks against ML systems, across various levels of adversarial knowledge and control. We show that a previous black-box evasion attack is less effective under generalized transferability.

• We propose StingRay, a targeted poisoning attack that overcomes the limitations of prior attacks. StingRay is effective against 4 real-world classification tasks, even when launched by a range of weaker adversaries within the FAIL model. The attack also bypasses two existing anti-poisoning defenses.

• We systematically explore realistic adversarial scenarios and the effect of partial adversary knowledge and control on the resilience of ML models against a test-time attack and a training-time attack. Our results provide insights into the transferability of attacks across the FAIL dimensions and highlight potential directions for investigating defenses against these attacks.

¹ Our implementation code can be found at https://github.com/sdsatumd



This paper is organized as follows. In Section 2 we formalize the problem and our threat model. In Section 3 we introduce the FAIL attacker model. In Section 4 we describe the StingRay attack and its implementation. We present our experimental results in Section 5, review the related work in Section 6, and discuss the implications in Section 7.

2 Problem Statement

The lack of a unifying threat model that captures the dimensions of adversarial knowledge has caused existing work to diverge in terms of adversary specifications. Prior work defined adversaries with inconsistent capabilities. For example, in [36] a black-box adversary possesses knowledge of the full feature representations, whereas its counterpart in [50] only assumes access to the raw data (i.e. before feature extraction).

Compared to existing white-box or black-box models, real adversaries tend to be more nuanced. A commercial ML-based malware detector [1] can rely on a publicly known architecture with proprietary data collected from end hosts, and a mixture of known features (e.g. system calls of a binary) and undisclosed features (e.g. reputation scores of the binary). Existing adversary definitions are too rigid and cannot account for realistic adversaries against such applications. In this paper, we ask: how can we systematically model adversaries based on realistic assumptions about their capabilities?

Some of the recent evasion attacks [28, 36] investigate the transferability property of their solutions. Proven transferability increases the strength of an attack, as it allows adversaries with limited knowledge of, or access to, the victim system to craft effective instances. Furthermore, transferability hinders defense strategies as it renders secrecy ineffective. However, existing work generally investigates transferability along single dimensions (e.g. limiting the adversarial knowledge about the victim algorithm). This weak notion of transferability limits the understanding of actual attack capabilities on real systems and fails to shed light on potential avenues for defenses. This paper aims to provide a means to define and evaluate a more general transferability, across a wide range of adversary models. The generalized view of threat models highlights limitations of existing training-time attacks. Existing attacks [51, 29, 20] often assume full control over the training process of victim classifiers and have similar shortcomings to white-box attacks. Those that do not assume full control generally omit important adversarial considerations.

Targeted poisoning attacks [34, 32, 11] require control of the labeling process. However, an attacker is often unable to determine the labels assigned to the poison samples in the training set—consider a case where a malware creator may provide a poison sample for the training set of an ML-based malware detector, but its malicious/benign label will be assigned by the engineers who train the detector. These attacks risk being detected by existing defenses, as they might craft samples that stand out from the rest of the training set. Moreover, they also risk causing collateral damage to the classifier; for example, in Figure 1c the attack can trigger the misclassification of additional samples from the target's true class if the boundary is not molded to include only the target. Such collateral damage reduces the trust in the classifier's predictions, and thus the potential impact of the attack. Therefore, we aim to observe whether an attack could address these limitations and to answer the question: how realistic is the targeted poisoning threat?

Machine learning background. For our purpose, a classifier (or hypothesis) is a function h : X → Y that maps instances to labels to perform classification. An instance x ∈ X is an entity (e.g., a binary program) that must receive a label y ∈ Y = {y0, y1, ..., ym} (e.g., reflecting whether the binary is malicious). We represent an instance as a vector x = (x1, ..., xn), where the features reflect attributes of the artifact (e.g. APIs invoked by the binary). A function D(x, x′) represents the distance in the feature space between two instances x, x′ ∈ X. The function h can be viewed as a separator between the malicious and benign classes in the feature space X; the plane of separation between classes is called the decision boundary. The training set S ⊂ X includes instances that have known labels YS ⊂ Y. The labels for instances in S are assigned using an oracle — for a malware classifier, an oracle could be an antivirus service such as VirusTotal, whereas for an image classifier it might be a human annotator. The testing set T ⊂ X includes instances for which the labels are unknown to the learning algorithm.

Threat model. We focus on targeted poisoning attacks against machine learning classifiers. In this setting, we refer to the victim classifier as Alice, the owner of the target instance as Bob, and the attacker as Mallory. Bob and Mallory could also represent the same entity. Bob possesses an instance t ∈ T with label yt, called the target, which will get classified by Alice. For example, Bob develops a benign application, and he ensures it is not flagged by an oracle antivirus such as VirusTotal. Bob's expectation is that Alice would not flag the instance either. Indeed, the target would be correctly classified by Alice after learning a hypothesis using a pristine training set S* (i.e. h* = A(S*), h*(t) = yt).


Mallory has partial knowledge of Alice's classifier and read-only access to the target's feature representation, but they do not control either t or the natural label yt, which is assigned by the oracle. Mallory pursues two goals. The first goal is to introduce a targeted misclassification on the target by deriving a training set S from S*: h = A(S), h(t) = yd, where yd is Mallory's desired label for t. In binary classification, this translates to causing a false positive (FP) or false negative (FN). An example of an FP would be a benign email message that is classified as spam, while an FN might be a malicious sample that is not detected. In a multiclass setting, Mallory causes the target to be labeled as a class of choice. Mallory's second goal is to minimize the effect of the attack on Alice's overall classification performance. To quantify this collateral damage, we introduce the Performance Drop Ratio (PDR), a metric that reflects the performance hit suffered by a classifier after poisoning. This is defined as the ratio between the performance of the poisoned classifier and that of the pristine classifier: PDR = performance(h) / performance(h*). The metric encodes the fact that for a low-error classifier, Mallory could afford a smaller performance drop before raising suspicions.
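
To make the metric concrete, the following minimal sketch computes the PDR on a holdout set; the metric choice (here scikit-learn's F1 score) and the classifier objects are placeholders rather than the paper's released code.

    # Sketch: Performance Drop Ratio, PDR = performance(h) / performance(h*)
    from sklearn.metrics import f1_score

    def pdr(poisoned_clf, pristine_clf, X_test, y_test, metric=f1_score):
        perf_poisoned = metric(y_test, poisoned_clf.predict(X_test))
        perf_pristine = metric(y_test, pristine_clf.predict(X_test))
        return perf_poisoned / perf_pristine

    # A PDR close to 1.0 indicates little collateral damage; lower values
    # indicate a performance drop that may raise suspicion.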

3 Modeling Realistic Adversaries

Knowledge and Capabilities. Realistic adversaries conducting training time or testing time attacks are constrained by an imperfect knowledge about the model under attack and by limited capabilities in crafting adversarial samples. For an attack to be successful, samples crafted under these conditions must transfer to the original model. We formalize the adversary's strength in the FAIL attacker model, which describes the adversary's knowledge and capabilities along 4 dimensions (a short code sketch of these dimensions follows the list):

• Feature knowledge R = {xi : xi ∈ x, xi is readable}: the subset of features known to the adversary.

• Algorithm knowledge A′: the learning algorithm that the adversary uses to craft poison samples.

• Instance knowledge S′: the labeled training instances available to the adversary.

• Leverage W = {xi : xi ∈ x, xi is writable}: the subset of features that the adversary can modify.
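
As a concrete illustration, the four dimensions can be encoded as a configuration object in an experiment harness. The sketch below is ours, with hypothetical field names; it is not an interface defined by the paper.

    # Hypothetical container for a FAIL adversary configuration (illustrative only).
    from dataclasses import dataclass
    from typing import Any, Callable, Sequence

    @dataclass
    class FAILAdversary:
        readable_features: Sequence[int]  # F: indices of the features known to the adversary (R)
        algorithm: Callable[..., Any]     # A: the surrogate learning algorithm A'
        known_instances: Sequence[int]    # I: indices of the labeled training instances available (S')
        writable_features: Sequence[int]  # L: indices of the features the adversary can modify (W)

Restricting any of these fields relative to the victim's true configuration yields one of the weaker adversaries discussed below.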

The F and A dimensions constrain the attacker's understanding of the hypothesis space. Without knowing the victim classifier A, the attacker would have to select an alternative learning algorithm A′ and hope that the evasion or poison samples crafted for models created by A′ transfer to models from A. Similarly, if some features are unknown (i.e., partial feature knowledge), the model used for crafting instances is an approximation of the original classifier. For classifiers that learn a representation of the input features (such as neural networks), limiting the F dimension results in a different, approximate internal representation. These limitations result in an inaccurate assessment of the impact that the crafted instances will have and affect the success rate of the attack. The I dimension affects the accuracy of the adversary's view over the instance space. As S′ might be a subset or an approximation of S*, the poisoning and evasion samples might exploit gaps in the instance space that are not present in the victim's model. This, in turn, could lead the attacker to overestimate the attack's impact. Finally, the L dimension affects the adversary's ability to craft attack instances. The set of modifiable features restricts the regions of the feature space where the crafted instances could lie. For poisoning attacks, this places an upper bound on the ability of samples to shift the decision boundary, while for evasion it could affect their effectiveness. The read-only features can, in some cases, cancel out the effect of the modified ones. An adversary with partial leverage needs extra effort, e.g. to craft more instances (for poisoning) or to attack more of the modifiable features (for both poisoning and evasion).

Prior work has investigated transferability without modeling a full range of realistic adversaries across the FAIL dimensions. [36] focuses on the A dimension and proposes a transferable evasion attack across different neural network architectures. Transferability of poisoning samples in [33] is partially evaluated on the I and A dimensions. The evasion attack in [25] considers F, A and I under a coarse granularity, but omits the L dimension. ML-based systems employed in the security industry [21, 10, 45, 39, 12] often combine undisclosed and known features to render attacks more difficult. In this context, the systematic evaluation of transferability along the F and L dimensions is still an open question.

Constraints. The attacker's strategy is also influenced by a set of constraints that drive the attack design and implementation. While these are attack-dependent, we broadly classify them into three categories: success, defense, and budget constraints. Success constraints encode the attacker's goals and considerations that directly affect the effectiveness of the attack, such as the assessment of the target instance classification. Defense constraints refer to the attack characteristics aimed to circumvent existing defenses (e.g. the post-attack performance drop on the victim). Budget considerations address the limitations in an attacker's resources, such as the maximum number of poisoning instances or, for evasion attacks, the maximum number of queries to the victim model.


Implementing the FAIL dimensions. Performing empirical evaluations within the FAIL model requires further design choices that depend on the application domain and the attack surface of the system. To simulate weaker adversaries systematically, we formulate a questionnaire to guide the design of experiments focusing on each dimension of our model.

For the F dimension, we ask: What features could be kept as a secret? Could the attacker access the exact feature values? Feature subsets may not be publicly available (e.g. derived using a proprietary malware analysis tool, such as dynamic analysis in a contained environment), or they might be directly defined from instances not available to the attacker (e.g. low-frequency word features). Similarly, the exact feature values could be unknown (e.g. because of defensive feature squeezing [49]). Feature secrecy does not, however, imply the attacker's inability to modify them through an indirect process [25] or extract surrogate ones.

The questions related to the A dimension are: Is the algorithm class known? Is the training algorithm secret? Are the classifier parameters secret? These questions define the spectrum of adversarial knowledge with respect to the learning algorithm: black-box access, if this information is kept secret; gray-box, where the attacker has partial information about the algorithm class or the ensemble architecture; or white-box, for complete adversarial knowledge.

The I dimension controls the overlap between the instances available to the attacker and those used by the victim. Thus, here we ask: Is the entire training set known? Is the training set partially known? Are the instances known to the attacker sufficient to train a robust classifier? An application might use instances from the public domain (e.g. a vulnerability exploit predictor), and the attacker could leverage them to the full extent in order to derive their attack strategy. However, some applications, such as a malware detector, might rely on private or scarce instances that limit the attacker's knowledge of the instance space. The scarcity of these instances drives the robustness of the attacker's classifier, which in turn defines the perceived attack effectiveness. In some cases, the attacker might not have access to any of the original training instances, being forced to train a surrogate classifier on independently collected samples [50, 29].

The L dimension encodes the practical capabilities of the attacker when crafting attack samples. These are tightly linked to the attack constraints. However, rather than being preconditions, they act as degrees of freedom on the attack. Here we ask: Which features are modifiable by the attacker? and What side effects do the modifications have? For some applications, the attacker may not be able to modify certain types of features, either because they do not control the generating process (e.g. an exploit predictor that gathers features from multiple vulnerability databases) or because the modifications would compromise the instance integrity (e.g. a watermark on images that prevents the attacker from modifying certain features). In cases of dependence among features, targeting a specific set of features could have an indirect effect on others (e.g. an attacker injecting tweets to modify word feature distributions also changes features based on tweet counts).

Study                  |  F   |  A   |  I   |  L
Test Time Attacks:
Genetic Evasion [50]   | ✓,✓  | ✓,✓  | ✓,✗† | ✓,✓
Black-box Evasion [37] | ✗,∅* | ✓,✓  | ✓,✓  | ✗,∅*
Model Stealing [46]    | ✓,✓  | ✓,✓  | ✓,✓  | ✗,∅*
FGSM Evasion [17]      | ✗,∅* | ✗,∅* | ∅,∅  | ✗,∅*
Carlini's Evasion [9]  | ✗,∅* | ✓,✓  | ∅,∅  | ✗,∅*
Training Time Attacks:
SVM Poisoning [5]      | ✗,∅* | ✓,✗† | ∅,∅  | ✗,∅*
NN Poisoning [33]      | ✓,✗† | ✓,✓  | ✓,✓  | ✗,∅*
NN Backdoor [20]²      | ✓,✗† | ✓,✓  | ✓,✗† | ✓,✓
NN Trojan [29]         | ✓,✗  | ✓,✓  | ✓,✓  | ✓,✓

Table 1: FAIL analysis of existing attacks. For each attack, we analyze the adversary model and evaluation of the proposed technique. Each cell contains the answers to our two questions, AQ1 and AQ2: yes (✓), omitted (✗) and irrelevant (∅). We also flag implicit assumptions (*) and a missing evaluation (†).

Study                   |  F   |  A   |  I   |  L
Test Time Defenses:
Distillation [38]       | ✗,✓  | ✗,✓  | ✗,✗  | ✗,✗
Feature Squeezing [49]  | ✓,✓  | ✗,✗  | ✗,✗  | ✓,✓
Training Time Defenses:
RONI [34]               | ✗,✗  | ✗,✗  | ✓,✗  | ✗,✗
Certified Defense [42]  | ✗,✗  | ✗,✗  | ✓,✓  | ✗,✗

Table 2: FAIL analysis of existing defenses. We analyze a defense's approach to security: DQ1 (secrecy) and DQ2 (hardening). Each cell contains the answers to the two questions: yes (✓), and no (✗).


3.1 Unifying Threat Model Assumptions

Discordant threat model definitions result in implicit assumptions about adversarial limitations, some of which might not be realistic. The FAIL model allows us to systematically reason about such assumptions. To demonstrate its utility, we evaluate a body of existing studies by means of answering two questions for each work.

² Gu et al.'s study investigates a scenario where the attacker performs the training on behalf of the victim. Consequently, the attacker has full access to the model architecture, parameters, training set and feature representation. However, with the emergence of frameworks such as [16], even in this threat model, it might be possible that the attacker does not know the training set or the features.


To categorize existing attacks, we first inspect a threat model and ask: AQ1–Are bounds for attacker limitations specified along the dimension? The possible answers are: yes, omitted and irrelevant. For instance, the threat model in Carlini et al.'s evasion attack [9] specifies that the adversary requires complete knowledge of the model and its parameters, thus the answer is yes for the A dimension. In contrast, the analysis on the I dimension is irrelevant because the attack does not require access to the victim training set. However, the study does not discuss feature knowledge, therefore we mark the F dimension as omitted.

Our second question is: AQ2–Is the proposed technique evaluated along the dimension? This question becomes irrelevant if the threat model specifications are omitted or irrelevant. For example, Carlini et al. evaluated the transferability of their attack when the attacker does not know the target model parameters. This corresponds to the attacker's algorithm knowledge, therefore the answer is yes for the A dimension.

Applying the FAIL model reveals implicit assumptions in existing attacks. An implicit assumption exists if the attack limitations are not specified along a dimension. Furthermore, even with explicit assumptions, some studies do not evaluate all relevant dimensions. We present these findings about previous attacks within the FAIL model in Table 1.

When looking at existing defenses through the FAIL model, we aim to observe how they achieve security: either by hiding information or by limiting the attacker's capabilities. For defenses that involve creating knowledge asymmetry between the attackers and the defenders, i.e. secrecy, we ask: DQ1–Is the dimension employed as a mechanism for secrecy? For example, feature squeezing [49] employs feature reduction techniques unknown to the attacker; therefore the answer is yes for the F dimension.

In order to identify hardening dimensions, which attempt to limit the attack capabilities, we ask: DQ2–Is the dimension employed as a mechanism for hardening? For instance, the distillation defense [38] against evasion modifies the neural network weights to make the attack more difficult; therefore the answer is yes for the A dimension.

These defenses may come with inaccurate assessments of the adversarial capabilities and implicit assumptions. For example, distillation limits adversaries along the F and A dimensions, but employing a different attack strategy could bypass it [9]. For poisoning attacks, the RONI [34] defense assumes training set secrecy, but does not evaluate the threat posed by attackers with sufficient knowledge along the other dimensions. As our results will demonstrate, this implicit assumption allows attackers to bypass the defense while remaining within the secrecy bounds.

The results for the evaluated defenses are found in Table 2. The detailed evaluation process for each of these studies can be found in our technical report [43].

4 The StingRay Attack

Reasoning about implicit and explicit assumptions in prior defenses allows us to design algorithms which exploit their weaknesses. In this section, we introduce StingRay, one such attack that achieves targeted poisoning while preserving overall classification performance. StingRay is a general framework for crafting poison samples.

At a high level, our attack builds a set of poison instances by starting from base instances that are close to the target in the feature space but are labeled as the desired target label yd, as illustrated in the example from Figure 2. Here, the adversary has created a malicious Android app t, which includes suspicious features (e.g. the WRITE_CONTACTS permission on the left side of the figure), and wishes to prevent a malware detector from flagging this app. The adversary, therefore, selects a benign app xb as a base instance. To craft each poison instance, StingRay alters a subset of a base instance's features so that they resemble those of the target. As shown on the right side of Figure 2, these are not necessarily the most suspicious features, so that the crafted instance will likely be considered benign. Finally, StingRay filters crafted instances based on their negative impact on instances from S′, ensuring that their individual effect on the target classification performance is negligible. The sample crafting procedure is repeated until there are enough instances to trigger the misclassification of t. Algorithm 1 shows the pseudocode of the attack's two general-purpose procedures.

We describe concrete implementations of our attack against four existing applications: an image recognition system, an Android malware detector, a Twitter-based exploit predictor, and a data breach predictor. We re-implement the systems that are not publicly available, using the original classification algorithms and the original training sets to reproduce those systems as closely as possible. In total, our applications utilize three classification algorithms—convolutional neural network, linear SVM, and random forest—that have distinct characteristics. This spectrum illustrates the first challenge for our attack: identifying and encapsulating the application-specific steps in StingRay, to adopt a modular design with broad applicability. Making poisoning attacks practical raises additional challenges. For example, a naive approach would be to inject the target with the desired label into the training set: h(t) = yd (S.I). However, this is impractical because the adversary, under our threat model, does not control the labeling function.

1304 27th USENIX Security Symposium USENIX Association

Algorithm 1 The StingRay attack.

 1: procedure STINGRAY(S′, YS′, t, yt, yd)
 2:   I = ∅
 3:   h = A′(S′)
 4:   repeat
 5:     xb = GETBASEINSTANCE(S′, YS′, t, yt, yd)
 6:     xc = CRAFTINSTANCE(xb, t)
 7:     if GETNEGATIVEIMPACT(S′, xc) < τNI then
 8:       I = I ∪ {xc}
 9:       h = A′(S′ ∪ I)
10:   until (|I| > Nmin and h(t) = yd) or |I| > Nmax
11:   PDR = GETPDR(S′, YS′, I, yd)
12:   if h(t) ≠ yd or PDR < τPDR then
13:     return ∅
14:   return I

15: procedure GETBASEINSTANCE(S′, YS′, t, yt, yd)
16:   for xb, yb in SHUFFLE(S′, YS′) do
17:     if D(t, xb) < τD and yb = yd then
18:       return xb
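
The Python sketch below mirrors the control flow of Algorithm 1. The training, crafting, negative-impact, and PDR helpers are stubs standing in for the application-specific procedures described later, so this illustrates the structure of the attack rather than the authors' implementation.

    import random

    def stingray(S, Y, target, y_d, train, craft, neg_impact, get_pdr,
                 D, tau_D, tau_NI, tau_PDR, n_min, n_max):
        # S, Y: the attacker's training instances and labels (S', YS')
        poison = []                        # I: crafted poison instances
        h = train(S, Y)                    # surrogate model h = A'(S')
        while True:
            xb = get_base_instance(S, Y, target, y_d, D, tau_D)
            xc = craft(xb, target)         # CraftInstance: perturb xb toward the target
            if neg_impact(S, Y, xc, y_d) < tau_NI:
                poison.append(xc)
                h = train(S + poison, Y + [y_d] * len(poison))
            if (len(poison) > n_min and h.predict([target])[0] == y_d) \
                    or len(poison) > n_max:
                break
        if h.predict([target])[0] != y_d or get_pdr(S, Y, poison, y_d) < tau_PDR:
            return []                      # attack aborted: ineffective or too conspicuous
        return poison

    def get_base_instance(S, Y, target, y_d, D, tau_D):
        # Pick an instance that already carries the desired label and lies near the target.
        indices = list(range(len(S)))
        random.shuffle(indices)
        for i in indices:
            if Y[i] == y_d and D(target, S[i]) < tau_D:
                return S[i]
        raise ValueError("no base instance within tau_D of the target")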

Therefore, GETBASEINSTANCE works by selecting instances xb that already have the desired label and are close to the target in the feature space (S.II).

A more sophisticated approach would mutate these samples and use poison instances to push the model boundary toward the target's class [32]. However, these instances might resemble the target class too much, and they might not receive the desired label from the oracle or even get flagged by an outlier detector. In CRAFTINSTANCE, we apply tiny perturbations to the instances (D.III), and by checking the negative impact NI of crafted poisoning instances on the classifier (D.IV) we ensure they remain individually inconspicuous.

Mutating these instances with respect to the target [34] (as illustrated in Figure 1c) may still reduce the overall performance of the classifier (e.g. by causing the misclassification of additional samples similar to the target). We overcome this via GETPDR, by checking the performance drop caused by the attack samples (S.V), therefore ensuring that they remain collectively inconspicuous.

Even so, the StingRay attack adds robustness to the poison instances by crafting more instances than necessary, to overcome sampling-based defenses (D.VI). Nevertheless, the attack has a sampling budget that dictates the allowable number of crafted instances (B.VII). A detailed description of StingRay is found in Appendix A.

Attack Constraints. The attack presented above has a series of constraints that shape its effectiveness. Reasoning about them allows us to adapt StingRay to the specific restrictions on each application. These span all three categories identified in Section 3: Success (S.), Defense (D.) and Budget (B.):

S.I h(t) = yd: the desired class label for the target

S.II D(t, xb) < τD: the inter-instance distance metric

D.III s = (1/|I|) · Σ_{xc ∈ I} s(xc, t), where s(·,·) is a similarity metric: crafting target resemblance

D.IV NI < τNI: negative impact of poisoning instances

S.V PDR < τPDR: the perceived performance drop

D.VI |I| ≥ Nmin: the minimum number of poison instances

B.VII |I| ≤ Nmax: the maximum number of poisoning instances

The perceived success of the attacker's goals (S.I and S.V) dictates whether the attack is triggered. If the performance drop is large (i.e. the PDR is low), the attack might become indiscriminate and the risk of degrading the overall classifier's performance is high. The actual PDR can only be computed in the white-box setting. For scenarios with partial knowledge, it is approximated through the perceived PDR on the available classifier.

The impact of crafted instances is influenced by the distance metric and the feature space used to measure instance similarity (S.II). For applications that learn feature representations (e.g. neural nets), the similarity of learned features might be a better choice for minimizing the crafting effort.

The set of features that are actively modified by the attacker in the crafted instances (D.III) defines the target resemblance for the attacker, which imposes a trade-off between their inconspicuousness and the effectiveness of the sample. If this quantity is small, the crafted instances are less likely to be perceived as outliers, but a larger number of them is required to trigger the attack. A higher resemblance could also cause the oracle to assign crafted instances a different label than the one desired by the attacker.

The loss difference of a classifier trained with and without a crafted instance (D.IV) approximates the negative impact of that instance on the classifier. It may be easy for an attacker to craft instances with a high negative impact, but these instances may also be easy to detect using existing defenses.

In practice, the cost of injecting instances in the training set can be high (e.g. controlling a network of bots in order to send fake tweets), so the attacker aims to minimize the number of poison instances (D.VI) used in the attack. The adversary might also discard crafted instances that do not have the desired impact on the ML model.


Additionally, some poison instances might be filtered before being ingested by the victim classifier. However, if the number of crafted instances falls below a threshold Nmin, the attack will not succeed. The maximum number of instances that can be crafted (B.VII) influences the outcome of the attack. If the attacker is unable to find sufficient poison samples after crafting Nmax instances, they might conclude that the large fraction of poison instances in the training set would trigger suspicions or that they depleted the crafting budget.

Delivering Poisoning Instances. The mechanism through which poisoning instances are delivered to the victim classifier is dictated by the application characteristics and the adversarial knowledge. In the most general scenario, the attacker injects the crafted instances alongside existing ones, expecting that the victim classifier will be trained on them. For applications where models are updated over time or trained in mini-batches (such as an image classifier based on neural networks), the attacker only requires control over a subset of such batches and might choose to deliver poison instances through them. In cases where the attacker is unable to create new instances (such as a vulnerability exploit predictor), they will rely on modifying the features of existing ones by poisoning the feature extraction process. The applications we use to showcase StingRay highlight these scenarios and different attack design considerations.

4.1 Bypassing Anti-Poisoning Defenses

In this section, we discuss three defenses against poisoning attacks and how StingRay exploits their limitations.

The Micromodels defense was proposed for cleaning training data for network intrusion detectors [13]. The defense trains classifiers on non-overlapping epochs of the training set (micromodels) and evaluates them on the training set. By using a majority voting of the micromodels, training instances are marked as either safe or suspicious. The intuition is that attacks have a relatively low duration and could only affect a few micromodels. The defense also relies on the availability of accurate instance timestamps.
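
A minimal sketch of the micromodel voting scheme, assuming timestamped training data and a generic train()/predict() interface; the voting rule is a simplification of the defense described above.

    import numpy as np

    def micromodel_filter(X, y, timestamps, train, n_epochs=10, vote_threshold=0.5):
        # Partition the training set into non-overlapping time epochs and train
        # one micromodel per epoch (X, y are numpy arrays).
        order = np.argsort(timestamps)
        epochs = np.array_split(order, n_epochs)
        micromodels = [train(X[idx], y[idx]) for idx in epochs]

        # Each micromodel votes on every training instance; instances whose labels
        # disagree with most micromodels are marked as suspicious.
        agreement = np.mean([m.predict(X) == y for m in micromodels], axis=0)
        return agreement < vote_threshold   # True = suspicious, False = safe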

Reject on Negative Impact (RONI) was proposed against spam filter poisoning attacks [3]. It measures the incremental effect of each individual suspicious training instance and discards the ones with a relatively significant negative impact on the overall performance. RONI sets a threshold by observing the average negative impact of each instance in the training set and flags an instance when its performance impact exceeds the threshold. This threshold determines RONI's ultimate effectiveness and ability to identify poisoning samples. The defense also requires a sizable clean set for testing instances.

We adapted RONI to a more realistic scenario, assuming no clean holdout set, by implementing an iterative variant, as suggested in [41], that incrementally decreases the allowed performance degradation threshold. To the best of our knowledge, this version has not been implemented and evaluated before. However, RONI remains computationally inefficient, as the number of trained classifiers scales linearly with the training set size.

Target-aware RONI (tRONI) builds on the observation that RONI fails to mitigate targeted attacks [34] because the poison instances might not individually cause a significant performance drop. We propose a targeted variant which leverages prior knowledge about a test-time misclassification to determine the training instances that might have caused it. While RONI estimates the negative impact of an instance on a holdout set, tRONI considers its effect on the target classification alone. Therefore tRONI is only capable of identifying instances that distort the target classification significantly. A detailed description of this defense is available in the technical report [43].
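
The sketch below illustrates the tRONI idea of estimating each candidate instance's effect on the target classification alone, by comparing models trained with and without that instance; the sampling scheme and helper conventions are our own simplifications (see the technical report [43] for the actual procedure).

    import numpy as np

    def troni_filter(X, y, target, y_d, train, candidates, n_trials=5, threshold=0.5):
        # Flag candidate training instances whose inclusion pushes the target
        # toward the attacker's desired label y_d (X, y are numpy arrays).
        rng = np.random.default_rng(0)
        n, flagged = len(X), []
        for i in candidates:
            votes = 0
            for _ in range(n_trials):
                others = np.setdiff1d(np.arange(n), [i])
                subset = rng.choice(others, size=n // 2, replace=False)
                with_i = np.append(subset, i)
                pred_without = train(X[subset], y[subset]).predict([target])[0]
                pred_with = train(X[with_i], y[with_i]).predict([target])[0]
                votes += int(pred_with == y_d and pred_without != y_d)
            if votes / n_trials > threshold:
                flagged.append(i)           # instance distorts the target classification
        return flagged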

All these defenses aim to increase adversarial costs by forcing attackers to craft instances that result in a small loss difference (Cost D.IV). Therefore, they implicitly assume that poisoning instances stand out from the rest, and they negatively affect the victim classifier. However, attacks such as StingRay could exploit this assumption to evade detection by crafting a small number of inconspicuous poison samples.

4.2 Attack Implementation

We implement StingRay against four applications with distinct characteristics, each highlighting realistic constraints for the attacker. We omit certain technical details for space considerations, encouraging interested readers to consult the technical report [43].

Image classification. We first poison a neural-network (NN) based application for image classification, often used for demonstrating evasion attacks in the prior work. The input instances are images and the labels correspond to objects that are depicted in the image (e.g. airplane, dog, ship). We evaluate StingRay on our own implementation for CIFAR-10 [24]. 10,000 instances (1/6 of the data set) are used for validation and testing. In this scenario, the attacker has an image t with true label yt (e.g. a dog) and wishes to trick the model into classifying it as a specific class yd (e.g. a cat).

We implement a neural network architecture that achieves a performance comparable to other studies [38, 9], obtaining a validation accuracy of 78%. Once the network is trained on the benign inputs, we proceed to poison the classifier.


We generate and group poison instances into batches alongside benign inputs. We define γ ∈ [0,1] to be the mixing parameter which controls the number of poison instances in a batch. In our experiments we varied γ over {0.125, 0.5, 1.0} (i.e. 4, 16, and 32 instances of the batch are poison) and selected the value that provided the best attack success rate, keeping it fixed across successive updates. We then update³ the previously trained network using these batches until either the attack is perceived as successful or we exceed the number of available poisoning instances, dictated by the cut-off threshold Nmax. It is worth noting that if the learning rate is high and the batch contains too many poison instances, the attack could become indiscriminate. Conversely, too few crafted instances would not succeed in changing the target prediction, so the attacker needs to control more batches.
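
As an illustration of this delivery mechanism, the snippet below assembles one update mini-batch with a mixing parameter γ; the pools and the batch size of 32 match the setup above, but the function itself is a hypothetical stand-in for the training pipeline.

    import random

    def mixed_batch(clean_pool, poison_pool, gamma, batch_size=32):
        # gamma = 0.125, 0.5, 1.0 yields 4, 16, or 32 poison instances per batch of 32.
        n_poison = int(round(gamma * batch_size))
        batch = random.sample(poison_pool, n_poison) + \
                random.sample(clean_pool, batch_size - n_poison)
        random.shuffle(batch)              # avoid ordering artifacts during the update
        return batch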

The main insight that motivates our method for generating adversarial samples is that there exist inputs to a network x1, x2 whose distance in pixel space ||x1 − x2|| is much smaller than their distance in deep feature space ||Hi(x1) − Hi(x2)||, where Hi(x) is the value of the i-th hidden layer's activation for the input x. This insight is motivated by the very existence of test-time adversarial examples, where inputs to the classifier are very similar in pixel space, but are successfully misclassified by the neural network [4, 44, 17, 50, 37, 9]. Our attack consists of selecting base instances that are close to the target t in deep feature space, but are labeled by the oracle as the attacker's desired label yd. CRAFTINSTANCE creates poison images such that the distance to the target t in deep feature space is minimized and the resulting adversarial image is within τD distance in pixel space of t. Recent observations suggest that features in the deeper layers of neural networks are not transferable [52]. This suggests that the selection of the layer index i in the objective function offers a trade-off between attack transferability and the magnitude of perturbations in crafted images (Cost D.III). In our experiments we choose Hi to be the third convolutional layer.

We pick 100 target instances uniformly distributed across the class labels. The desired label yd is the one closest to the true label yt from the attacker's classifier point of view (i.e. it is the second best guess of the classifier). We set the cut-off threshold Nmax = 64, equivalent to two mini-batches of 32 examples. The perturbation is upper-bounded at τD < 3.5%, resulting in a target resemblance s < 110 pixels.

Android malware detection. The Drebin Android malware detector [2] uses a linear SVM classifier to predict if an application is malicious or benign. The Drebin data set consists of 123,453 Android apps, including 5,560 malware samples.

³ The update is performed on the entire network (i.e. all layers are updated).

[Figure 2 (not reproduced): the target t (malicious) lists features such as api_call::setWifiEnabled, permission::WRITE_CONTACTS, permission::ACCESS_WIFI_STATE and permission::READ_CONTACTS; the crafted poison xc (benign) lists intent::LAUNCHER, intent::MAIN, permission::ACCESS_WIFI_STATE, activity::MainActivity and permission::READ_CONTACTS. Legend: features tagged as suspicious by VT; features copied from t to xc.]

Figure 2: The sample crafting process illustrated for the Drebin Android malware detector. Suspicious features are emphasized in VirusTotal reports using an opaque internal process, but the attacker is not constrained to copying them.

These apps were labeled using 10 AV engines on VirusTotal [48], considering apps with at least two detections as malicious. The feature space has 545,333 dimensions. We use stratified sampling and split the data set into 60%-40% folds for training and testing respectively, aiming to mimic the original classifier. Our implementation achieves 94% F1 on the testing set. The features are extracted from the application archives (APKs) using two techniques. First, from the AndroidManifest XML file, which contains meta information about the app, the authors extract the permissions requested, the application components and the registered system callbacks. Second, after disassembling the dex file, which contains the app bytecode, the system extracts suspicious Android framework calls, actual permission usage and hardcoded URLs. The features are represented as bit vectors, where each index specifies whether the application contains a feature. The adversary aims to misclassify an Android app t. Although the problems of inducing a targeted false positive (FP) and a targeted false negative (FN) are analogous from the perspective of our definitions, in practice the adversary is likely more interested in targeted FNs, so we focus on this case in our experiments. We evaluate this attack by selecting target instances from the testing set that would be correctly labeled as malicious by the classifier. We then craft instances by adding active features (permissions, API calls, URL requests) from the target to existing benign instances, as illustrated in Figure 2. Each of the crafted apps will have a subset of the features of t, to remain individually inconspicuous. The poisoning instances are mixed with the pristine ones and used to train the victim classifier from scratch.
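
For the bit-vector representation used by Drebin, the crafting step amounts to copying a small subset of the target's active features into a benign base app. The numpy sketch below illustrates this; the parameter names are ours, and resemblance corresponds to the target resemblance of s = 10 features used below.

    import numpy as np

    def craft_drebin_poison(base_vec, target_vec, rng, resemblance=10):
        # base_vec and target_vec are 0/1 feature vectors (e.g. 545,333-dimensional).
        # Candidate features: active in the target but absent from the benign base.
        candidates = np.flatnonzero((target_vec == 1) & (base_vec == 0))
        chosen = rng.choice(candidates, size=min(resemblance, len(candidates)),
                            replace=False)
        poison = base_vec.copy()
        poison[chosen] = 1                 # the crafted app gains a subset of t's features
        return poison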

We craft 1,717 attacks against the Drebin classifier. We use a cutoff threshold Nmax = 425, which corresponds to 0.5% of the training set. The base instances are selected using the Manhattan distance D = l1, and each poisoning instance has a target resemblance of s = 10 features and a negative impact τNI < 50%.

Twitter-based exploit prediction. In [40], the authors present a system, based on a linear SVM, that predicts which vulnerabilities are going to be exploited, using features extracted from Twitter and public vulnerability databases.


For each vulnerability, the predictor extracts word-based features (e.g. the number of tweets containing the word code), Twitter statistics (e.g. the number of distinct users that tweeted about it), and domain-specific features for the vulnerability (e.g. the CVSS score). The data set contains 4,140 instances, out of which 268 are labeled as positive (a proof-of-concept exploit is publicly available). The classifier uses 72 features from 4 categories: CVSS Score, Vulnerability Database, Twitter traffic and Twitter word features. Due to the class imbalance, we use stratified samples of 60%–40% of the data set for training and testing respectively, obtaining a 40% testing F1.

The targeted attack selects a set I of vulnerabilities that are similar to t (e.g. same product or vulnerability category), have no known exploits, and gathered fewer tweets. It then proceeds to post crafted tweets about these vulnerabilities that include terms normally found in the tweets about the target vulnerability. In this manner, the classifier gradually learns that these terms indicate vulnerabilities that are not exploited. However, the attacker's leverage is limited, since the features extracted from sources other than Twitter are not under the attacker's control.

We simulate 1,932 attacks, setting Nmax = 20 and selecting the CVEs to be poisoned using the Euclidean distance D = l2, with τNI < 50%.

Data breach prediction. The fourth application we analyze is a data breach predictor proposed in [30]. The system attempts to predict whether an organization is going to suffer a data breach, using a random forest classifier. The features used in classification include indications of bad IT hygiene (e.g. misconfigured DNS servers) and malicious activity reports (e.g. blacklisting of IP addresses belonging to the organization). These features are absolute values (i.e. organization size), as well as time-series-based statistics (e.g. duration of attacks). The Data Breach Investigations Report (DBIR) [47] provides the ground truth. The classifier uses 2,292 instances with 382 positive-labeled examples. The 74 existing features are extracted from externally observable network misconfiguration symptoms as well as blacklisting information about hosts in an organization's network. A similar technique is used to compute the FICO Enterprise Security Score [15]. We use stratified sampling to build a training set containing 50% of the corpus and use the rest for testing and choosing targets for the attacks. The classifier achieves a 60% F1 score on the testing set.

In this case, the adversary plans to hack an organization t, but wants to avoid triggering an incident prediction despite the eventual blacklisting of the organization's IPs. In our simulation, we choose t from within organizations that were reported in DBIR and were not used at training time, being correctly classified at testing. The adversary chooses a set I of organizations that do not appear in the DBIR and modifies their feature representation. The attacker has limited leverage and is only able to influence time-series-based features indirectly, by injecting information into various blacklists.

We generate 2,002 attacks under two scenarios: the attacker has compromised a blacklist and is able to influence the features of many organizations, or the attacker has infiltrated a few organizations and uses them to modify their reputation on all the blacklists. We set Nmax = 50, and the instances to be poisoned are selected using the Euclidean distance D = l2, with τNI < 50%.

4.3 Practical Considerations

Running time of StingRay. The main computational expenses of StingRay are: crafting the instances in CRAFTINSTANCE, computing the distances to the target in GETBASEINSTANCE, and measuring the negative impact of the crafted instances in GETNEGATIVEIMPACT.

CRAFTINSTANCE depends on the crafting strategy and its complexity in searching for features to perturb. For the image classifier, we adapt an existing evasion attack, showing that we could reduce the computational cost by finding adversarial examples on hidden layers instead of the output layer. For all other applications we evaluated, the choice of features is determined in constant time.

The GETBASEINSTANCE procedure computes inter-instance distances once per attack, and it is linear in the size of the attacker's training set for a particular label. For larger data sets the distance computation could be approximated (e.g. using a low-rank approximation).

In GETNEGATIVEIMPACT, we obtain a good approximation of the negative impact (NI) by training locally-accurate classifiers on small instance subsets and performing the impact test on batches of crafted instances.

Labeling poisoning instances. Our attacker model assumes that the adversary does not control the oracle used for labeling poisoning instances. Although the attacker could craft poisoning instances that closely resemble the target t to make them more powerful, they could be flagged as outliers or the oracle could assign them a label that is detrimental for the attack. It is therefore beneficial to reason about the oracles specific to all applications and the mechanisms used by StingRay to obtain the desired labels.

For the image classifier, the most common oracle is a consensus of human analysts.


In an attempt to map the effect of adversarial perturbations on human perception, the authors of [35] found through a user study that the maximum fraction of perturbed pixels at which humans will correctly label an image is 14%. We therefore designed our experiments to remain within these bounds. Specifically, we measure the pixel space perturbation as the l∞ distance and discard poison samples with τD > 0.14 prior to adding them to I.
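
A small sketch of this filtering step, taking the l∞ measurement above at face value and assuming the perturbation is measured against the unmodified base image:

    import numpy as np

    def within_perceptual_bound(poison, base, tau_d=0.14):
        # Discard poison images whose pixel-space perturbation exceeds tau_d.
        return np.max(np.abs(poison - base)) <= tau_d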

The Drebin classifier uses VirusTotal as the oracle. In our experiments, the poison instances would need to maintain the benign label. We systematically create over 19,000 Android applications that correspond to attack instances and utilize VirusTotal, in the same way as Drebin does, to label them. To modify selected features of the Android apps, we reverse-engineer Drebin's feature extraction process to generate apps that would have the desired feature representation. We generate these applications for the scenario where only the subset of features extracted from the AndroidManifest is modifiable by the attacker, similar to prior work [19]. In 89% of these cases, the crafted apps bypassed detection, demonstrating the feasibility of our strategy for obtaining negatively labeled instances. However, in our attack scenario, we assume that the attacker does not consult the oracle, releasing all crafted instances as part of the attack.

For the exploit predictor, labeling is performed independently of the feature representations of the instances used by the system. The adversary manipulates the public discourse around existing vulnerabilities, but the label exists with respect to the availability of an exploit. Therefore the attacker has more degrees of freedom in modifying the features of instances in I, knowing that their desired labels will be preserved.

In the case of the data breach predictor, the attacker utilizes organizations with no known breach and aims to poison the blacklists that measure their hygiene, or hacks them directly. In the first scenario, the attacker does not require access to an organization's networks, therefore the label will remain intact. The second scenario would be more challenging, as the adversary would require extra capabilities to ensure they remain stealthy while conducting the attack.

5 Evaluation

We start by evaluating weaker evasion and poisoning adversaries, within the FAIL model, on the image and malware classifiers (Section 5.1). Then, we evaluate the effectiveness of existing defenses against StingRay (Section 5.2) and its applicability to a larger range of classifiers. Our evaluation seeks to answer four research questions: How could we systematically evaluate the transferability of existing evasion attacks? What are the limitations of realistic poisoning adversaries? When are targeted poison samples transferable? Is StingRay effective against multiple applications and defenses? We quantify the effectiveness of the evasion attack using the percentage of successful attacks (SR), while for StingRay we also measure the Performance Drop Ratio (PDR). We measure the PDR on holdout testing sets by considering either the average accuracy, on applications with balanced data sets, or the average F1 score (the harmonic mean between precision and recall), which is more appropriate for highly imbalanced data sets.

5.1 FAIL Analysis

In this section, we evaluate the image classifier and the malware detector using the FAIL framework. The model allows us to utilize both a state-of-the-art evasion attack and StingRay for this task. To control for additional confounding factors when evaluating StingRay, in this analysis we purposely omit the negative-impact-based pruning phase of the attack. We chose to conduct the FAIL analysis on these two applications because they do not present built-in leverage limitations and they have distinct characteristics.

Evasion attack on the image classifier. The first attack subjected to the FAIL analysis is JSMA [35], a well-known targeted evasion attack. Transferability of this attack has previously been studied only for an adversary with limited knowledge along the A and I dimensions [37]. We attempt to reuse an application configuration similar to that in prior work, implementing our own 3-layer convolutional neural network architecture for the MNIST handwritten digit data set [26]. The validation accuracy of our model is 98.95%. In Table 3, we present the average results of our 11 experiments, each involving 100 attacks.

For each experiment, the table reports the ∆ variation of the FAIL dimension investigated and two SR statistics: perceived (as observed by the attacker on their own classifier) and potential (the effect on the victim if all attempts are triggered by the attacker), as well as the mean perturbation τD introduced to the evasion instances.
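The distinction between the two SR statistics can be summarized in a short sketch (hypothetical scikit-learn-style classifier objects): the perceived SR is measured on the attacker's own model, while the potential SR is what the victim would observe if every crafted instance were triggered.

```python
import numpy as np

def perceived_and_potential_sr(attacker_clf, victim_clf, X_evasion, y_desired):
    """Perceived SR on the attacker's classifier vs. potential SR on the victim."""
    perceived = float(np.mean(attacker_clf.predict(X_evasion) == y_desired))
    potential = float(np.mean(victim_clf.predict(X_evasion) == y_desired))
    return perceived, potential
```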

Experiment #6 corresponds to the white-box adversary; we observe that this attacker could reach an 80% SR.

Experiments #1–2 model the scenario in which the attacker has limited Feature knowledge. Realistically, these scenarios can simulate an evasion or poisoning attack against a self-driving system, conducted without knowing the vehicle's camera angles (wide or narrow). We simulate this by cropping a frame of 3 and 6 pixels from the images, decreasing the available features by 32% and 62%, respectively. The attacker uses the cropped images for training and testing the classifier, as well as for crafting instances.


#    ∆        SR % (perceived/potential)   τD

1    32%      67/3                         0.070
2    62%      86/7                         0.054
3    shallow  99/10                        0.035
4    narrow   82/20                        0.027
5    35000    93/18                        0.032
6    50000    80/80                        0.026
7    45000    90/18                        0.029
8    50000    96/19                        0.034
9    18%      80/4                         0.011
10   41%      80/34                        0.022
11   62%      80/80                        0.026

Table 3: JSMA on the image classifier

#    ∆        SR %       PDR              Instances

FAIL: Unknown features
1    39%      87/63/67   0.93/0.96/0.96   8/4/10
2    66%      84/71/74   0.94/0.95/0.95   8/4/9

FAIL: Unknown algorithm
3    shallow  83/65/68   0.97/0.97/0.96   17/14/15
4    narrow   75/67/72   0.96/0.97/0.96   20/16/17

FAIL: Unavailable training set
5    35000    73/68/76   0.97/0.96/0.96   17/16/14
6    50000    78/70/74   0.97/0.97/0.97   18/16/15

FAIL: Unknown training set
7    45000    82/69/74   0.98/0.96/0.96   16/10/15
8    50000    70/62/68   0.95/0.96/0.96   17/8/17

FAIL: Read-only features
9    25%      80/70/72   0.97/0.97/0.97   19/16/15
10   50%      80/71/76   0.97/0.97/0.97   18/16/13
11   75%      83/78/79   0.97/0.97/0.96   16/16/12

Table 4: StingRay on the image classifier

#    ∆        SR %       PDR              Instances

1    109066   79/3/5     0.99/0.99/1.00   73/50/53
2    327199   77/12/13   0.99/0.99/1.00   51/50/15

3    SGD      42/33/42   0.99/0.99/0.99   65/50/31
4    dSVM     38/35/48   0.99/0.99/0.99   78/50/61

5    8514     69/27/27   0.90/0.99/0.99   57/50/42
6    85148    50/50/50   0.99/0.99/0.99   77/50/61

7    8514     53/21/24   0.93/0.99/1.00   62/50/49
8    43865    36/29/39   1.04/0.99/0.99   100/50/87

9    851      73/12/13   0.67/0.99/1.00   50/50/10
10   8514     49/16/17   0.90/0.99/1.00   61/50/47
11   85148    32/32/32   0.99/0.99/0.99   79/50/57

Table 5: StingRay on the malware classifier

Tables 3, 4, 5: FAIL analysis of the two applications. For each JSMA experiment, we report the attack SR (perceived/potential), as well as the mean perturbation τD introduced to the evasion instances. For each StingRay experiment, we report the SR and PDR (perceived/actual/potential), as well as statistics for the crafted instances on successful attacks (mean/median/standard deviation). ∆ represents the variation of the FAIL dimension investigated.

On the victim classifier, the cropped part of the images is added back without altering the perturbations.
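A minimal sketch of this simulation, assuming square images stored as NumPy arrays (the frame width and shapes are illustrative): the attacker crafts on the cropped view, and the victim evaluates the full image with the cropped border restored unchanged.

```python
import numpy as np

def attacker_view(image, k):
    """Attacker's limited-Feature view: drop a k-pixel frame around the image."""
    return image[k:-k, k:-k]

def victim_view(original_full, perturbed_crop, k):
    """Victim's view: re-insert the perturbed crop, leaving the border untouched."""
    restored = original_full.copy()
    restored[k:-k, k:-k] = perturbed_crop
    return restored
```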

With limited knowledge along this dimension (#1–2), the perceived success remains high, but the actual SR is very low. This suggests that the evasion attacks are very sensitive in such scenarios, highlighting a potential direction for future defenses.

We then model an attacker with limited Algorithm knowledge, possessing a similar architecture but with smaller network capacity. For the shallow network (#3), the attacker network has one less hidden layer; the narrow architecture (#4) has half of the original number of neurons in the fully connected hidden layers. Here we observe that the shallow architecture (#3) renders almost all attacks successful from the attacker's perspective. However, the potential SR on the victim is higher for the narrow setup (#4). This contradicts claims in prior work [37], which state that the architecture used is not a factor in attack success.

Instance knowledge. In #5 we simulate a scenario in which the attacker only knows 70% of the victim training set, while #7–8 model an attacker with 80% of the training set available and an additional subset of instances sampled from the same distribution.

These results might help us explain the contradiction with prior work. Indeed, we observe that a robust attacker classifier, trained on a sizable data set, reduces the SR to 19%, suggesting that the attack success sharply declines with fewer victim training instances available. In contrast, in [37] the SR remains at over 80% because of the non-random data-augmentation technique used to build the attacker training set. As a result, the attacker model is a closer approximation of the victim one, impacting the analysis along the A dimension.

Experiments #9–11 model the case where the attacker has limited Leverage and is unable to modify some of the instance features. This could represent a region where watermarks are added to images to check their integrity. We simulate it by considering a border in the image from which the modified pixels are discarded, corresponding to the attacker being able to modify 18%, 41% and 62% of an image, respectively. We observe a significant drop in transferability, although #11 shows that the SR is not reduced with leverage above a certain threshold.
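The leverage restriction can be simulated with a simple mask, sketched below under the assumption of a watermark-style border of fixed width (an illustration, not our implementation): any perturbation falling inside the protected border is discarded before the instance reaches the victim.

```python
import numpy as np

def enforce_leverage(original, perturbed, border):
    """Keep perturbations only inside the region the attacker is allowed to modify."""
    result = original.copy()
    h, w = original.shape
    result[border:h - border, border:w - border] = \
        perturbed[border:h - border, border:w - border]
    return result
```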

StingRay on the image classifier. We now evaluate the poisoning attack described in Section 4.2 under the same scenarios defined above. Table 4 summarizes our results. In contrast to evasion, the table reports the SR, the PDR, and the number of poison instances needed. Here, besides the perceived and potential statistics, we also report the actual SR and PDR (as reflected on the victim when triggering only the attacks perceived as successful).

For limited Feature knowledge, we observe that the perceived SR is over 84% but the actual success rate drops significantly on the victim. However, the actual SR for #2 is similar to the white-box attacker (#6), showing that features derived from the exterior regions of an image are less specific to an instance. This suggests that although reducing feature knowledge decreases the effectiveness of StingRay, the specificity of some known features may still enable successful attacks.

Along the A dimension, we observe that both architectures allow the attacker to accurately approximate the deep-space distance between instances. While the perceived SR is overestimated, the actual SR of these attacks is comparable to the white-box attack, showing that architecture secrecy does not significantly increase the resilience against these attacks.


Figure 3: Examples of original and crafted images. (a) Limited Feature knowledge: the images in the left panel are crafted with 39% and 66% of features unknown. (b) Limited Leverage: the images in the right panel are crafted with 100% and 50% leverage.

The open-source neural network architectures readily available for many classification tasks would aid the adversary. Along the I dimension, in #5, the PDR is increased because the smaller available training set size prevents the attacker from training a robust classifier. In the white-box attack #6 we observe that the perceived, actual and potential SRs are different. We determined that this discrepancy is caused by documented nondeterminism in the implementation framework. This affects the order in which instances are processed, causing variance in the model parameters, which in turn reflects on the effectiveness of poisoning instances. Nevertheless, we observe that the potential SR is higher in #5, even though the amount of available information is larger in #6. This highlights the benefit of a fine-grained analysis along all dimensions, since the attack success rate may not be monotonic in terms of knowledge levels.

Surprisingly, we observe that the actual SR for #8, where the attacker has more training instances at their disposal, is lower than for #7. This is likely caused by the fact that, with a larger discrepancy between the training sets of the victim and the attacker classifier, the attacker is more likely to select base instances that are not present in the victim training set. After poisoning the victim, the effect of crafted instances would not be bootstrapped by the base instances, and the attack fails. The results suggest that the attack is sensitive to the presence of specific pristine instances in the training set, and that variance in the model parameters could mitigate the threat. However, determining which instances should be kept secret is a subject for future research.

Limited Leverage increases the actual SR beyond the white-box attack. When discarding modified pixels, the overall perturbation is reduced. Thus, it is more likely that the poison samples will become collectively inconspicuous, increasing the attack effectiveness. Figure 3 illustrates some images crafted by constrained adversaries.

The FAIL analysis results show that the perceived PDR is generally an accurate representation of the actual value, making it easy for the adversary to assess the instance inconspicuousness and indiscriminate damage caused by the attack. The attacks transfer surprisingly well from the attacker to the victim, and a significant number of failed attacks would potentially be successful if triggered on the victim. We observe that limited leverage allows the attacker to localize their strategy, crafting attack instances that are even more successful than the white-box attack.

StingRay on the malware classifier. In order to evaluate StingRay in the FAIL setting on the malware classifier, we trigger all 1,717 attacks described in Section 4.2 across the 11 experimental settings. Table 5 summarizes the results. Experiment #6 corresponds to the white-box attacker.

Experiments #1–2 look at the case where Features are unknown to the adversary. In this case, the surrogate model used to craft poison instances includes only 20% and 60% of the features, respectively. Surprisingly, the attack is highly ineffective. Although the attacker perceives the attack as successful in some cases, the classifier trained on the available feature subspace is a very inaccurate approximation of the original one, resulting in an actual SR of at most 12%. These results echo those from evasion, indicating that feature secrecy might prove a viable lead toward effective defenses. We also investigate adversaries with various degrees of knowledge about the classification Algorithm. Experiment #3 trains a linear model using the Stochastic Gradient Descent (SGD) algorithm, and in #4 (dSVM), the hyperparameters of the SVM classifier are not known by the attacker. Compared with the original Drebin SVM classifier, the default setting in #4 uses a larger regularization parameter. This suggests that regularization can help mitigate the impact of individual poison instances, but the adversary may nevertheless be successful by injecting more crafted instances into the training set.
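A minimal sketch of the two surrogate configurations, using scikit-learn defaults as stand-ins for the exact hyperparameters used in these experiments:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

def build_surrogate(setting):
    """Surrogates for limited Algorithm knowledge: a linear SGD model (#3) or
    an SVM whose hyperparameters are not matched to the victim's (#4, dSVM)."""
    if setting == "SGD":
        return SGDClassifier(loss="hinge", max_iter=1000)
    if setting == "dSVM":
        return LinearSVC()  # default regularization, unknown to the attacker
    raise ValueError(f"unknown setting: {setting}")
```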

Instance knowledge. Experiments #5–6 look at a scenario in which the known instances are subsets of S∗. Unsurprisingly, the attack is more effective as more instances from S∗ become available. The attacker's inability to train a robust surrogate classifier is reflected in the large perceived PDR. For experiments #7–8, victim training instances are not available to the attacker, their classifier being trained on samples from the same underlying distribution as S∗. Under these constraints, the adversary could only approximate the effect of the attack on the targeted classifier. Additionally, the training instances might be significantly different from the base instances available to the adversary, canceling the effect of crafted instances. The results show, as in the case of the image classifier, that poison instances are highly dependent on other instances present in the training set to bootstrap their effect on target misclassification. We further look at the impact of limited Leverage on the attack effectiveness. Experiments #9–11 look at various training set sizes for the case where only the features extracted from AndroidManifest.xml are modifiable.


             StingRay             RONI          tRONI         MM
             |I| / SR% / PDR      Fix% / PDR    Fix% / PDR    Fix% / PDR

Images       16 / 70 / 0.97       -/-           -/-           -/-
Malware      77 / 50 / 0.99       0 / 0.98      15 / 0.98     -/-
Exploits     7 / 6 / 1.00         0 / 0.97      40 / 0.67     0 / 0.33
Breach       18 / 34 / 0.98       -/-           20 / 0.96     55 / 0.91

Table 6: Effectiveness of StingRay and of existing defenses against it on all applications. Each attack cell reports the average number of poison instances |I|, the SR, and the actual PDR. Each defense cell reports the percentage of fixed attacks and the PDR after applying the defense.

These features correspond to approximately 40% of the 545,333 existing features. Once again, we observe that the effectiveness of a constrained attacker is reduced. This signals that a viable defense could be to extract features from uncorrelated sources, which would limit the leverage of such an attacker.
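A minimal sketch of this leverage restriction over Drebin's binary feature vectors (the index set is illustrative): any modification the attack proposes outside the manifest-derived subset is reverted before the instance is released.

```python
import numpy as np

def restrict_to_manifest(base_vector, crafted_vector, manifest_idx):
    """Revert changes outside the modifiable (manifest-derived) feature subset."""
    restricted = base_vector.copy()
    restricted[manifest_idx] = crafted_vector[manifest_idx]
    return restricted
```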

The FAIL analysis on the malware classifier reveals that the actual drop in performance caused by the attacks is insignificant along all dimensions, but the attack effectiveness is generally decreased for weaker adversaries. However, feature secrecy and limited leverage appear to have the most significant effect on decreasing the success rate, hinting that they might be viable defense directions.

5.2 Effectiveness of StingRay

In this section we explore the effectiveness of StingRay across all applications described in Section 4.2 and compare existing defense mechanisms in terms of their ability to prevent the targeted mispredictions. Table 6 summarizes our findings. Here we only consider the strongest (white-box) adversary to determine upper bounds for the resilience against attacks, without assuming any degree of secrecy.

Image classifier. We observe that the attack is successful in 70% of the cases and yields an average PDR of 0.97, requiring an average of 16 instances. Upon further analysis, we discovered that the performance drop is due to other testing instances similar to the target being misclassified as yd. By tuning the attack parameters (e.g., the layer used for comparing features or the degree of allowed perturbation) to generate poison instances that are more specific to the target, the performance drop on the victim could be further reduced at the expense of requiring more poisoning instances. Nevertheless, this shows that neural nets define a fine-grained boundary between class-targeted and instance-targeted poisoning attacks and that it is not straightforward to discover it, even with complete adversarial knowledge.

None of the three poisoning defenses are applicable to this task. RONI and tRONI require training over 50,000 classifiers for each level of inspected negative impact.

This is prohibitive for neural networks, which are known to be computationally intensive to train. Since we could not determine reliable timestamps for the images in the data set, MM was not applicable either.

Malware classifier. StingRay succeeds in half of the cases and yields a negligible performance drop on the victim. The attack being cut off by the crafting budget on most failures (Cost B.VII) suggests that some targets might be too "far" from the class separator and that moving this separator becomes difficult. Nevertheless, understanding what causes this hardness remains an open question.
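To illustrate why this is costly, the sketch below gives a simplified version of the RONI idea (the fold construction and thresholds used in these experiments are not reproduced): each candidate training instance requires retraining the classifier and re-evaluating it on a validation set.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def roni_filter(model, X_cal, y_cal, X_val, y_val, candidates, labels, threshold=0.01):
    """Keep candidates whose inclusion does not degrade validation accuracy
    by more than `threshold` (a simplified Reject On Negative Impact test)."""
    base_acc = accuracy_score(y_val, clone(model).fit(X_cal, y_cal).predict(X_val))
    kept = []
    for x, y in zip(candidates, labels):
        # Retrain with the candidate added to the calibration set.
        X_aug = np.vstack([X_cal, x.reshape(1, -1)])
        y_aug = np.append(y_cal, y)
        aug_acc = accuracy_score(y_val, clone(model).fit(X_aug, y_aug).predict(X_val))
        if base_acc - aug_acc <= threshold:
            kept.append((x, y))
    return kept
```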

On defenses, we observe that RONI often fails to build correctly-predicting folds on Drebin and times out. Hence, we investigate the defenses against only the 97 successful attacks for which RONI did not time out. MM rejects all training instances, while RONI fails to detect any attack instances. tRONI detects very few poison instances, fixing only 15% of attacks, as they do not have a large negative impact, individually, on the misclassification of the target. None of these defenses are able to fix a large fraction of the induced mispredictions.

Exploit predictor. While poisoning a small number of instances, the attack has a very low success rate. This is due to the fact that the non-Twitter features are not modifiable; if the data set does not contain other vulnerabilities similar to the target (e.g., similar product or type), the attack would need to poison more CVEs, reaching Nmax before succeeding. The result, backed by our FAIL analysis of the other linear classifier in Section 5.1, highlights the benefits of built-in leverage limitations in protecting against such attacks.

MM correctly identifies the crafted instances but also marks a large fraction of positively-labeled instances as suspicious. Consequently, the PDR on the classifier is severely affected. In instances where it does not time out, RONI fails to mark any instance. Interestingly, tRONI marks a small fraction of attack instances, which helps correct 40% of the predictions but still hurts the PDR. The partial success of tRONI is due to two factors: the small number of instances used in the attack and the limited leverage for the attacker, which boosts the negative impact of attack instances through resampling. We observed that, due to variance, the negative impact computed by tRONI is larger than the one perceived by the attacker for discovered instances. The adversary could adapt by increasing the confidence level of the statistic that reflects negative impact in the StingRay algorithm.

Data breach predictor. The attacks for this application correspond to two scenarios, one with limited leverage over the number of time series features. Indeed, the one in which the attacker has limited leverage has an SR of 5%, while the other one has an SR of 63%. This corroborates our observation of the impact of adversarial leverage for the exploit predictor.


RONI fails due to consistent timeouts in training the random forest classifier. tRONI fixes 20% of the attacks while decreasing the PDR slightly. MM is a natural fit for the features based on time series and is able to build more balanced voting folds. The defense fixes 55% of mispredictions, at the expense of lowering the PDR to 0.91.

Our results suggest that StingRay is practical against a variety of classification tasks, even with limited degrees of leverage. Existing defenses, where applicable, are easily bypassed by lowering the required negative impact of crafted instances. However, the reduced attack success rate on applications with limited leverage suggests new directions for future defenses.

6 Related Work

Several studies have proposed ways to model adversaries against machine learning systems. [25] proposes FTC (features, training set, and classifier), a model to define an attacker's knowledge and capabilities in the case of a practical evasion attack. Unlike the FTC model, the FAIL model is evaluated on both test-time and training-time attacks, enables a fine-grained analysis of the dimensions, and includes Leverage. These characteristics allow us to better understand how the F and L dimensions influence attack success. Furthermore, [27, 7] introduce game-theoretic Stackelberg formulations for the interaction between the adversary and the data miner in the case of data manipulations. Adversarial limitations are also discussed in [22]. Several attacks against machine learning consider adversaries with varying degrees of knowledge, but they do not cover the whole spectrum [4, 35, 37]. Recent studies investigate transferability in attack scenarios with limited knowledge about the target model [36, 28, 9]. The FAIL model unifies these dimensions and can be used to model these capabilities systematically across multiple attacks, under realistic assumptions about adversaries. Unlike game-theoretic approaches, FAIL does not assume perfect knowledge for either the attacker or the defender. By defining a wider spectrum of adversarial knowledge, FAIL generalizes the notion of transferability.

Prior work introduced indiscriminate and targeted poisoning attacks. For indiscriminate poisoning, a spammer can force a Bayesian filter to misclassify legitimate emails by including a large number of dictionary words in spam emails, causing the classifier to learn that all tokens are indicative of spam [3]. An attacker can degrade the performance of a Twitter-based exploit predictor by posting fraudulent tweets that mimic most of the features of informative posts [40]. One could also damage the overall performance of an SVM classifier by injecting a small volume of crafted attack points [5]. For targeted poisoning, a spammer can trigger the filter against a specific legitimate email by crafting spam emails resembling the target [34]. This was also studied in the healthcare field, where an adversary can subvert the predictions for a whole target class of patients by injecting fake patient data that resembles the target class [32]. StingRay is a model-agnostic targeted poisoning attack that works on a broad range of applications. Unlike existing targeted poisoning attacks, StingRay aims to bound the indiscriminate damage in order to preserve the overall performance.

On neural networks, [23] proposes a targeted poisoning attack that modifies training instances which have a strong influence on the target loss. In [51], the poisoning attack is a white-box indiscriminate attack adapted from existing evasion work. Furthermore, [29] and [20] introduce backdoor and trojan attacks, where adversaries cause classifiers to misbehave when a trigger is present in the input. The targeted poisoning attack proposed in [11] requires the attacker to assign labels to crafted instances. Unlike these attacks, StingRay does not require white-box or query access to the original model. Our attack also does not require control over the labeling function or modifications to the target instance.

7 Discussion

The vulnerability of ML systems to evasion and poisoning attacks leads to an arms race, where defenses that seem promising are quickly thwarted by new attacks [17, 37, 38, 9]. Previous defenses make implicit assumptions about how the adversary's capabilities should be constrained to improve the system's resilience to attacks. The FAIL adversary model provides a framework for exposing and systematizing these assumptions. For example, the feature squeezing defense [49] constrains the adversary along the A and F dimensions by modifying the input features and adding an adversarial example detector. Similarly, RONI constrains the adversary along the I dimension by sanitizing the training data. The ML-based systems employed in the security industry [21, 10, 39, 12] often rely on undisclosed features to render attacks more difficult, thus constraining the F dimension. In Table 2 we highlight implicit and explicit assumptions of previous defenses against poisoning and evasion attacks.

Through our systematic exploration of the FAIL dimensions, we provide the first experimental comparison of the importance of these dimensions for the adversary's goals, in the context of targeted poisoning and evasion attacks. For a linear classifier, our results suggest that feature secrecy is the most promising direction for achieving attack resilience. Additionally, reducing leverage can increase the cost for the attacker. For a neural-network-based image recognition system, our results suggest that StingRay's samples are transferable across all dimensions.


Interestingly, limiting the leverage causes the attacker to craft instances that are more potent in triggering the attack. We also observed that secrecy of training instances provides limited resilience.

Furthermore, we demonstrated that the FAIL adversary model is applicable to targeted evasion attacks as well. By systematically capturing an adversary's knowledge and capabilities, the FAIL model also defines a more general notion of attack transferability. In addition to investigating transferability under certain dimensions, such as the A dimension in [9] or the A and I dimensions in [37], generalized transferability covers a broader range of adversaries. At odds with the original findings in [37], our results suggest a lack of generalized transferability for a state-of-the-art evasion attack, while highlighting feature secrecy as the most prominent factor in reducing the attack success rate. Future research may utilize this framework as a vehicle for reasoning about the most promising directions for defending against other attacks.

Our results also provide new insights for the broader debate about the generalization capabilities of neural networks. While neural networks have dramatically reduced test-time errors for many applications, which suggests they are capable of generalization (e.g., by learning meaningful features from the training data), recent work [53] has shown that neural networks can also memorize randomly-labeled training data (which lack meaningful features). We provide a first step toward understanding the extent to which an adversary can exploit this behavior through targeted poisoning attacks. Our results are consistent with the hypothesis that an attack such as StingRay can force selective memorization for a target instance while preserving the generalization capabilities of the model. We leave rigorously testing this hypothesis for future work.

8 Conclusions

We introduce the FAIL model, a general framework for evaluating realistic attacks against machine learning systems. We also propose StingRay, a targeted poisoning attack designed to bypass existing defenses. We show that our attack is practical for 4 classification tasks, which use 3 different classifiers. By exploring the FAIL dimensions, we observe new transferability properties in existing targeted evasion attacks and highlight characteristics that could provide resilience against targeted poisoning. This exploration generalizes the prior work on attack transferability and provides new results on the transferability of poison samples.

Acknowledgments  We thank Ciprian Baetu, Jonathan Katz, Daniel Marcu, Tom Goldstein, Michael Maynord, Ali Shafahi, W. Ronny Huang, our shepherd, Patrick McDaniel, and the anonymous reviewers for their feedback. We also thank the Drebin authors for giving us access to their data set and VirusTotal for access to their service. This research was partially supported by the Department of Defense and the Maryland Procurement Office (contract H98230-14-C-0127).

References

[1] MALANOV, A. (KASPERSKY LAB). The multilayered security model in Kaspersky Lab products, Mar 2017.
[2] ARP, D., SPREITZENBARTH, M., HUBNER, M., GASCON, H., AND RIECK, K. Drebin: Effective and explainable detection of Android malware in your pocket. In NDSS (2014).
[3] BARRENO, M., NELSON, B., JOSEPH, A. D., AND TYGAR, J. D. The security of machine learning. Machine Learning 81 (2010), 121–148.
[4] BIGGIO, B., CORONA, I., MAIORCA, D., NELSON, B., SRNDIC, N., LASKOV, P., GIACINTO, G., AND ROLI, F. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2013), Springer, pp. 387–402.
[5] BIGGIO, B., NELSON, B., AND LASKOV, P. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389 (2012).
[6] BOJARSKI, M., YERES, P., CHOROMANSKA, A., CHOROMANSKI, K., FIRNER, B., JACKEL, L., AND MULLER, U. Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv preprint arXiv:1704.07911 (2017).
[7] BRUCKNER, M., AND SCHEFFER, T. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2011), ACM, pp. 547–555.
[8] CARLINI, N., AND WAGNER, D. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (2017), ACM, pp. 3–14.
[9] CARLINI, N., AND WAGNER, D. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on (2017), IEEE, pp. 39–57.
[10] CHAU, D. H. P., NACHENBERG, C., WILHELM, J., WRIGHT, A., AND FALOUTSOS, C. Polonium: Tera-scale graph mining for malware detection. In SIAM International Conference on Data Mining (SDM) (Mesa, AZ, April 2011).
[11] CHEN, X., LIU, C., LI, B., LU, K., AND SONG, D. Targeted backdoor attacks on deep learning systems using data poisoning. ArXiv e-prints (Dec. 2017).
[12] COLVIN, R. Stranger danger - introducing SmartScreen application reputation. http://blogs.msdn.com/b/ie/archive/2010/10/13/stranger-danger-introducing-smartscreen-application-reputation.aspx, Oct 2010.
[13] CRETU, G. F., STAVROU, A., LOCASTO, M. E., STOLFO, S. J., AND KEROMYTIS, A. D. Casting out demons: Sanitizing training data for anomaly sensors. In Security and Privacy, 2008. SP 2008. IEEE Symposium on (2008), IEEE, pp. 81–95.
[14] ERNST YOUNG LIMITED. The future of underwriting. http://www.ey.com/us/en/industries/financial-services/insurance/ey-the-future-of-underwriting, 2015.
[15] FAIR ISAAC CORPORATION. FICO enterprise security score gives long-term view of cyber risk exposure, November 2016. http://www.fico.com/en/newsroom/fico-enterprise-security-score-gives-long-term-view-of-cyber-risk-exposure-10-27-2016.
[16] GILAD-BACHRACH, R., DOWLIN, N., LAINE, K., LAUTER, K., NAEHRIG, M., AND WERNSING, J. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning (2016), pp. 201–210.
[17] GOODFELLOW, I. J., SHLENS, J., AND SZEGEDY, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
[18] GOOGLE RESEARCH BLOG. Assisting pathologists in detecting cancer with deep learning. https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html, Mar 2017.
[19] GROSSE, K., PAPERNOT, N., MANOHARAN, P., BACKES, M., AND MCDANIEL, P. Adversarial perturbations against deep neural networks for malware classification. arXiv preprint arXiv:1606.04435 (2016).
[20] GU, T., DOLAN-GAVITT, B., AND GARG, S. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733 (2017).
[21] HEARN, M. Abuse at scale. In RIPE 64 (Ljubljana, Slovenia, Apr 2012). https://ripe64.ripe.net/archives/video/25/.
[22] HUANG, L., JOSEPH, A. D., NELSON, B., RUBINSTEIN, B. I., AND TYGAR, J. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence (2011), ACM, pp. 43–58.
[23] KOH, P. W., AND LIANG, P. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730 (2017).
[24] KRIZHEVSKY, A., AND HINTON, G. Learning multiple layers of features from tiny images. Citeseer (2009).
[25] LASKOV, P., ET AL. Practical evasion of a learning-based classifier: A case study. In Security and Privacy (SP), 2014 IEEE Symposium on (2014), IEEE, pp. 197–211.
[26] LECUN, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998).
[27] LIU, W., AND CHAWLA, S. A game theoretical model for adversarial learning. In Data Mining Workshops, 2009. ICDMW'09. IEEE International Conference on (2009), IEEE, pp. 25–30.
[28] LIU, Y., CHEN, X., LIU, C., AND SONG, D. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770 (2016).
[29] LIU, Y., MA, S., AAFER, Y., LEE, W.-C., ZHAI, J., WANG, W., AND ZHANG, X. Trojaning attack on neural networks. Tech. Rep. 17-002, Purdue University, 2017.
[30] LIU, Y., SARABI, A., ZHANG, J., NAGHIZADEH, P., KARIR, M., BAILEY, M., AND LIU, M. Cloudy with a chance of breach: Forecasting cyber security incidents. In 24th USENIX Security Symposium (USENIX Security 15) (2015), pp. 1009–1024.
[31] MIT TECHNOLOGY REVIEW. How to upgrade judges with machine learning. https://www.technologyreview.com/s/603763/how-to-upgrade-judges-with-machine-learning/, Mar 2017.
[32] MOZAFFARI-KERMANI, M., SUR-KOLAY, S., RAGHUNATHAN, A., AND JHA, N. K. Systematic poisoning attacks on and defenses for machine learning in healthcare. IEEE Journal of Biomedical and Health Informatics 19, 6 (2015), 1893–1905.
[33] MUNOZ-GONZALEZ, L., BIGGIO, B., DEMONTIS, A., PAUDICE, A., WONGRASSAMEE, V., LUPU, E. C., AND ROLI, F. Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (2017), ACM, pp. 27–38.
[34] NELSON, B., BARRENO, M., CHI, F. J., JOSEPH, A. D., RUBINSTEIN, B. I. P., SAINI, U., SUTTON, C., TYGAR, J. D., AND XIA, K. Exploiting machine learning to subvert your spam filter. In Proceedings of the 1st USENIX Workshop on Large-Scale Exploits and Emergent Threats (Berkeley, CA, USA, 2008), LEET'08, USENIX Association, pp. 7:1–7:9.
[35] PAPERNOT, N., MCDANIEL, P., JHA, S., FREDRIKSON, M., CELIK, Z. B., AND SWAMI, A. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P) (2016), IEEE, pp. 372–387.
[36] PAPERNOT, N., MCDANIEL, P. D., AND GOODFELLOW, I. J. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR abs/1605.07277 (2016).
[37] PAPERNOT, N., MCDANIEL, P. D., GOODFELLOW, I. J., JHA, S., CELIK, Z. B., AND SWAMI, A. Practical black-box attacks against deep learning systems using adversarial examples. In ACM Asia Conference on Computer and Communications Security (Abu Dhabi, UAE, 2017).
[38] PAPERNOT, N., MCDANIEL, P. D., WU, X., JHA, S., AND SWAMI, A. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy (2016), IEEE Computer Society, pp. 582–597.
[39] RAJAB, M. A., BALLARD, L., LUTZ, N., MAVROMMATIS, P., AND PROVOS, N. CAMP: Content-agnostic malware protection. In Network and Distributed System Security (NDSS) Symposium (San Diego, CA, Feb 2013).
[40] SABOTTKE, C., SUCIU, O., AND DUMITRAS, T. Vulnerability disclosure in the age of social media: exploiting Twitter for predicting real-world exploits. In 24th USENIX Security Symposium (USENIX Security 15) (2015), pp. 1041–1056.
[41] SAINI, U. Machine learning in the presence of an adversary: Attacking and defending the SpamBayes spam filter. Tech. rep., DTIC Document, 2008.
[42] STEINHARDT, J., KOH, P. W. W., AND LIANG, P. S. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems (2017), pp. 3520–3532.
[43] SUCIU, O., MARGINEAN, R., KAYA, Y., DAUME III, H., AND DUMITRAS, T. When does machine learning fail? Generalized transferability for evasion and poisoning attacks. arXiv preprint arXiv:1803.06975 (2018).
[44] SZEGEDY, C., ZAREMBA, W., SUTSKEVER, I., BRUNA, J., ERHAN, D., GOODFELLOW, I., AND FERGUS, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
[45] TAMERSOY, A., ROUNDY, K., AND CHAU, D. H. Guilt by association: large scale malware detection by mining file-relation graphs. In KDD (2014).
[46] TRAMER, F., ZHANG, F., JUELS, A., REITER, M., AND RISTENPART, T. Stealing machine learning models via prediction APIs. In 25th USENIX Security Symposium (USENIX Security 16) (Austin, TX, Aug. 2016), USENIX Association.
[47] VERIZON. Data breach investigations reports (DBIR), February 2012. http://www.verizonenterprise.com/DBIR/.
[48] VIRUSTOTAL. http://www.virustotal.com.
[49] XU, W., EVANS, D., AND QI, Y. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155 (2017).
[50] XU, W., QI, Y., AND EVANS, D. Automatically evading classifiers. In Proceedings of the 2016 Network and Distributed Systems Symposium (2016).
[51] YANG, C., WU, Q., LI, H., AND CHEN, Y. Generative poisoning attack method against neural networks. arXiv preprint arXiv:1703.01340 (2017).
[52] YOSINSKI, J., CLUNE, J., BENGIO, Y., AND LIPSON, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (2014), pp. 3320–3328.
[53] ZHANG, C., BENGIO, S., HARDT, M., RECHT, B., AND VINYALS, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016).

Appendix
A The StingRay Attack

Algorithm 1 shows the pseudocode of StingRay's two general-purpose procedures. STINGRAY builds a set I with at least Nmin and at most Nmax attack instances. In the sample crafting loop, this procedure invokes GETBASEINSTANCE to select appropriate base instances for the target. Each iteration of the loop crafts one poison instance by invoking CRAFTINSTANCE, which modifies the set of allowable features (according to FAIL's L dimension) of the base instance. This procedure is specific to each application. The other application-specific elements are the distance function D and the method for injecting the poison into the training set: the crafted instances may either replace or complement the base instances, depending on the application domain. Next, we describe the steps that overcome the main challenges of targeted poisoning.

Application-specific instance modification. CRAFTINSTANCE crafts a poisoning instance by modifying the set of allowable features of the base instance. The procedure selects a random sample among these features, under the constraint of the target resemblance budget. It then alters these features to resemble those of the target. Each crafted sample introduces only a small perturbation that may not be sufficient to induce the target misclassification; however, because different samples modify different features, they collectively teach the classifier that the features of t correspond to label yd. We discuss the implementation details of this procedure for the four applications in Section 4.2.

Crafting individually inconspicuous samples. To ensure that the attack instances do not stand out from the rest of the training set, GETBASEINSTANCE randomly selects a base instance from S′, labeled with the desired target class yd, that lies within τD distance from the target. By choosing base instances that are as close to the target as possible, the adversary reduces the risk that the crafted samples will become outliers in the training set. The adversary can further reduce this risk by trading target resemblance (modifying fewer features in the crafted samples) for the need to craft more poison samples (increasing Nmin). The adversary then checks the negative impact of the crafted instance on the training set sample S′. The crafted instance xc is discarded if it changes the prediction on t above the attacker-set threshold τNI, and added to the attack set otherwise. To validate that these techniques result in individually inconspicuous samples, we consider whether our crafted samples would be detected by three anti-poisoning defenses, discussed in detail in Section 4.1.

Crafting collectively inconspicuous samples. After the crafting stage, GETPDR checks the perceived PDR on the available classifier. The attack is considered successful if both adversarial goals are achieved: changing the prediction of the available classifier and not decreasing the PDR below a desired threshold τPDR.

Guessing the labels of the crafted samples. By modifying only a few features in each crafted sample, CRAFTINSTANCE aims to preserve the label yd of the base instance. While the adversary is unable to dictate how the poison samples will be labeled, they might guess this label by consulting an oracle. We discuss the effectiveness of this technique in Section 4.3.
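The following sketch summarizes the general-purpose loop described above; the application-specific pieces (craft_instance, the distance function D, negative-impact estimation, and retraining) are assumed callables, and the budget and threshold names mirror the text.

```python
import random

def stingray(target, y_d, S_prime, train_classifier, craft_instance, distance,
             negative_impact, get_pdr, N_min, N_max, tau_D, tau_NI, tau_PDR):
    """Illustrative sketch of the STINGRAY loop; not a drop-in implementation."""
    # GetBaseInstance candidates: labeled y_d and within tau_D of the target.
    bases = [x for x, y in S_prime if y == y_d and distance(x, target) <= tau_D]
    poison = []
    while bases and len(poison) < N_max:
        base = random.choice(bases)
        crafted = craft_instance(base, target, y_d)      # application-specific
        if negative_impact(crafted, S_prime, target) <= tau_NI:
            poison.append((crafted, y_d))                # individually inconspicuous
        if len(poison) >= N_min:
            clf = train_classifier(S_prime + poison)     # attacker's available view
            # Perceived success: target mispredicted and PDR above the threshold.
            if clf.predict([target])[0] == y_d and get_pdr(clf) >= tau_PDR:
                return poison
    return None  # Nmax reached without achieving both adversarial goals
```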
