A Dynamic Approach for Characterizing Collusion in Desktop Grids

Louis-Claude Canon∗§, Emmanuel Jeannot†§, Jon Weissman‡

∗Nancy University, LORIA and LaBRI, Nancy & Bordeaux, France. Email: [email protected]

†INRIA, LORIA, LaBRI, Nancy & Bordeaux, France. Email: [email protected]

‡Dept. of Computer Science and Engineering, University of Minnesota, Twin Cities, Minneapolis, USA. Email: [email protected]

Abstract—By exploiting idle time on volunteer machines, desktop grids provide a way to execute large sets of tasks with negligible maintenance and low cost. Although desktop grids are attractive for cost-conscious projects, relying on external resources may compromise the correctness of application execution due to the well-known unreliability of nodes. In this paper, we consider the most challenging threat model: organized groups of cheaters that may collude to produce incorrect results. We propose two on-line algorithms for detecting collusion and characterizing the participant behaviors. Using several real-life traces, we show that our approach is accurate and efficient in identifying collusion and in estimating group behavior.

Keywords-Desktop Grid; Collusion; Modeling; Sabotage

I. INTRODUCTION

Desktop grids have proven to be a useful platform for many long-running scientific and engineering applications [1]–[4]. The attractive features of this paradigm include nearly infinite scale-out, low cost of deployment and management, and simplicity. However, these benefits do not come for free. The wide-dispersion, autonomous, and uncontrolled nature of the system makes this platform inherently unpredictable and unstable. Network links may go up and down, nodes may leave and join unpredictably (including through node failure), and nodes may fail to deliver correct and timely results. Much research has focused on how to mask this uncertainty under a set of assumptions, the most common being that failures are uncorrelated. With this assumption, tasks may be replicated across nodes to improve their reliability both in terms of timeliness (job completion time) and correctness (accuracy). General solutions for correctness assume a result cannot be certified in isolation and requires indirect validation through the use of pre-computed quizzes [5] or voting coupled with reputation systems [6]. The latter can be effective if the answer space is extremely large, making the probability that two uncorrelated errors produce the same wrong result negligibly small. However, Internet-scale systems are rich with examples of correlated misbehavior and errors, including botnet attacks, viruses, worms, sybil attacks, and buggy software distributions, to name a few. We believe that the long-running nature of desktop grid systems makes them particularly vulnerable to such phenomena.

§This work was done while L.C. Canon and E. Jeannot moved from the LORIA lab. (Nancy) to the LaBRI lab. (Bordeaux).

In this paper, we explore efficient and accurate techniques for representing, detecting, and characterizing the presence of correlated or collusive behavior in desktop grid systems. We distinguish between collusive behavior (correlated bad behavior) and agreement (general agreement between nodes, whether good or bad). Both concepts are useful for schedulers that wish to thwart collusion in the system. The principal challenge is to derive trustworthy global knowledge from unreliable sources. We present efficient data structures and procedures that store and update collusion and agreement probabilities across arbitrary node groups. We assume a very aggressive threat model, and our method requires only simple observations of worker behavior to be effective. We assess the accuracy of our approach using traces from well-known desktop grid applications. The results reveal that our approach is both accurate (the estimated collusion and agreement probabilities are close to the actual values) and efficient (the time required to compute these values across different nodes is small). In addition, colluding and non-colluding groups are accurately identified.

II. RELATED WORK

The goal of reputation systems is to associate reputation indices with each known worker. The literature in this area is rich, and one of the best illustrative examples is the EigenTrust algorithm [7], which can be applied in desktop grids. Such methods do not target collusion detection and hence are not robust to an orchestrated attack.

Collusion estimation is also closely related to fault diagnosis [8], whose goal is to find faulty processors by performing a series of requests between pairs of processors. This problem is based on the assumption that processors are not hostile and do not try to deceive the detection mechanism. In this paper, we propose mechanisms based on observations that tolerate arbitrary erroneous behaviors.

Considering collusion with scheduling has been addressed by several projects. Zhao, Lo, and Gauthier-Dickey (see [5]) have analyzed two solutions based on redundancy and spot-checking and provide probabilistic analysis for both approaches. With their spot-checking mechanism, quiz tasks with verifiable results are inserted in the workload and allow the detection of cheaters. Ensuring that quiz tasks are indistinguishable from regular tasks, however, is a difficult problem. Similarly, Yurkewych, Levine, and Rosenberg propose a way to estimate the cost of a redundancy mechanism in the presence of colluders using game theory (see [9]). In this work, the goal is to determine the probability of auditing a task that minimizes the risk of collusion. These two works are based on the possibility of checking the result of a job. We do not consider this option, as such a check is not always possible. However, such approaches are complementary to ours.

Silaghi et al. in [10] have proposed a technique that can thwart collusive behavior. The mechanism is based on redundancy and a reputation system that uses the EigenTrust algorithm. Workers detected as colluders are blacklisted and their previously computed jobs are resubmitted. However, this work has several drawbacks: it assumes that the detection algorithm is not known by the worker, and it has to wait until the completion of all jobs before certifying the result. In contrast to this work, we propose a public algorithm where the collusion characterization is improved each time a result is returned.

Therefore, to the best of our knowledge, this work is the first to tackle the problem of collusion characterization in the context of desktop grids using both general and practical assumptions.

III. MODELS AND DEFINITIONS

A. Platform and Threat Model

We propose the following model of a desktop grid (see Figure 1), directly inspired from BOINC [1]:

• We are given a batch of jobs to be executed on the platform.

Figure 1: A Desktop Grid. (The figure shows a server distributing a batch of tasks over the network to a set of workers.)

• We have n workers. Each worker w ∈ W is able to compute all jobs in the batch. However, as workers may come and go, they are not always available. This availability is defined by a set of intervals. If a worker computing a job leaves the system, it resumes job execution when it comes back.

• The server assigns each job to a set of workers and gathers the job results. For each job, we assume that there is only one correct result. Moreover, the result space is sufficiently large such that if two workers return the same incorrect result, this means that they have colluded.

Collusion is defined as the cooperation of multiple workers in order to send the same incorrect result. We distinguish between two types of collusion. The first is when the saboteurs voluntarily cooperate to send more than one incorrect result, trying to defeat the quorum algorithm. The other case is where workers do not deliberately collude, as in the case of a virus or a bug in the code executed by the workers. Moreover, it is possible that a worker (a colluder or not) may simply fail to correctly execute the job. To model these possibilities, we consider three worker behaviors:

• a worker may fail independently of others and return an incorrect result. The failure probability is fixed.

• a worker may belong to the non-colluding group. Such a non-colluding worker never colludes with another worker but may fail. We assume that at least half of the workers are of this class.

• a worker may belong to one or more colluding groups. In order to reduce the chance of being detected, members of a group will act sometimes as colluders (returning the same incorrect result) and sometimes as non-colluding workers (returning the same correct result). The probability that a group decides to collude or not is fixed. Moreover, it is possible that two different groups of colluders collude together (i.e., two workers from two different groups may send the same incorrect result with a given probability). A worker in a colluding group may also fail independently (failures are predominant over collusion), in which case it returns an incorrect result alone.

We want to emphasize that the threat model proposed here is very strong: a worker may belong to one or more groups, groups can cooperate, colluders may sometimes send a correct result to stay undetected, colluders are not required to synchronize to send the same incorrect result, and none of this information is known a priori by the server.

B. Problem Definition

The problem we tackle focuses on characterizing the collusion properties of the workers, i.e., for each worker we estimate the group(s) to which it belongs, the collusion probability of each group, and the collusion probability between groups. Although the way jobs are scheduled to workers can certainly help to detect and characterize the worker behaviors, here we make almost no assumption on how jobs are mapped to workers. Indeed, we want to propose a general strategy that must work for any (reasonable) scheduling algorithm. The only explicit assumption we make is that jobs are replicated for the sake of fault tolerance, as in BOINC. Hence, the input of our problem is composed of two types of events:

• <t, w, j, r>: at time t, worker w returns result r for job j (job result).

• <t, j>: at time t, all the workers assigned to job j have finished their computation (job completion).

We assume that these events arrive in an order dictated by the worker speed, and thus are time-stamped.
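For concreteness, the two event types can be modeled as timestamped records; this is an illustrative sketch, and the class and field names below are ours, not the paper's.

```python
from dataclasses import dataclass

# Hypothetical record types for the two input events (names are ours).

@dataclass(frozen=True)
class JobResult:
    t: float  # timestamp
    w: int    # worker id
    j: int    # job id
    r: str    # returned result

@dataclass(frozen=True)
class JobCompletion:
    t: float  # timestamp
    j: int    # job id: all workers assigned to j have finished

# Events arrive time-stamped and are processed in chronological order.
events = [JobCompletion(2.5, 3), JobResult(1.0, 7, 3, "0xbeef")]
events.sort(key=lambda e: e.t)
```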

C. Group Model and Metrics

To account for the property that a worker may belong to several groups, we transform the input worker groups into a new set of groups. In the new groups, each worker belongs to exactly one group, and we update the collusion probability accordingly. For example, if workers w1 and w2 belong to groups g and g′ respectively, we put them in a third group g′′ where the probability of collusion between g and g′′ is the internal collusion probability of g. Hence, in the remainder of this paper, we presume that all workers are in exactly one group.

To model reality, we group workers that share a common behavior. The output of our collusion characterization system is a set of groups of workers that have the same collusion characteristics (i.e., when a job is assigned to a subset of workers within such a group, they always send the same result, correct or incorrect). To measure the accuracy of our solution, we compare the estimation with the actual collusion probabilities and group composition.

We denote by G the real group set and Ĝ the estimated group set. We assume that both sets are complete (every worker is in exactly one group). Let O_w ∈ Ĝ be the set of workers in the same group as worker w. We want the observed groups ⋃_{w∈W} O_w to agree with the real groups of colluding workers, and the observed probabilities to be close to the actual values. Combining both of these qualitative and quantitative measures into one accuracy metric is a difficult challenge. To do this, we proceed as follows. Let g ∈ G and g′ ∈ G represent two real groups of workers. Let e_{g∪g′} be the absolute difference between the estimated probability that all the workers from set g ∪ g′ return the same incorrect result for the same job, and the actual probability. We then use the RMSD (Root Mean Square Deviation) to aggregate the errors made for each estimation:

    (1/|G|) √( Σ_{(g,g′)∈G²} e²_{g∪g′} )    (1)

A smaller RMSD signals a more accurate group-based estimation. How to compute e_{g∪g′} depends on the way we internally represent Ĝ. We propose two representations of Ĝ shortly, and we show how to bound e_{g∪g′} in Section IV-A4. We consider two different aspects of accuracy:

• convergence time: the time needed to achieve a desirable accuracy (defined by a threshold on the RMSD);

• stabilized accuracy: the accuracy achieved after a large number of events (the median RMSD during the last 100 events of the simulation).
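As a quick sketch of Equation (1), the RMSD over the per-pair errors e_{g∪g′} can be computed as follows (the function name is ours):

```python
import math

def rmsd(errors, num_real_groups):
    """Equation (1): errors holds one e_{g∪g'} value per ordered pair
    (g, g') of real groups; the result is their root-sum-square
    scaled by 1/|G|."""
    return math.sqrt(sum(e * e for e in errors)) / num_real_groups

# Two real groups -> 4 ordered pairs; a uniform error of 0.1 yields
# sqrt(4 * 0.01) / 2 = 0.1.
```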

IV. CHARACTERIZING COLLUSION

The characterization and representation of collusion has several elements. We first present a way to model and calculate interactions between worker groups, colluding or non-colluding. The goal is to efficiently identify and update worker groups. For example, if all workers within a group return the same result, we will use only one entry to update the interactions. We then present the grouping algorithms based on two different representations, collusion and agreement, each of which has different useful properties for schedulers that wish to thwart collusion. The dynamic/on-line algorithm we propose provides a new estimation each time an event arrives.

The algorithm groups workers sharing similar characteristics. Events are treated as observations about the interactions between the groups (non-colluding or colluding) of workers. Thus, in some cases, as we work at the group level, we will receive redundant information from individual workers. In this case, the algorithm takes care to update its internal representation only once. For instance, if all the workers belonging to the same group return the same result, we will use only one entry to update the interaction between this group and the other groups.

W        Set of workers
n        Number of workers (n = |W|)
G        Real set of worker groups
Ĝ        Observed set of worker groups
C        Real collusion matrix (C is of size |G|)
Ĉ        Observed collusion matrix (|Ĉ| = |Ĝ|)
Â        Observed agreement matrix (|Â| = |Ĝ|)
c_ij     Probability of collusion between groups i and j
c_ii     Probability that workers of group i collude when assigned to the same job
ĉ_ij     Observed probability of collusion between groups i and j
â_ij     Observed probability of agreement between groups i and j
O_w      Set of workers that are put in the same group as worker w (O_w ∈ Ĝ)
L        Subset of workers (L ⊆ W)
K_L      Set of groups covering every worker of L (K_L = ⋃_{w∈L} O_w)
E_L      Event that all workers of L have colluded to return the same incorrect result for a given job
W_{r,j}  Set of workers that have returned result r for job j

Table I: List of symbols

A. Interaction Model

Each interaction between or within groups is represented by a random variable having a beta distribution. The use of a beta distribution comes from Bayesian inference, in which observations are used to update or infer the probability of a hypothesis [11]. In our case, it works as follows. A beta distribution has two parameters, α and β. The mean of the distribution represents the estimated probability of an interaction and is equal to α/(α+β). At the beginning of the process, we set α = 1 and β = 1. When an event reinforces an interaction, we increment the α parameter of the corresponding distribution; we increment the β parameter in the opposite case. Moreover, the confidence interval of a given distribution represents the confidence that we have in the corresponding estimation. This confidence interval is a function of α and β (i.e., of the number of observations).
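The update rule just described can be sketched as a pair of counters per interaction; this is an illustrative Python sketch (class and method names are ours), not the authors' implementation.

```python
class BetaEstimate:
    """One interaction estimate: start from alpha = beta = 1; an event
    that reinforces the interaction increments alpha, an event that
    contradicts it increments beta."""

    def __init__(self):
        self.alpha = 1
        self.beta = 1

    def reinforce(self):
        self.alpha += 1

    def weaken(self):
        self.beta += 1

    @property
    def mean(self):
        # Estimated probability of the interaction: alpha / (alpha + beta).
        return self.alpha / (self.alpha + self.beta)

    @property
    def observations(self):
        # Evidence behind the estimate; more observations imply a
        # narrower confidence interval.
        return self.alpha + self.beta - 2
```

With no observations the mean is the uninformative 1/2; three reinforcing events move it to 4/5.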

1) Collusion-based Representation: The first representation of the interactions is based on estimations of collusion probabilities between groups. A matrix Ĉ contains the beta distributions corresponding to the collusion probability estimation between observed groups. The estimated probability that workers of groups i and j return the same incorrect result for a job is the mean of this distribution, which we denote by ĉ_ij. This representation has one drawback: when two workers return the same result, we need another mechanism to determine whether this result is correct or incorrect, i.e., whether there is collusion or not. Since this mechanism has to be based on the previous estimates, there is a risk that the system might amplify estimation errors and fail to converge. We propose an adaptation technique in Section IV-B1 that allows us to remedy this problem.

Let L be any subset of workers (L ⊆ W) and K_L be the set of groups covering all of the workers of L, i.e., K_L = ⋃_{w∈L} O_w. Let E_L be the event that all workers of L have colluded to return the same incorrect result for a given job. The following two lemmas are used to produce estimations of the collusive behaviors between any subset of workers L.

Lemma 1.

    0 ≤ Pr[E_L] ≤ min_{(i,j)∈K_L²} (ĉ_ij)

Proof: These bounds are trivially obtained from the definitions: probabilities are non-negative, and the probability that all workers in L collude cannot exceed any pairwise collusion probability.

Lemma 2. Let K_L = i ∪ j ∪ k such that (i, j, k) ∈ Ĝ³. Then:

    max(0, (ĉ_ij + ĉ_ik + ĉ_jk − 1)/2) ≤ Pr[E_L] ≤ min(ĉ_ij, ĉ_ik, ĉ_jk)

Proof: By definition, groups i and j either collude together with k, or they collude together without k: ĉ_ij = Pr[E_L] + Pr[E_{i∪j} ∩ ¬E_L]. Analogously, ĉ_ik = Pr[E_L] + Pr[E_{i∪k} ∩ ¬E_L] and ĉ_jk = Pr[E_L] + Pr[E_{j∪k} ∩ ¬E_L]. Also, Pr[E_{i∪j} ∩ ¬E_L] + Pr[E_{i∪k} ∩ ¬E_L] + Pr[E_{j∪k} ∩ ¬E_L] + Pr[E_L] ≤ 1 because these events are disjoint. Hence:

    (ĉ_ij + ĉ_ik + ĉ_jk − 1)/2 ≤ Pr[E_L]

The upper bound is obtained by applying Lemma 1 and by observing that ∀(i,j), ĉ_ij ≤ ĉ_ii (groups never collude more with others than they do internally).
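The bounds of Lemma 2 translate directly into a few lines of code; this is an illustrative sketch (the function name is ours), using pairwise collusion estimates as inputs.

```python
def lemma2_bounds(c_ij, c_ik, c_jk):
    """Bounds on Pr[E_L] for K_L = {i, j, k} (Lemma 2), computed from
    the three pairwise collusion estimates."""
    lower = max(0.0, (c_ij + c_ik + c_jk - 1) / 2)
    upper = min(c_ij, c_ik, c_jk)
    return lower, upper

# Worst case cited in the text: all pairwise estimates equal to 1/3
# give the bounds [0, 1/3], a range of 1/3.
```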

The worst-case scenario that maximizes the range of the bounds given in Lemma 2 is achieved when ĉ_ij = ĉ_ik = ĉ_jk = 1/3. In this case, the range defined by the bounds is 1/3.

2) Agreement-based Representation: The second representation is based on absolute observations, i.e., agreements between groups. A matrix Â contains the beta distributions corresponding to the agreement probability estimation between the observed groups. The probability that workers of group i return the same result (correct or incorrect) as the workers of group j is the mean of the corresponding distribution, which we denote by â_ij. With this absolute representation, there is no need for an external mechanism to determine whether the result is correct, and hence the system will converge.

The following two lemmas provide bounds for estimating the probability that a set of workers L collude together to return the same incorrect result, i.e., Pr[E_L].

Recall that the largest group is indexed to be 1 andis assumed to be the non-colluding group.

Lemma 3. Let K_L = i ∪ j such that (i, j) ∈ Ĝ². Then:

    max(0, â_ij − â_1i, â_ij − â_1j) ≤ Pr[E_L] ≤ min(â_ij, (1 + â_ij − â_1i − â_1j)/2)

Proof: We separate the sample space into the following disjoint events:

    ¬E_i ∩ ¬E_j        neither group i nor j colludes
    ¬E_i ∩ E_j         group j colludes but not group i
    E_i ∩ ¬E_j         group i colludes but not group j
    ¬E_L ∩ E_i ∩ E_j   groups i and j collude, but not together
    E_L                groups i and j collude together

By construction, â_1i = Pr[¬E_i ∩ ¬E_j] + Pr[¬E_i ∩ E_j], â_1j = Pr[¬E_i ∩ ¬E_j] + Pr[E_i ∩ ¬E_j], and â_ij = Pr[¬E_i ∩ ¬E_j] + Pr[E_L]. Since the whole sample space is covered by these events, the sum of their probabilities is equal to 1. Thus:

    Pr[E_L] = (1 + â_ij − â_1i − â_1j)/2 − Pr[¬E_L ∩ E_i ∩ E_j]/2

Pr[¬E_L ∩ E_i ∩ E_j] ≤ 1 − â_1i because group i colludes less often than it agrees with the largest group (assumed to be a non-colluding group). Analogously, Pr[¬E_L ∩ E_i ∩ E_j] ≤ 1 − â_1j, and the lower bound can then be deduced. The upper bound is derived by noting that probabilities are non-negative and that the collusion probability between groups i and j cannot exceed their agreement.

The largest range of the bounds given in Lemma 3 is 1/3, reached when â_ij = â_1i = â_1j = 1/3.

Lemma 4.

    0 ≤ Pr[E_L] ≤ min_{(i,j)∈K_L²} ( min(â_ij, (1 + â_ij − â_1i − â_1j)/2) )

Proof: This is a direct generalization of Lemma 3.
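The agreement-side upper bound of Lemma 4 can likewise be sketched in a few lines (the function name is ours); each pair of groups in K_L contributes its Lemma 3 upper bound, and the minimum is taken.

```python
def lemma4_upper_bound(pairs):
    """Upper bound on Pr[E_L] (Lemma 4): pairs holds one tuple
    (a_ij, a_1i, a_1j) per pair of groups in K_L, where index 1 is the
    largest (assumed non-colluding) group."""
    upper = min(min(a_ij, (1 + a_ij - a_1i - a_1j) / 2)
                for (a_ij, a_1i, a_1j) in pairs)
    # Pr[E_L] is a probability, so the bound is clamped at 0.
    return max(0.0, upper)

# Worst case of Lemma 3: a_ij = a_1i = a_1j = 1/3 gives the upper
# bound 1/3 (a range of 1/3, as noted in the text).
```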

3) Relation between both Representations: We nowpresent some mathematical relations between these tworepresentations.

Theorem 1. Let C = (c_ij) and A = (a_ij) represent the actual group collusion and agreement matrices, respectively. The largest group is indexed to be 1 and is assumed to be the non-colluding group. Then we can bound A from C and C from A with the following equations:

    a_ij ≤ 1 + 2c_ij − c_ii − c_jj

    c_ij ≤ (1 + a_ij − a_1i − a_1j)/2

Proof: Groups i and j agree either if they do not collude or if they collude together. Hence:

    a_ij = Pr[E_{i∪j} ∪ ¬(E_i ∪ E_j)]
    a_ij = Pr[E_{i∪j}] + 1 − Pr[E_i ∪ E_j]
    a_ij = c_ij + 1 − Pr[E_i] − Pr[E_j] + Pr[E_i ∩ E_j]
    a_ij = c_ij + 1 − c_ii − c_jj + Pr[E_{i∪j}] + Pr[¬E_{i∪j} ∩ E_i ∩ E_j]

As there is the case where groups i and j collude individually, we end up with the following bound:

    a_ij ≤ 2c_ij + 1 − c_ii − c_jj

Bounding matrix C from matrix A is an application of Lemma 4:

    c_ij ≤ (1 + a_ij − a_1i − a_1j)/2

4) Evaluation of the errors in Equation 1: Based on the above results, we can now expand Equation (1). We use the upper bound of the estimated probability of collusion of the workers in g ∪ g′ to compute e_{g∪g′}. Hence, when using the collusion representation we have:

    e_{g∪g′} = |c_{g,g′} − min_{(i,j)∈K²_{g∪g′}} (ĉ_ij)|

and when using the agreement representation:

    e_{g∪g′} = |c_{g,g′} − min_{(i,j)∈K²_{g∪g′}} ( min(â_ij, (1 + â_ij − â_1i − â_1j)/2) )|
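Both error terms share the same shape: the absolute gap between the actual probability and the representation's upper bound. A minimal sketch (function name ours):

```python
def error_vs_upper_bound(actual, upper_bound):
    """e_{g∪g'}: absolute gap between the actual probability that all
    workers of g ∪ g' collude and the estimated upper bound, whichever
    representation produced that bound."""
    return abs(actual - upper_bound)

# Collusion representation: the bound is the smallest pairwise
# estimate over K_{g∪g'}.
e = error_vs_upper_bound(0.2, min([0.25, 0.3]))
```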

B. On-line Algorithm

Above, we have described the data structures used in the algorithm. We now describe how groups are built and updated. This is done through the merging and splitting of worker groups. Initially, each worker is put in a singleton group. After some interactions, there may be enough observations to determine a similarity between two groups, which are then merged into a single group. Since the conditions for merging groups of workers are statistical, erroneous merges can happen. In this case, a worker should be separated from a set of workers when it is observed to behave differently. We call this operation a split. The objective is to create a correspondence between the observed groups of similar workers and the actual non-colluding and colluding groups. The conditions for merging and splitting are different in the two representations (collusion and agreement). Note that, due to merges and splits, the sizes of the agreement and collusion matrices change with time.
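The merge and split operations described above act on a partition of the workers; this is an illustrative sketch (class and method names are ours), not the authors' data structure.

```python
class Partition:
    """Observed grouping O_w: every worker starts in a singleton
    group; merge unites two groups, split pulls one worker back out
    into a singleton."""

    def __init__(self, workers):
        self.group_of = {w: frozenset([w]) for w in workers}

    def merge(self, w, v):
        # Unite the groups of w and v and remap all their members.
        g = self.group_of[w] | self.group_of[v]
        for u in g:
            self.group_of[u] = g

    def split(self, w):
        # Separate w from its group when it behaves differently.
        rest = self.group_of[w] - {w}
        for u in rest:
            self.group_of[u] = rest
        self.group_of[w] = frozenset([w])
```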


Algorithm 1: Dynamic merge and split with collusion representation

    foreach event e do
        if e = <t, w, j, r> then                 // Job result
1:          foreach v ∈ W_{r′,j}, r′ ≠ r do      // v found a result different than w
2:              if |W_{r′,j}| ≠ 1 then           // The result cannot be a failure because several workers have computed it
3:                  if it corresponds to an observation between two worker sets that was not already considered for job j then
4:                      if O_v = O_w then        // v and w computed two different results but are observed to be in the same group
5:                          split w from O_w
6:                      else                     // v and w are not observed to be in the same group
7:                          decrease the collusion probability estimation between O_v and O_w
8:      else                                     // e = <t, j>: Job completion
9:          execute the external mechanism for certifying the best result among those generated for job j
10:         foreach result r computed for job j do
11:             if |W_{r,j}| = 1 then
12:                 the result might be a failure, thus the observation is discarded
13:             else
14:                 foreach (v, w) ∈ W²_{r,j} do
15:                     if the interaction between O_v and O_w has not already been considered for job j then
16:                         if r is the certified result then
17:                             decrease the collusion probability estimation between O_v and O_w
18:                         else
19:                             increase the collusion probability estimation between O_v and O_w
20:                         if O_v ≠ O_w and a merge is possible then
21:                             merge O_v and O_w

1) Collusion-based Grouping: The procedure used with the collusion representation is depicted in Algorithm 1. We call W_{r,j} the set of workers which have all computed result r for job j. Each event is examined in chronological order, as would be done in an on-line scheduling framework. The observed groups with which w disagrees are then considered (line 1). If the interaction is relevant and new, either a split happens, because each worker in the same group should agree (line 5), or the collusion estimation is decreased, as both groups disagree (line 7). The second kind of event denotes the termination of a job, which triggers the certification mechanism (line 9). The proposed algorithm does not depend on any specific certification mechanism, and hence this mechanism is not described here. However, it is expected that such a mechanism is sufficiently accurate to give relevant answers and ensure convergence. Once a result has been certified and thus considered to be correct, all of the results are examined (line 10). If only one worker has computed a particular result, it is possible that the system faces a failure, and hence no update of the collusion probability is performed (line 12). Otherwise, we consider all pairs of workers that have returned this result (line 14). If the interaction between the observed groups O_w and O_v has not been accounted for (line 15), we decrease or increase the estimation of the collusion probability between these two groups based on whether r is the certified result or not (lines 16-19). Last, if the two groups are different but share common characteristics, we merge them (lines 20-21). In all cases (lines 7, 17, 19), the update of the collusion probability is done using the method described in Section IV-A.

By assumption, the largest group must be the non-colluding group (the percentage of colluders is assumed to always be below 50%), and hence the probability of collusion within this group must be 0. However, when a merge or split happens, the largest group may change, and its internal collusion probability may not necessarily be 0. It is the job of the adaptation process to update the observed collusion probabilities to make them coherent. Without loss of generality, let 1 be the index of the new largest group. The adaptation process modifies the matrix Ĉ in two steps: the collusion matrix is first converted into an agreement matrix, without changing the composition of the observed groups; bounds are then used to regenerate a collusion matrix Ĉ′. From Theorem 1, we have the following inequality for each element of the new matrix:

    ĉ′_ij ≤ ĉ_ij − ĉ_1i − ĉ_1j + ĉ_11

As this is the best bound we have, we use it for computing the adapted matrix.
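Applied element-wise to the matrix of collusion means, the adaptation bound ĉ′_ij ≤ ĉ_ij − ĉ_1i − ĉ_1j + ĉ_11 looks as follows (an illustrative sketch on plain floats; the function name is ours, and values are clamped to [0, 1] since they are probabilities):

```python
def adapt_collusion(c_hat):
    """Adaptation step sketch: regenerate the collusion matrix after
    the largest observed group (index 0 here) changes, using the bound
    c'_ij <= c_ij - c_1i - c_1j + c_11, clamped to [0, 1]."""
    m = len(c_hat)
    return [[min(1.0, max(0.0,
                          c_hat[i][j] - c_hat[0][i] - c_hat[0][j]
                          + c_hat[0][0]))
             for j in range(m)] for i in range(m)]

c = [[0.0, 0.1, 0.1],
     [0.1, 0.6, 0.4],
     [0.1, 0.4, 0.5]]
c2 = adapt_collusion(c)
```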

These arithmetic operations are actually performed on the beta distributions related to the collusion estimations. Since addition operations over random variables propagate errors, the resulting distributions have less precision. Adaptation should thus be avoided if possible.
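The estimations themselves can be maintained as beta distributions, in the spirit of the beta reputation system [11]. A minimal sketch follows; the one-standard-deviation error interval used here is our assumption, not necessarily the interval used in the paper:

```python
class BetaEstimate:
    """A probability estimate backed by a Beta(alpha, beta) distribution,
    starting from a uniform Beta(1, 1) prior."""

    def __init__(self):
        self.alpha = 1.0  # pseudo-count of positive (collusion) observations
        self.beta = 1.0   # pseudo-count of negative observations

    def update(self, colluded):
        if colluded:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    @property
    def error(self):
        # standard deviation of the beta distribution, which shrinks as
        # observations accumulate
        a, b = self.alpha, self.beta
        return (a * b / ((a + b) ** 2 * (a + b + 1.0))) ** 0.5
```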

The merging condition is critical for the success of the algorithm. The key idea is to allow early but possibly imprecise merges, and late but precise ones. Two groups need to be merged if they collude together with the same probability as when they are operating alone. Let us first describe some notation for specifying the merging condition between observed groups i and j (line 20). Let cii, cij and cjj be the collusion estimates. Let N(c) be the number of observations that led to c and e(c) its error interval. Let m be the number of observed groups. Then, i and j are merged if:

|cii − cij| < (e(cii) + e(cij))/2 ∧ |cii − cjj| < (e(cii) + e(cjj))/2 ∧ |cij − cjj| < (e(cij) + e(cjj))/2

and if

min(N(cii), N(cij), N(cjj)) > γ × max(|i ∪ j|, n/m) / [(1 − |cii − cij|)(1 − |cii − cjj|)(1 − |cij − cjj|)]

where γ = 2.5 if the largest observed group will change due to this merge, and 1 otherwise.

The first set of conditions checks the consistency of the values. The second condition imposes a minimal number of observations that depends on four principles. The γ value, found empirically, reduces the likelihood of merges if the largest observed group will change due to the merge (as it would then induce an adaptation which introduces uncertainties in the estimations). |i ∪ j| allows us to increase the precision requirement when observed groups grow larger. The idea of an early and imprecise merge is meaningless when large observed groups already exist (this is the meaning of n/m). Finally, the denominator takes into account the numerical similarities between the collusion estimations.

The value of 2.5 for γ is empirical. It has to be larger than 2, but values greater than 3 hinder the convergence speed. The chosen value has been experimentally tested and leads to a good compromise between accuracy and convergence speed.
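Putting the two conditions together, the merging test can be sketched as follows. Estimates, error intervals and observation counts are passed in explicitly, and all names are illustrative rather than taken from the paper's code:

```python
def should_merge(c_ii, c_ij, c_jj, e, N, size_union, n, m, largest_changes):
    """Return True if observed groups i and j satisfy both the consistency
    condition (estimates agree within their error intervals) and the
    minimal-observation condition described in the text."""
    consistent = (abs(c_ii - c_ij) < (e['ii'] + e['ij']) / 2 and
                  abs(c_ii - c_jj) < (e['ii'] + e['jj']) / 2 and
                  abs(c_ij - c_jj) < (e['ij'] + e['jj']) / 2)
    gamma = 2.5 if largest_changes else 1.0  # discourage merges forcing adaptation
    denom = ((1 - abs(c_ii - c_ij)) *
             (1 - abs(c_ii - c_jj)) *
             (1 - abs(c_ij - c_jj)))
    threshold = gamma * max(size_union, n / m) / denom
    return consistent and min(N['ii'], N['ij'], N['jj']) > threshold
```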

An additional splitting mechanism takes place when an update is performed on the collusion probabilities. If the collusion estimation of the largest observed group has to be increased, then the worker responsible for the increase is split from the observed group.

2) Agreement-based Grouping: The agreement representation requires a simpler algorithm than the collusion case, as shown in Algorithm 2. First, the agreement and disagreement observations are easy to detect. Moreover, the second kind of event (< t, j >) is not needed and there is no need for a result certification mechanism.

Merge and split operations are simpler. Observed groups i and j are merged only if no disagreement between the groups has been observed (i.e., aij = 1) and if the number of observations is greater than |i ∪ j|. Splits only happen if two workers from the same observed group disagree.
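The agreement bookkeeping of Algorithm 2 for a single result event can be sketched as below. This simplified version only updates pairwise (agreeing, total) observation counts and omits the merge and split steps; every name is illustrative:

```python
def handle_job_result(w, r, results_by_value, group_of, agreement, seen):
    """Process one event <t, w, j, r>: worker w returned result r for job j.
    results_by_value maps each distinct result of job j to the set of workers
    that produced it; agreement[(g1, g2)] holds (agreeing, total) counts."""
    for r2, workers in results_by_value.items():
        if len(workers) == 1:
            continue  # a lone result may be a failure, not (dis)agreement
        for v in workers:
            if v == w:
                continue
            pair = tuple(sorted((group_of[v], group_of[w])))
            if pair in seen:
                continue  # one observation per pair of worker sets per job
            seen.add(pair)
            agree, total = agreement.get(pair, (0, 0))
            agreement[pair] = (agree + (1 if r2 == r else 0), total + 1)
```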

V. EMPIRICAL VALIDATION

Both approaches are compared empirically using large-scale (on the order of 1 million events) real-life inputs (i.e., values of < t, w, j, r > and < t, j >). Since no complete trace is available for our study as of this writing, we aggregate different traces of several desktop grid projects with a synthetic model of threat in order to obtain complete traces. Both heuristics are then run with all of the generated traces.

A. Input Description

The first main source of traces used is the Failure Trace Archive (FTA [12]), which provides worker availability traces of parallel and distributed systems. In particular, we use availability and performance information from the SETI@home project. Additional projects we used include: Overnet, a P2P network; and Microsoft, an enterprise network. While these traces allow us to assign jobs to machines, the job durations need to be defined to time-stamp the inputs. Michela Taufer et al. [13] modeled the in-progress delay, i.e., the computation time required by jobs in several desktop grid projects. Additionally, she provided us with a workload trace of the docking@home volunteer project, which we have used.

Jobs are assigned to workers using a quorum-based scheduling algorithm such as the one used in BOINC. Namely, each job is first assigned to k workers. Whenever a quorum of q is achieved for a given job (q workers agreeing on an identical result), no more workers are assigned to this job. Moreover, a job cannot be assigned to more than l workers, though a large l may be needed to detect collusion. The performance heterogeneity is determined by normalizing the job durations for each worker. For example, we consider that a machine twice as fast as the mean speed will take half as long for a given job. Whenever a worker becomes unavailable, its current computation is postponed until it rejoins the network. Finally, a timeout is associated to each job (14 days) and to each worker computation (4 days).
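The replication policy above can be sketched for a single job as follows. The function name and return convention are illustrative, and the defaults match the paper's default quorum setting (q, l) = (3, 10):

```python
from collections import Counter

def certify_with_quorum(results, q=3, l=10):
    """Consume worker results for one job in arrival order. Certification
    succeeds when q identical results agree (quorum); the job is abandoned
    once l workers have been assigned without reaching a quorum."""
    counts = Counter()
    assigned = 0
    for r in results:
        assigned += 1
        counts[r] += 1
        if counts[r] >= q:
            return r, assigned      # quorum reached: r is certified
        if assigned >= l:
            return None, assigned   # replication cap reached
    return None, assigned           # trace ended before quorum
```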

Algorithm 2: Dynamic merge and split with agreement representation

foreach event e do
    if e = < t, w, j, r > then // job result
        if |Wr,j| ≠ 1 then // the result cannot be a failure because several workers have computed it
            foreach v ∈ Wr,j do // v and w found the same result
                if it corresponds to an observation between 2 worker sets that was not already considered for job j then
                    increase the agreement probability estimation between Ov and Ow
                    if Ov ≠ Ow and merge is possible then
                        merge Ov and Ow
        foreach v ∈ Wr′,j, r′ ≠ r do // v and w found a different result
            if |Wr′,j| ≠ 1 then // the result cannot be a failure because several workers have computed it
                if it corresponds to an observation between 2 worker sets that was not already considered for job j then
                    if Ov = Ow then
                        split w and v from Ow
                    else
                        decrease the agreement probability estimation between Ov and Ow

We overlay a threat model over the trace which allows workers to return erroneous results. Any worker returns a correct result by default, except if it fails or if it colludes. Failures are modeled as follows: a percentage of workers are completely reliable; unreliable workers return a reliable result with the same reliability probability. Colluding groups are based on the same principle: several "fractions" of colluders, one for each group, are specified; for each group, a probability of collusion is specified.

Table II depicts the empirical settings. For each parameter, every tested value is used (with all the other parameters set to their default value) for generating a trace. For the reliability and the collusion related parameters, 4 scenarios for reliability and 9 scenarios for collusion are generated. The additional Pair setting means that 40% of the workers belong to one of two colluding groups of equal size. In the first group, the probability to collude is 0.2, while in the second group workers always collude. In addition, we have used 5 parts of the huge SETI@home trace at different time periods to yield separate traces. This leads to 27 scenarios plus the default one. For each scenario, the seeds used for random generation (workload model and threat setting) take 20 distinct values. We then generate 560 traces on which both heuristics are run.

B. Results Analysis

For the first plot (Figure 2), we show a typical run that illustrates the accuracy evolution with time. After less than 10 hours, the RMSD drops for both approaches and stabilizes until the end of the trace. For this trace, the agreement-based algorithm eventually regroups workers into 3 sets (of sizes 79, 20 and 1), which matches the actual sets except for the singleton. Additionally, the RMSD is largely impacted by the agreement estimation between the two main sets, which is 0.8 when it should be 0.5 (however, there is no error in the internal agreement probability of 1 in both cases). The collusion-based algorithm produces the same sets of sizes 79 and 1. However, the set of colluders, of size 20, is separated into 3 sets of sizes 10, 7 and 3. Its probability of collusion is estimated to be 0.28 when it is 0.5 in reality, and it is the main source of error that impacts the RMSD. In both cases, the estimated worker group structure is accurate as there are no false positives or false negatives (workers wrongly considered as colluders or wrongly considered as non-colluders).

Although both heuristics have converged in Figure 2, the collusion is underestimated (and the agreement overestimated). This phenomenon is related to the following conditions that must all hold: a collusion probability lower than 1, a low quorum value, and a small fraction of colluders. With these settings, there are many cases where a worker assigned to a given job produces a wrong result due to collusion that is not matched by any of the results produced for this job. This happens if this worker is the only one of its colluding group assigned to this job, and it is then treated as a buggy worker. In this case, the wrong result is considered by our approach as being due to unreliability and not collusion (see line 12 of Algorithm 1). This inability to determine the cause of an incorrect orphan result implies that even if our techniques converge, they will produce less accurate estimations. A simple solution would be to increment the quorum (to reduce the risk of having orphan results) or to resubmit the job to a new worker in the same group as the orphan's. However, these workarounds concern the scheduling level and are thus left to future work.

The next set of figures shows results across the different traces. Generally, it is expected that our heuristics converge before the end of the traces (for instance, 3 days for most of the traces). We have empirically observed that when the RMSD is below 0.2, the structure of the worker groups is correctly estimated (i.e., observed groups are close to the real groups).
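The RMSD accuracy metric used throughout can be computed as below, flattening the entries of the estimated and true matrices being compared; a minimal sketch:

```python
def rmsd(estimated, real):
    """Root-mean-square deviation between estimated and true probabilities,
    given as two flat sequences of matching length."""
    assert len(estimated) == len(real)
    total = sum((e - r) ** 2 for e, r in zip(estimated, real))
    return (total / len(real)) ** 0.5
```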

Figures 3 to 8 illustrate the convergence time and stabilized accuracy measures for both heuristics. Each figure depicts the effect of the variation of one parameter, the others being set to their default value. The


Parameter                           | Default value | Tested values
Worker availability trace           | Seti@home     | Overnet, Microsoft, . . .
Workload model or trace             | charmm        | mfold, docking@home
Quorum (k, q, l)                    | (4, 3, 10)    | (2, 1, 2), (19, 15, 19)
Number of workers (n)               | 100           | 30, 50, 70, 80, 200
Reliability (fraction, probability) | (0.7, 0.7)    | {0.7, 0.99} × {0.7, 0.99} \ (0.7, 0.7)
Collusion (fraction, probability)   | (0.2, 0.5)    | {0.02, 0.2, 0.49} × {0.01, 0.5, 1} \ (0.2, 0.5), Pair

Table II: Experimental Parameters

Figure 2: Analysis of a specific run with default settings (collusion RMSD, the error committed at each iteration, versus time in days, for the agreement and collusion representations).

results are graphically represented through the use of violin plots. These plots are a combination of a box plot and a kernel density plot. The box plot is the internal line and rectangle in the painted area. The rectangle extends from the median (the white mark) in both directions to the first and third quartiles. The line extends to 1.5 times the interquartile range beyond each of the quartiles. Intuitively, most measures are contained in the area where the line is drawn and 50% of them are in the rectangle. The kernel density is an estimation of the density of measured points. It can be thought of as a smoothed histogram. The larger an area, the more points are contained within it. Violin plots are useful when multiple modalities exist, i.e., when measures are regrouped in multiple locations.

Several traces are used and the efficiency of our methods is represented in Figure 3. We see that simulations behave similarly for each SETI@home trace. Namely, both heuristics converge in less than half a day in most of the cases. Additionally, the stabilized accuracy is better with collusion, though with agreement it is still within a tolerable range. With the Microsoft trace, the convergence time is lower. Indeed, participants are more active in an enterprise network and thus there are more interactions. Although the Overnet trace reveals stabilized accuracy similar to that of the other traces, the convergence times are strangely distributed. This has yet to be investigated.

Several workloads are represented in Figure 4. We can draw the same conclusions about the stabilized accuracy, namely, collusion is better than the agreement representation and results are independent of the workload. The convergence times for the docking@home project are out of range because each job takes much longer as compared to the charmm and mfold workload models. For the latter two, however, the convergence time is very similar.

Figure 4: Workload model or trace (convergence time in days and stabilized collusion RMSD for the charmm, docking and mfold workloads, with agreement and collusion representations).

The effect of the quorum is represented in Figure 5. As expected, a quorum of one does not allow either of our approaches to converge. Indeed, with such a quorum, there are no interactions between worker groups (for building an observation, at least four results distributed in two sets of equal size are needed). Although the convergence time is similar with a quorum of 15, the agreement representation becomes extremely precise. Intuitively, this is due to the increase in the number of observations that are made.

Figure 3: Availability traces (convergence time in days and stabilized collusion RMSD for the seti1, microsoft, overnet03 and seti2 to seti5 traces).

Figure 5: Quorum effect (convergence time in days and stabilized collusion RMSD for quorums of 1, 3 and 15).

Figure 6 depicts the efficiency of our methods when the number of workers in the system increases. Results are stable until there are more than 100 workers, with a slight increase in convergence time. With 200 workers, half of the runs do not converge before the end of the traces, which explains why the precision degrades. Also, the maximum convergence time decreases. Indeed, with 1 million events in the input trace, time advances more slowly if there are more workers and it takes a longer time to get an accurate estimate of this environment.

Figure 6: Worker quantity effect (convergence time in days and stabilized accuracy for 30, 50, 70, 100 and 200 workers).

In Figures 7 and 8, we vary both the fraction of colluders and their probability to collude. For example, in the first column of Figure 7, the collusion probability is 0.01, but the collusion fraction is either 0.02, 0.2, or 0.49. The scenario where the probability is 0.5 and the fraction is 0.02 appears to be the most difficult. Neither of our approaches is able to converge, and the upper modality in both plots corresponds to this setting. In every case, specifying a probability of 0.01 leads to a good accuracy with fast convergence relative to the RMSD metric. However, for such small probability values, the RMSD is perhaps not the best way to measure accuracy. Indeed, our algorithms converge to estimations equal to 0, for which the RMSD (which computes absolute difference) is very low. Actually, this setting is very hard to detect and the quick convergence indicates that our techniques do not detect the collusion. These low values explain the lower modality in Figure 8. On the other hand, colluding almost surely allows an easier and more precise detection for both heuristics than when collusion occurs with a 0.5 probability. The high precision indicates that the heuristics perform well when the probability value to estimate is close to 1. Another reason for which a probability of 1 is easier to detect is that it requires statistically fewer observations to be precisely estimated. It was expected that having a near majority of colluders (49%) would be harder to detect than with a significant but limited fraction of colluders. Although there are indeed some cases with the 0.49 fraction that do not converge, the difference with the 0.2 fraction is small. Finally, dealing with a pair of colluding groups does not present a challenge for our approach.

Figure 7: Collusion probabilities effect (convergence time in days and stabilized collusion RMSD for collusion probabilities 0.01, 0.5, 1 and the Pair setting).

Figure 8: Colluders' fraction effect (convergence time in days and stabilized collusion RMSD for collusion fractions 0.02, 0.2, 0.49 and the Pair setting).

Lastly, as our systems rely on the use of beta distributions, they are able to give an error interval for each estimation, called the introspection. We assume that such an interval is an estimation of the error and we measure the accuracy of the introspection (the error on the error estimation). All these errors are aggregated with the RMSD (between the estimated error and the real error) and their statistical repartition for each run is shown in Figure 9 as an empirical cumulative distribution function (ECDF). Each value is the median RMSD over all iterations for one specific run. All of the values greater than 0.22 (8% of the measures) correspond to settings for which our algorithms do not converge. 63% of the error RMSDs are between 0.1 and 0.22. This is related to the fact that, in most of the settings, the final collusion RMSD is subject to these bounds.
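An ECDF such as the one in Figure 9 can be reproduced from the per-run median RMSD values with a few lines; a minimal sketch:

```python
def ecdf(values):
    """Return the sorted values and, for each one, the fraction of measures
    less than or equal to it (the empirical cumulative distribution)."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]
```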

Finally, better introspection is observed for the collusion representation.


Figure 9: Empirical cumulative distribution function of the introspection accuracy for all runs (ECDF versus the median RMSD of the estimated error for each run, for the agreement and collusion representations).

VI. CONCLUSION

Coordinated attacks against desktop grids remain a major threat to their correct functioning. In this paper, we proposed two algorithms for detecting and characterizing collusion in a desktop grid. Although the threat model is very strong (a worker may belong to one or several groups, groups can cooperate, colluders may sometimes send a correct result to stay undetected, colluders are not required to synchronize to send the same incorrect result, and no information is known a priori by the server), the input events are minimal (we only know which workers have returned a given result for a given job), and no assumption is made on the way jobs are scheduled to workers, the proposed solutions are very effective. Indeed, experimental results show that they are able to accurately and rapidly characterize collusion most of the time. Among the two algorithms, we see that the collusion method is slightly better than the agreement method, especially in terms of accuracy. However, the agreement method is much simpler to implement and does not rely on any certification mechanism.

Future work is directed towards non-stationarity (when worker behavior changes with time). A simple method would be to reset the probabilities from time to time. Other techniques for non-stationarity, including aging of prior observations to enable more rapid transitions, will be studied. We also plan to work on the design of a scheduling method for a desktop grid that would defeat collusion. Indeed, we think that our characterization mechanism is accurate and fast enough to be used at the scheduling level. Finally, we anticipate that a scheduling algorithm could improve the collusion characterization by adapting the replication ratio according to its estimated accuracy.

REFERENCES

[1] D. P. Anderson, "BOINC: A system for public-resource computing and storage," in GRID. IEEE Computer Society, 2004, pp. 4–10.

[2] D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer, "SETI@home: An Experiment in Public-Resource Computing," Communications of the ACM, vol. 45, no. 11, pp. 56–61, 2002.

[3] "Climateprediction.net," http://climateprediction.net/.

[4] "Folding@home," http://folding.stanford.edu/.

[5] S. Zhao, V. M. Lo, and C. Gauthier-Dickey, "Result verification and trust-based scheduling in peer-to-peer grids," in Peer-to-Peer Computing. IEEE Computer Society, 2005, pp. 31–38.

[6] J. D. Sonnek, A. Chandra, and J. B. Weissman, "Adaptive reputation-based scheduling on unreliable distributed infrastructures," IEEE Trans. Parallel Distrib. Syst., vol. 18, no. 11, pp. 1551–1564, 2007.

[7] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina, "The EigenTrust Algorithm for Reputation Management in P2P Networks," in WWW '03: Proceedings of the 12th International Conference on World Wide Web. New York, NY, USA: ACM, 2003, pp. 640–651.

[8] E. P. Duarte Jr. and T. Nanya, "A hierarchical adaptive distributed system-level diagnosis algorithm," IEEE Transactions on Computers, vol. 47, no. 1, pp. 34–45, 1998.

[9] M. Yurkewych, B. N. Levine, and A. L. Rosenberg, "On the cost-ineffectiveness of redundancy in commercial P2P computing," in ACM Conference on Computer and Communications Security. ACM, 2005, pp. 280–288.

[10] G. Silaghi, P. Domingues, F. Araujo, L. M. Silva, and A. Arenas, "Defeating Colluding Nodes in Desktop Grid Computing Platforms," in Parallel and Distributed Processing (IPDPS), Miami, Florida, USA, Apr. 2008, pp. 1–8.

[11] A. Jøsang, "The beta reputation system," in Proceedings of the 15th Bled Electronic Commerce Conference, 2002.

[12] "Failure Trace Archive," http://fta.inria.fr/.

[13] T. Estrada, M. Taufer, and K. Reed, "Modeling job lifespan delays in volunteer computing projects," in CCGRID '09: Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. Washington, DC, USA: IEEE Computer Society, 2009, pp. 331–338.