A Markov Decision Process Model for High Interaction Honeypots


Information Security Journal: A Global Perspective, 22:159–170, 2013. Copyright © Taylor & Francis Group, LLC. ISSN: 1939-3555 print / 1939-3547 online. DOI: 10.1080/19393555.2013.828802

A Markov Decision Process Model for High Interaction Honeypots

Osama Hayatle1, Hadi Otrok2, and Amr Youssef1

1Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Quebec, Canada; 2Electrical and Computer Engineering Department, Khalifa University of Science, Technology & Research, Abu Dhabi, UAE

ABSTRACT Honeypots, which are traps designed to resemble easy-to-compromise computer systems, have become essential tools for security professionals and researchers because of their significant contribution in disclosing the underworld of cybercrimes. However, recent years have witnessed the development of several anti-honeypot technologies. Botmasters can exploit the fact that honeypots should not participate in illegal actions by commanding the compromised machine to act maliciously against specific targets which are used as sensors to measure the execution of these commands. A machine that is not allowing the execution of such attacks is more likely to be a honeypot. Consequently, honeypot operators need to choose the optimal response that balances between being disclosed and being liable for participating in illicit actions. In this paper, we consider the optimal response strategy for honeypot operators. In particular, we model the interaction between botmasters and honeypots by a Markov Decision Process (MDP) and then determine the optimal policy for honeypots responding to the commands of botmasters. The model is then extended using a Partially Observable Markov Decision Process (POMDP), which allows operators of honeypots to model the uncertainty of the honeypot state as determined by botmasters. The analysis of our model confirms that exploiting the legal liability of honeypots allows botmasters to have the upper hand in their conflict with honeypots. Despite this deficiency in current honeypot designs, our model can help operators of honeypots determine the optimal strategy for responding to botmasters' commands. We also provide simulation results that show the honeypots' optimal response strategies and their expected rewards under different attack scenarios.

KEYWORDS honeypots, botnets, Markov Decision Process

Address correspondence to Dr. Hadi Otrok, Electrical and Computer Engineering Department, Khalifa University of Science, Technology & Research, P. O. Box 127788, Abu Dhabi, UAE. E-mail: [email protected]

1. INTRODUCTION

Through the long combat between security professionals and attackers, the latter have always been one step ahead. It has always been the case that attackers create new hacking tools, use them for a while, and then security professionals come up with solutions against these tools. However, the game rules may change with the wide use of honeypots.


Honeypots are traps designed to resemble easy-to-compromise systems in order to tempt hackers to invade them and then collect valuable information about the botmasters' techniques, tools, and even their true identities. When attackers target a honeypot, all their commands, techniques, and codes are captured as they come under the observation of the security professionals who operate the honeypot. Depending on their complexity, honeypots are categorized into two groups: (i) high interaction honeypots that resemble real machines with a diversity of services, and (ii) low interaction honeypots that resemble machines with limited services (Provos, 2004).

Honeypots are becoming a major source of information for security communities and are used around the world to capture and analyze information about attackers. On the other hand, detecting honeypots has become an active area of research by both attackers and security professionals (Ferrie, 2006; Krawetz, 2004). Zou and Cunningham (2006) suggested a detection methodology that exploits the legal liability of honeypots when participating in illegal actions such as Distributed Denial of Service (DDoS) or malware spreading. In this methodology, botmasters command the compromised machine to attack targets that act as sensors. The machine's response to these commands is measured by the sensors and its true nature is determined. To prevent honeypots from recognizing these tests, Zou and Cunningham suggested mixing test commands with real attack commands such that the honeypot operators may be legally liable if they decide to execute the botmaster's commands. It is not possible for current honeypot technology to avoid such a detection technique. This raises the need for deciding the optimal strategy that prolongs the honeypot's stay in the botnet while avoiding high legal liability. In this paper, we model the interaction between the honeypot and the botmaster using a Markov Decision Process (MDP) (Puterman, 1994) in order to determine the honeypot's optimal response to such test techniques. In particular, in the first part of this work, the honeypot-botmaster interaction is modeled by an MDP with a set of states, actions, and transition probabilities. Depending on the beginning and end states, the system may acquire rewards or costs in each transition. These transitions are determined by the current state, the selected action, and the transition probabilities. Our work shows how honeypot operators can select the optimal strategy by considering different factors and parameters, for example, liability cost, honeypot operation cost, honeypot reset cost, probability of attacks, and probability of disclosure. Then we extend our model to a more realistic scenario using a Partially Observable Markov Decision Process (POMDP), which allows us to deal with the uncertainty associated with the fact that the honeypot state as determined by the botmaster is not known to the honeypot operator.

The models developed in this paper can help security professionals to:

• Determine the optimal responses to botmasters' commands.

• Prolong the honeypot lifetime inside botnets while minimizing its legal liability.

The remainder of this paper is organized as follows. In section 2, we briefly review some related work in the area of honeypot detection and the use of MDPs in the area of information systems security. In section 3, we describe our MDP-based model. In section 4, we discuss how to select the optimal policy for our model based on available parameters. In section 5, we extend the developed model using POMDP. Our simulation results and discussion are presented in section 6. Finally, we summarize our conclusions in section 7.

2. RELATED WORK

In this section, we briefly review some of the research related to the area of honeypot detection and the use of MDPs in information systems security.

2.1. Honeypot Detection

The wide use of honeypots as security surveillance sensors motivated hackers, as well as security researchers, to explore the weaknesses of these important tools and develop techniques and tools that can be used to disclose them. Ferrie (2006) described techniques that can be used to detect virtual machines, which are usually used to deploy honeypots. Detecting virtual environments relies on measuring the latencies that take place when communicating between the host operating system and the virtual machine. By comparing these latencies with the measurements of other normal machines, virtual machines can be recognized. The common deployment of honeypots as virtual machines encouraged researchers to investigate other methodologies of virtual machine detection.


Fu et al. (2006) showed the possibility to detect virtual environments by measuring the network connection times in suspicious machines and comparing them to other machines in the same environment. Krawetz (2004) showed how the "Honeypot Hunter" tool is able to detect honeypots by determining their ability to send spam emails. The operators of honeypots do not allow spam emails to be sent from within honeypots. Thus, by commanding the honeypot to send spam emails that are received by the Honeypot Hunter, it is possible to distinguish honeypots from real machines. Hayatle et al. (2012) used Dempster-Shafer theory to combine different evidence and calculate the belief about the true nature of the machine. In particular, they suggested collecting evidence throughout different phases of machine compromise, assigning a belief value for each piece of evidence that reflects the evidence's support for the machine type (i.e., normal or honeypot), and then combining the beliefs using Dempster-Shafer theory. This approach enables attackers to systematically collect and combine evidence while interacting with a suspicious machine to make a decision about its true nature.

2.2. Using Markov Decision Process in Security

MDPs are widely used as optimization tools for determining optimal strategies in automated systems (e.g., see Shirazi, Kong, & Tham, 2009; Abbas & Li, 2012). Jha et al. (2002) explained how to interpret attack graphs as MDP models and solve these models to determine the optimal defense strategy that minimizes the probability of attack success. Taibah et al. (2006) presented a dynamic defense architecture against email worms where they classified emails into categories based on their threat level and then used an MDP model to select the optimum set of actions, such as quarantine, analyze, or drop, that can be applied to maximize the architecture outcome by increasing security and decreasing the latency. Kreidl (2010) employed an MDP to determine the optimal action against attack attempts, where the author found that, depending on attack severity, it is not always optimal to defend against attacks as it is possible that the overhead caused by a continuous defense strategy may exceed the cost of the attack itself. The findings were analyzed in the cases of perfect and imperfect detectors using MDP and POMDP. Liu et al. (2009) used a POMDP to design a framework for combining intrusion detection and continuous authentication in mobile ad hoc networks (MANETs).

3. MARKOV DECISION PROCESS MODEL FOR HIGH INTERACTION HONEYPOT

In this section, we briefly review the principles of MDPs and present our developed MDP model for honeypot interaction with botmasters. An MDP (Petrik et al., 2010) is a tuple (S, A, P, r, d), where S represents the set of system states, A represents the set of possible actions, and P is a transition function

$$P: S \times S \times A \mapsto [0, 1],$$

where P(s1, s2, a) is the probability of transiting from state s1 to state s2 upon using action a. The reward function r represents the gain for using action a in state s:

$$r: S \times A \mapsto \mathbb{R}.$$

The discount factor d is the amount by which the gain is discounted over time.
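For concreteness, the tuple above can be written down directly; the minimal sketch below is our own illustration of how the sets and functions might be represented in code (the actual states, actions, probabilities, and rewards of our model are introduced in Sections 3.1–3.3).

```python
# S: states, A: actions, P: transition function, r: reward function, d: discount.
S = ["W", "C", "D"]          # waiting, compromised, disclosed (Section 3.1)
A = ["A", "N", "R"]          # allow, not-allow, reset (Section 3.2)
P = {}                       # P[(s1, s2, a)] = probability of s1 -> s2 under action a
r = {}                       # r[(s, a)] = reward for using action a in state s
d = 1.0                      # discount factor used in Section 3.4
```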

3.1. Honeypot States

At any given time, the honeypot can be in one of three possible states:

1. W (Waiting state): The honeypot is not targeted by the attacker yet and is waiting for attack attempts so that it can join a botnet. In this state, the honeypot cannot capture any information about the attackers as it does not have any interaction with them yet.

2. C (Compromised state): The honeypot has been successfully compromised by attackers and has become a member of their botnet. In this state, the honeypot is able to collect information about the attackers whenever they communicate with it. It is worth mentioning that some information is gained during the compromising phase, that is, during the transition between state W and state C.

3. D (Disclosed state): The true nature of the honeypot is detected by the attackers and it is no longer a member of their botnet. This can be the case when receiving a command to disconnect or remove the honeypot from the botnet. It is also possible to think that a honeypot is disclosed when losing interaction with the botmasters for a relatively long period.


However, security professionals must consider the different phases of the botnet life cycle (Feily et al., 2009), as it is possible that botmasters focus on expanding their botnets and do not communicate intensively with existing botnet members.

3.2. State Transitions

The transition from one state to another is determined by different factors and is associated with different rewards. In our model, transitions between states depend on two factors:

1. Honeypot actions: At each state, the honeypot can choose one of the following three actions:

• A (Allow): The honeypot allows the botmasters to compromise the system and to execute commands such as downloading files, installing and updating the bot code, and launching attacks from within the system. This enables the honeypot to infiltrate the botnet and prolong its stay in it. However, it comes with the cost of possible liability in case of participating in real attacks.

• N (Not-Allow): If a honeypot chooses this action at state W, then the honeypot will not let the attackers compromise it. Consequently, the honeypot will not be able to join any botnet. After compromising the system, that is, when the system is in state C, this action is used to prevent the attackers from sending malicious traffic from the honeypot. One reason for following this action is to avoid the liability of participating in illegal actions. Also, the honeypot can be designed to reject some commands from the attackers in order to force them to use new tools and techniques (Wagener et al., 2009). However, ignoring the botmasters' commands may allow them to disclose the honeypot's true nature.

• R (Reset): The honeypot is reset to its initial state (W). Honeypots cannot collect information after being disclosed by the attacker. Thus, to make use of these resources, honeypots must be reinitialized as new honeypots and be ready to lure new attackers.

2. Transition Probabilities: Besides the actions discussed above, the transitions between the different states in the model are also determined by the following probabilities:

• Pa: This represents the likelihood of attacking the honeypot. Pa affects the transition from state W to state C. When a honeypot is in state W and is ready to execute the attackers' commands (action A), it will become part of a botnet (move to state C) when an attack attempt occurs, which happens with probability Pa.

• Pd: This represents the likelihood that botmasters reveal the true nature of the honeypot and remove it from their botnet. Honeypot operators can consider their honeypots as disclosed when receiving a kill command or when losing interaction with the botmasters for a long period of time.

3.3. Transition Rewards

The transitions between states, including self-transitions, are associated with the following rewards/costs:

• CO (Operation cost): This cost represents the different factors that are needed to deploy, run, and control the honeypot.

• IV (Information value): One of the main purposes of honeypots is to collect information about attackers. Whenever botmasters interact with a honeypot, the latter collects information about the attackers' techniques, codes, and tools, which represents a significant source of knowledge for security professionals and the security community.

• CR (Reset cost): This represents the cost associated with resetting the honeypots, for example, the cost associated with resetting virtual machines to their initial clean state.

• CL (Liability cost): Honeypot operators might become liable for executing the botmasters' commands if such commands include illicit actions such as DDoS or spamming.

The values of the above-mentioned costs are always positive, and in practice, it is logical to assume that they satisfy the following conditions:

• CO < CL and CR < CL: security professionals avoid the legal liability due to its high value compared to other costs.

• CO < IV and CR < IV: collecting information is the main objective of honeypots. The cost of operating or resetting the honeypot is less than the value of the collected information.

In Figure 1, the transitions between states are represented as arrows from the beginning state to the end state. Each transition has a label with the format (action, reward, probability).


FIGURE 1 The developed MDP model for the honeypot interaction with botmasters.

For example, (R, −IV, 1) means that the system transits with probability 1 when action R is used and the associated cost is −IV. The system starts in state W, for example, and waits for an attack by botmasters, which occurs with probability Pa. Once it is under attack, the honeypot can choose whether to allow the attacker to compromise it (action A) or not (action N). If the attack is allowed to succeed, the honeypot will be compromised and will join the botnet (state C). Otherwise, it performs a self transition into state W. In state C, the honeypot can hide its true nature by executing the botmaster commands (action A), which makes the botmasters think that the honeypot is a real machine. Consequently, more information (IV) can be gained. However, this comes with risking the possibility of being legally liable if the honeypot participates in illegal actions (CL). If the honeypot chooses not to execute the botmaster's commands (action N), there is a probability of disclosure (Pd) when the botmasters send test commands. This may lead the botmasters to disconnect the honeypot from the botnet and cause loss of information (IV). If the honeypot is disclosed, it moves to state D and stays there until it is reset to the starting state W (action R). Also, the honeypot can be reset to the starting state W from any state to be ready and waiting for a new botmaster's attack. This reset action comes with the cost of CR. Figure 1 depicts the developed MDP model.

3.4. Analysis of the Developed Model

A recurrent MDP can be analyzed over either a finite or an infinite planning horizon. To simplify our analysis, in this work we assume infinite interactions between the honeypot and the botmaster. In this case, different approaches can be used to calculate the gain of the MDP model (Sheskin, 2011). In our work, we calculate the expected average reward per period using the product of the steady-state probability and reward vectors as described in Sheskin (2011), considering a discount factor d = 1. In what follows, we provide the equations that are used to calculate the gain obtained from choosing a specific policy, that is, a specific action for each state.

Let N be the number of states in an MDP and let P denote the matrix of transition probabilities

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1N} \\ p_{21} & p_{22} & \cdots & p_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ p_{N1} & p_{N2} & \cdots & p_{NN} \end{bmatrix}, \qquad (1)$$

where $p_{ij}$ denotes the probability of transiting from state $i$ to state $j$.

The vector of steady-state probabilities is given by

$$\pi = [\pi_1 \ \pi_2 \ \cdots \ \pi_N], \qquad (2)$$

where

$$\sum_{i=1}^{N} \pi_i = 1. \qquad (3)$$

To calculate $\pi$, we need to solve the following set of equations:

$$\pi = \pi \times P, \qquad (4)$$

with consideration of Eq. (3). Let R denote the matrix of rewards


$$R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1N} \\ r_{21} & r_{22} & \cdots & r_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ r_{N1} & r_{N2} & \cdots & r_{NN} \end{bmatrix}, \qquad (5)$$

where $r_{ij}$ represents the reward for transiting from state $i$ to state $j$.

The reward vector

$$q = [q_1 \ q_2 \ \cdots \ q_N]$$

is given by

$$q_i = \sum_{j=1}^{N} p_{ij} r_{ij}, \quad i = 1, 2, \ldots, N. \qquad (6)$$

The gain is calculated as

$$G = \sum_{i=1}^{N} \pi_i q_i. \qquad (7)$$

After calculating G for each policy, the optimal policy can be determined by comparing the gains of all policies and selecting the one with the maximum value.
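To make the computation in Eqs. (1)–(7) concrete, the following sketch solves π = π × P together with Eq. (3) and evaluates the gain G for one policy. The transition matrix follows the policy-N structure described in Section 3.2; the reward entries are our own illustration, chosen so that the resulting gain agrees with Eq. (9), since the exact labels of Figure 1 are not reproduced in this text.

```python
import numpy as np

def policy_gain(P, R):
    """Expected average reward per period: solve pi = pi * P with sum(pi) = 1
    (Eqs. 3-4), form q_i = sum_j p_ij * r_ij (Eq. 6), and return G (Eq. 7)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])   # stationarity + normalization
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)     # steady-state probabilities
    q = (P * R).sum(axis=1)                        # per-state expected reward
    return float(pi @ q)

# Illustrative values only; states are ordered (W, C, D) and the policy is N.
Pa, Pd, CO, CR, IV = 0.5, 0.3, 1.0, 2.0, 20.0
P_N = np.array([[1 - Pa, Pa,     0.0],    # W: action A
                [0.0,    1 - Pd, Pd ],    # C: action N
                [1.0,    0.0,    0.0]])   # D: action R
R_N = np.array([[-CO, IV - CO,  0.0       ],   # assumed rewards (cf. Figure 1)
                [0.0, IV - CO, -(IV + CO) ],
                [-CR, 0.0,      0.0       ]])
print(policy_gain(P_N, R_N))   # agrees with Eq. (9) for these parameter values
```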

4. THE OPTIMAL POLICY

In this section, we discuss all possible policies for the honeypot and determine the optimal one under different assumptions. In our model, we have three states with three possible actions in each state. Thus, the combination of all possible policies results in 27 possible policies. However, as will be explained below, we do not have to investigate all of these policies since some of them are dominated by others.

At state D, the only action we should consider is R; a honeypot that has been disclosed by the botmaster is of no use to the honeypot operators and must be reset. Also, when the honeypot is in the waiting state (W) and chooses action N or R, it will not be able to join the botnet and cannot provide security professionals with useful information. So the only logical action at state W is to allow the botmasters to compromise the honeypot (action A). Thus, we are left with only three policies to choose from, i.e., policies in which the honeypot always chooses action A at state W and action R at state D. We denote these policies by the name of the action used at state C:

• Policy A: Use action A at state C.
• Policy N: Use action N at state C.
• Policy R: Use action R at state C.

To decide the optimal policy for the honeypot, we calculate the gain obtained under each policy using Eq. (7) and denote it as $G_s$, where $s \in \{A, N, R\}$. By calculating the gain for each of the three policies, we have

1. Policy A:

$$G_A = IV - C_O - C_L. \qquad (8)$$

The gain obtained by applying this policy can be positive only if $IV > C_O + C_L$. This can be the case where the collected information is important, such as valuable cyber intelligence information, or in the case of low legal liability, where liability can be reduced by obtaining a court permit.

2. Policy N:

$$G_N = \frac{(P_a - P_a P_d)\, IV - (P_a + P_d)\, C_O - P_a P_d\, C_R}{P_a + P_d + P_a P_d}. \qquad (9)$$

The value of $G_N$ becomes smaller when $P_d$ increases. This is because the botmasters are more likely to disclose the honeypots and remove them from their botnets, which reduces the information captured by the honeypots.

3. Policy R:

$$G_R = \frac{-C_O - P_a C_R}{1 + P_a}. \qquad (10)$$

$G_R$ always has a negative value, which represents a loss for the honeypot. Thus, policy R can be excluded since policy N always provides a better reward to the honeypot.

Deciding the optimal policy requires knowledge of all the system parameters, such as costs and probabilities. Although it is possible to estimate all costs, such as the cost of running, maintaining, and resetting the honeypot, the cost of information (based on its importance), and the cost of liability (based on the illegal action that botmasters are performing), determining the values of Pa and Pd is a relatively harder task. The probability of attack Pa can be estimated using experimental data collected over a sufficient period of time. However, estimating the probability of disclosure Pd is not easy since it is under the control of the botmasters.


FIGURE 2 Changes in the gains (GA, GN, and GR) as a function of the probability of disclosure (Pd).

On the other hand, analyzing the developed model may provide security professionals with some guidelines to determine the optimal policy. For example, suppose the system parameters have the values CO = 1, CR = 2, IV = 20, CL = 15, and Pa = 0.5.

Figure 2 shows the optimal policy for different values of Pd. In this scenario, security professionals must select policy N when the probability of disclosure is assumed to be less than 0.4, and select policy A when a higher value of Pd is assumed. As mentioned earlier, policy R cannot be optimal. When the probability of disclosure is low, honeypots will be able to stay for a longer time in botnets and collect more information even when choosing the action of not executing the botmasters' commands. However, with higher values of Pd, the optimal policy is to allow botmasters to execute their attacks from within the honeypot since this will hide the honeypot's true nature.
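As a quick check of this scenario, the closed-form gains in Eqs. (8)–(10) can be evaluated directly for the parameter values above; the short sketch below tabulates them over a few values of Pd and reports the best policy (with these values, the crossover between policies N and A falls near Pd ≈ 0.4, consistent with Figure 2).

```python
def gain_A(IV, CO, CL):
    return IV - CO - CL                                     # Eq. (8)

def gain_N(Pa, Pd, IV, CO, CR):
    num = (Pa - Pa * Pd) * IV - (Pa + Pd) * CO - Pa * Pd * CR
    return num / (Pa + Pd + Pa * Pd)                        # Eq. (9)

def gain_R(Pa, CO, CR):
    return (-CO - Pa * CR) / (1 + Pa)                       # Eq. (10)

CO, CR, IV, CL, Pa = 1.0, 2.0, 20.0, 15.0, 0.5
for Pd in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    gains = {"A": gain_A(IV, CO, CL),
             "N": gain_N(Pa, Pd, IV, CO, CR),
             "R": gain_R(Pa, CO, CR)}
    best = max(gains, key=gains.get)
    print(f"Pd={Pd:.1f}  GA={gains['A']:5.2f}  GN={gains['N']:5.2f}  "
          f"GR={gains['R']:5.2f}  -> policy {best}")
```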

5. MODELING UNCERTAINTY OF THE HONEYPOT STATE

When using an MDP, the current state of the honeypot, for example, W, C, or D, must be known to the honeypot operators in order to decide the best action to use. To determine the system's current state, the honeypot needs to look for evidence and/or observations and interpret them accordingly. Some evidence can be deterministic, while other evidence may have different interpretations that lead to uncertainty about the system's inner state and, consequently, lead to following nonoptimal actions. For example, the absence of suspicious activities while the honeypot has not joined any botnet yet assures the operators that the honeypot is in state W. On the other hand, the absence of botmasters' activities while the honeypot is in state C can be due to the fact that either the botmasters are performing other tasks, for example, expanding their botnets, or the honeypot's true nature was disclosed by the botmasters. Thus, in this case, it is not possible for the honeypot operators to decide whether the system is in state C or in state D. This uncertainty about the system state cannot be modeled using an MDP. In this case, a more general concept, one that is able to handle such uncertainty and yet is still capable of determining the optimal policy, is needed. A POMDP (Cassandra, Kaelbling, & Littman, 1994), which uses observations to calculate the probability of being in each of the system states, has those capabilities and can be applied to our model.

5.1. Partially Observable Markov Decision Processes

The definition of a POMDP is similar to an MDP with the addition of two parameters:


1. A finite set of observations, O.
2. An observation function:

$$O: A \times S \mapsto [\Pr(O_1), \Pr(O_2), \ldots], \qquad \sum_i \Pr(O_i) = 1.$$

In a POMDP, after executing action $a_{j-1}$, the system state $s_j$ may not be determined. Instead, we receive an observation $O_j$ and use it to calculate the probability of being in each state. The probability distribution over all system states is called the belief state. Solving a POMDP involves converting it into an MDP problem by replacing the system state with the belief state (Cassandra et al., 1994). However, this solution conflicts with the definition of an MDP in two aspects:

• Maintaining the Belief State: The entire history of actions and observations is required to keep the belief state updated. This violates the Markovian property, which requires that the next state depend only on the current state and current action. However, as described in Cassandra et al. (1994), Bayes' rule can be used to update the belief state. Knowing the belief value $b(s)_t$ for state $s$ at time $t$, the action taken $a_t$, and the received observation $O_t$, the new belief value $b(s)_{t+1}$ can be calculated as:

$$b(s)_{t+1} = \frac{\Pr(o \mid s, b_t, a) \times \Pr(s \mid b_t, a)}{\Pr(o \mid b_t, a)}. \qquad (11)$$

This enables us to use the belief state as our state set, which converts the POMDP problem into a fully observable MDP.

• Continuity of the Belief State Space: It is impractical to find the optimal solution for all belief states as their space is continuous and has an infinite number of possible states (Cassandra et al., 1994). To overcome this problem, approximation algorithms (Cassandra, 2003; Kurniawati, Hsu, & Lee, 2008) are used to find an approximation to the optimal solution. All algorithms use value iteration to calculate the approximated solution (Cassandra et al., 1994), starting at an initial value function V(b) for the initial belief state, and then iterating using Eq. (12) as follows:

$$V'(b) = \max_a \left[ \sum_s b[s]\, R[s, a] + d \sum_o \Pr(o \mid b, a)\, V(b') \right], \qquad (12)$$

where $V'(b)$ is the improved value function for belief state $b$, $d$ is the future discount factor, and $V(b')$ is the value of the resulting belief state. The differences between the algorithms used to solve the POMDP reside in the way they sample the belief space to reach the optimal solution (Cassandra, 2003; Kurniawati et al., 2008).
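A minimal sketch of the belief update in Eq. (11), written generically for a POMDP with transition probabilities T(s, a, s') and observation probabilities Z(a, s', o); the function and dictionary layout are our own illustration, not part of the original model.

```python
def update_belief(b, a, o, states, T, Z):
    """Bayes update (Eq. 11): b'(s') is proportional to
    Z(a, s', o) * sum_s T(s, a, s') * b(s), normalized by Pr(o | b, a)."""
    unnormalized = {}
    for s_next in states:
        prior = sum(T[(s, a, s_next)] * b[s] for s in states)  # Pr(s' | b, a)
        unnormalized[s_next] = Z[(a, s_next, o)] * prior        # numerator of Eq. (11)
    evidence = sum(unnormalized.values())                       # Pr(o | b, a)
    return {s: v / evidence for s, v in unnormalized.items()}
```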

5.2. POMDP Model

To model uncertainty, we can add different observations to the system and use them in solving the resulting POMDP problem. For example, we can use the following three observations to monitor and calculate the system belief state:

1. Unchanged Honeypot: Honeypots are used only to collect information about attackers and have no production purpose. When attackers penetrate honeypots, they leave traces, such as changes in log files, downloaded files, and other activities. If no changes are observed in the honeypot, this indicates that it is still in the waiting state W. We refer to this observation as Unchanged.

2. Absence of Botmasters' Commands: Botmasters are supposed to make use of their victim machines after turning them into bots. After compromising the honeypot, if botmasters do not send commands to the honeypot, it is possible that they have detected its true nature and disconnected it from their botnet. However, it is also possible that botmasters are busy for a while doing other things, such as compromising other machines to expand the botnet. Thus, this observation leads to uncertainty in determining the honeypot state, whether it is in state C or in state D. We refer to this observation as Absence.

3. Receiving Botmaster Commands: When the honeypot receives commands from the botmasters after being compromised, it is clear that they still consider it a part of their botnet. This indicates that the honeypot is not disclosed yet and is still in state C. We refer to this observation as Commands.

6. SIMULATION RESULTS

In this section, we present and discuss our simulation results. We show how to determine the honeypot's optimal policy based on different configuration parameters, and the effect of these parameters on the honeypot outcome, that is, the expected reward.


We explain how to solve our model using an example of a high liability cost scenario, and we show the effect of changing the probability of attack, Pa, and the probability of disclosure, Pd, on both the expected reward and the optimal policy.

To solve the POMDP model, we use the Approximate POMDP Planning Toolkit (APPL) (Du et al., n.d.), which allows us to find the optimal policy, generate an optimal policy graph, and simulate its expected reward. The APPL toolkit uses the Successive Approximations of the Reachable Space under Optimal Policies (SARSOP) algorithm (Kurniawati et al., 2008) to approximate the solution of the POMDP model. In short, SARSOP is one of the algorithms that focus on reachable points of the belief space. In general, most points of the continuous belief space cannot be reached starting from a given initial belief. This makes the sampling more efficient and helps solve the POMDP in a shorter time, especially for problems with larger sets of states. Furthermore, SARSOP tries to determine the optimally reachable belief space (the space that contains only the points needed for the optimal solution) by applying a technique of learning-enhanced exploration and bounding (Kurniawati et al., 2008).

6.1. Scenario of High Liability Cost

In what follows, we assume the following parameters:

1. Transition Probabilities: The probability of attack Pa = 0.7 and the probability of disclosure Pd = 0.6.

2. Costs: The operation cost CO = 1, the reset cost CR = 2, the information value IV = 10, and the liability cost CL = 15.

3. Observation Probabilities:

• When the system is in state W: The only observation that exists is Unchanged, as the honeypot has not joined any botnet yet. In this case, Pr(Unchanged) = 1.0, Pr(Commands) = 0.0, and Pr(Absence) = 0.0.

• When the system is in state C: When compromised, the observation Unchanged does not happen. Only the other two observations (Commands and Absence) can be observed. In this case, Pr(Unchanged) = 0.0. We assume that Pr(Commands) = 0.7 and Pr(Absence) = 0.3.

• When the system is in state D: Only the observation Absence can be observed, as no further commands from the botmaster are received. Also, Unchanged cannot be observed as the honeypot has already been compromised. Thus, in this case, we have Pr(Unchanged) = 0.0, Pr(Commands) = 0.0, and Pr(Absence) = 1.0.

FIGURE 3 An example of a honeypot optimal policy graph representation in a high liability scenario (Pa = 0.7, Pd = 0.6).

Figure 3 shows the optimal policy for the assumed set of parameters. The system starts in state W with probability 1 and chooses the optimal action A to allow botmasters to compromise the honeypot. If the honeypot is compromised, the system will certainly transit to state C with probability 1, in which the optimal action is N, and it stays there as long as Commands is observed. Upon receiving the observation Absence, the system is considered to be in D with probability 0.833, in which case the optimal action is to reset the honeypot to its initial state W. To summarize, in this scenario, the optimal policy for the honeypot is to use action A when observing Unchanged, to use action N when moving from state C (regardless of the observations), and to use action R when having the observation Absence after using action N.
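The disclosure probability of 0.833 in Figure 3 can be reproduced with the belief update of Eq. (11). The sketch below encodes only the pieces needed for this single step (the effect of action N in state C and the observation probabilities listed above); it is an illustration rather than the full POMDP specification that would be fed to APPL.

```python
Pd = 0.6                              # probability of disclosure (Section 6.1)
obs_absence = {"C": 0.3, "D": 1.0}    # Pr(Absence | state), from Section 6.1

# Starting from Pr(C) = 1, action N leads to D with probability Pd:
prior = {"C": 1 - Pd, "D": Pd}        # predicted state distribution: C 0.4, D 0.6

# Bayes update (Eq. 11) on observing Absence:
unnorm = {s: obs_absence[s] * prior[s] for s in prior}
evidence = sum(unnorm.values())       # Pr(Absence | b, N) = 0.72
belief = {s: v / evidence for s, v in unnorm.items()}
print(belief)                         # roughly {'C': 0.167, 'D': 0.833} -> Reset
```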

6.2. Determining the Optimal Policies and Calculating the Expected Rewards for Different Values of Pa and Pd

In what follows, we study the effect of parameter changes on the expected reward. In particular, we investigate the effect of changing the probability of attack Pa and the probability of disclosure Pd on the expected reward of the system.


FIGURE 4 The system expected reward increases with higher probability of attack, Pa, and decreases with higher probability of disclosure, Pd.

Figure 4 shows the changes of the expected reward for a particular set of cost values when both Pa and Pd change in the range [0, 1]. As depicted in the figure, the value of the expected reward increases when the probability of attack increases. A higher Pa means more attacks are launched to compromise the honeypot and, consequently, more information is collected by the honeypot in each time interval; that is, the honeypot will spend less time in state W and be able to gain the reward IV more often. On the other hand, when the probability of disclosure Pd increases, the expected reward decreases. A higher Pd means that the botmasters are more likely to disclose the true nature of the honeypot and remove it from their botnet, which makes the honeypot less efficient because of the reduction in the collected information. We also notice that higher expected rewards come with a low probability of disclosure even when Pa has a low value. This is because, with a low disclosure probability, honeypots will prolong their stay in the botnet and collect more information in the long term. Security professionals should consider this result when trying to attract more attackers by increasing Pa, for example, by increasing the attack surface, as this may draw the attackers' attention to consider this machine as a possible honeypot and consequently increase Pd. A balance between Pa and Pd is important when the honeypot is required to stay in the botnet for a longer time.

Figure 5 shows the optimal policies for two sets of parameters that differ only in the value of the probability of attack (in Figure 5a, Pa = 0.1 and in Figure 5b, Pa = 0.5). As depicted in Figure 5a, due to the low value of Pa, the system tries to capture more information by choosing action N when the probability of being in D is higher than the probabilities of being in the other states. Based on the received observation, the system may be in state C and collect more information, or in state D (with higher probability). If the probability of being in D is very high, then the system chooses action R and resets to the initial state W. For Pa = 0.5, in Figure 5b, the system directly chooses action R when the probability of being in D is higher than the probability of being in the other states. This is due to the higher probability of attack, which makes it more rewarding to reset the system and wait for new attacks rather than hoping for the current attackers to make new interactions with the honeypot after observing the absence of their commands.

6.3. Determining the System Expected Reward for Different Observation Probabilities

We study the impact of changing the probabilities of observations on the expected reward of the system. In particular, we study the effect of changing the probability of having the observations Commands and Absence after executing action N in state C. To do so, we set the remaining parameters as follows: CO = 1, CR = 2, CL = 15, IV = 10, Pa = 0.8, Pd = 0.5. In this scenario, it is expected to have either observation Commands or observation Absence in state C with probabilities Pr(Commands) and Pr(Absence), respectively, where Pr(Commands) + Pr(Absence) = 1. Figure 6 shows the effect of changing these probabilities on the expected reward of the system.

From Figure 6, we can notice that the expected reward of the system increases with a higher probability of having the observation Commands.


FIGURE 5 Examples of the optimal policy graph for different values of the probability of attack, Pa: (a) Pa = 0.1; (b) Pa = 0.5.

FIGURE 6 The system expected reward increases with higher probability of observing Commands in state C after executing the action N.


This is due to the fact that a higher probability of observing Commands represents more possible interactions with the botmasters. Thus, our model expects the system to receive IV with higher probability at each interaction.

REFERENCES

Abbas, Z., and Li, F. (2012). Energy optimization in cellular networks with micro-/pico-cells using Markov decision process. 18th European Wireless Conference (EW 2012), Poznan, Poland. IEEE, pp. 1–7.

Cassandra, A., Kaelbling, L., and Littman, M. (1994). Acting optimally in partially observable stochastic domains. American Association for Artificial Intelligence, 1023–1028.

Cassandra, A. (2003). POMDPs: Who needs them? Retrieved from www.pomdp.org/pomdp/talks

Du, Y., Hsu, D., Huang, X., Kurniawati, H., Sun Lee, W., Ong, S., and Png, S. (n.d.). Approximate POMDP planning software (APPL). Retrieved from http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl

Feily, M., Shahrestani, A., and Ramadass, S. (2009). A survey of botnet and botnet detection. Emerging Security Information, Systems and Technologies, 268–273.

Ferrie, P. (2006). Attacks on more virtual machine emulators. Symantec Advanced Threat Research. Retrieved from http://pferrie.tripod.com/papers/attacks2.pdf

Fu, X., Yu, W., Cheng, D., Tan, X., Streff, K., and Graham, S. (2006). On recognizing virtual honeypots and countermeasures. IEEE International Symposium on Dependable, Autonomic and Secure Computing, Indianapolis, IN. IEEE, pp. 211–218.

Hayatle, O., Youssef, A., and Otrok, H. (2012). Dempster-Shafer evidence combining for (anti)-honeypot technologies. Information Security Journal: A Global Perspective, 21(6), 306–316.

Jha, S., Sheyner, O., and Wing, J. (2002). Two formal analyses of attack graphs. Computer Security Foundations Workshop, Cape Breton, Nova Scotia, Canada. IEEE, pp. 49–63.

Krawetz, N. (2004). Anti-honeypot technology. IEEE Security & Privacy Magazine, 76–79.

Kreidl, O. (2010). Analysis of a Markov decision process model for intrusion tolerance. Dependable Systems and Networks Workshops, 156–161.

Kurniawati, H., Hsu, D., and Lee, S. (2008). SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. Robotics: Science and Systems IV.

Liu, J., Yu, F. R., Lung, C., and Tang, H. (2009). Optimal combined intrusion detection and biometric-based continuous authentication in high security mobile ad hoc networks. IEEE Transactions on Wireless Communications, 806–815.

Provos, N. (2004). A virtual honeypot framework. In Proceedings of the 13th Conference on USENIX Security Symposium, 13, 1–14.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York, NY: Wiley.

Petrik, M., Taylor, G., Parr, R., and Zilberstein, S. (2010). Feature selection using regularization in approximate linear programs for Markov decision processes. The 27th International Conference on Machine Learning, Haifa, Israel, pp. 871–878.

Shirazi, G., Kong, P., and Tham, C. (2009). Cooperative retransmissions using Markov decision process with reinforcement learning. IEEE 20th International Symposium on Personal, Indoor and Mobile Radio Communications, Tokyo, Japan. IEEE, pp. 652–656.

Sheskin, T. (2011). Markov chains and decision processes for engineers and managers. Boca Raton, FL: CRC Press.

Taibah, M., Al-Shaer, E., and Boutaba, R. (2006). An architecture for an email worm prevention system. Securecomm and Workshops, Baltimore, MD. IEEE, pp. 1–9.

Wagener, G., State, R., Dulaunoy, A., and Engel, T. (2009). Self-adaptive high interaction honeypots driven by game theory. Symposium on Stabilization, Safety, and Security, Lyon, France. Springer Verlag, pp. 741–755.

Zou, C., and Cunningham, R. (2006). Honeypot-aware advanced botnet construction and maintenance. International Conference on Dependable Systems and Networks (DSN-2006), Philadelphia, PA. IEEE, pp. 199–208.

BIOGRAPHIES

Osama Hayatle is currently working towards his master's degree at the Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, Canada. He received his bachelor's degree in 1999 from the College of Electrical Engineering, Information Systems Department, Aleppo University, Aleppo, Syria. His main research interests are in the area of network security and botnets.

Hadi Otrok holds an assistant professor position in the Department of Computer Engineering at Khalifa University. He received his doctorate in Electrical and Computer Engineering (ECE) from Concordia University, Montreal, Canada. His research interests are mainly in the area of network and computer security. He also has interests in resource management in virtual private networks and wireless networks. His doctoral thesis was on Intrusion Detection Systems (IDS) using game theory and mechanism design. While obtaining his master's degree, he worked on security testing and evaluation of cryptographic algorithms. Before joining Khalifa University, Dr. Otrok held a postdoctoral position at the École de Technologie Supérieure (University of Quebec). He serves as a technical program committee member for different international conferences and is a regular reviewer for different specialized journals. He has also co-chaired several security-related conferences.

Amr Youssef received his Bachelor of Science and Master of Science degrees from Cairo University, Cairo, Egypt, in 1990 and 1993, respectively, and he received his doctorate degree from Queen's University, Kingston, ON, Canada, in 1997. Dr. Youssef is currently a professor at the Concordia Institute for Information Systems Engineering (CIISE) at Concordia University, Montreal, Canada. Before joining CIISE, he worked for Nortel Networks, the Center for Applied Cryptographic Research at the University of Waterloo, IBM, and Cairo University. His research interests include cryptology, network security, and wireless communications.
