Detection of fast- ux botnets through DNS tra c analysises3ce/Elahe...

Scientia Iranica D (2015) 22(6), 2389{2400

Sharif University of TechnologyScientia Iranica

Transactions D: Computer Science & Engineering and Electrical Engineeringwww.scientiairanica.com

Detection of fast- ux botnets through DNS tra�canalysis

E. Soltanaghaei and M. Kharrazi�

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.

Received 31 July 2014; received in revised form 2 February 2015; accepted 15 September 2015

KEYWORDSBotnets;Bot;C&C channel;Fast- ux;IP- ux,DNS server.

Abstract. Botnets are networks built up of a large number of bot computers, whichprovide the attacker with massive resources, such as bandwidth, storage, and processingpower, in turn, allowing the attacker to launch massive attacks, such as Distributed Denialof Service (DDoS) attacks, or undertake spamming or phishing campaigns. One of the mainapproaches for botnet detection is based on monitoring and analyzing DNS query/responsesin the network, where botnets make their detection more di�cult by using techniques suchas fast- uxing. Moreover, the main challenge in detecting fast- ux botnets arises from theirsimilar behavior with that of legitimate networks, such as CDNs, which employ a round-robin DNS technique. In this paper, we propose a new system to detect fast- ux botnets bypassive DNS monitoring. The proposed system �rst �lters out domains seen in historicalDNS traces assuming that they are benign. We believe this assumption to be valid as benigndomains usually have long lifetime as compared to botnet domains, which are usually short-lived. Hence, CDN domains, which are the main cause of misclassi�cation when looking formalicious fast- ux domains, are removed. Afterwards, a few simple features are calculatedto help in properly categorizing the domains in question as either benign or botnet related.The proposed system is evaluated by employing DNS traces from our campus network andencouraging evaluation results are obtained.© 2015 Sharif University of Technology. All rights reserved.

1. Introduction

One of the most important and prominent sources ofthreats on the internet are botnets. These networks arebuilt up of a large number of bot computers which pro-vide the attacker with massive resources such as band-width, storage, and processing power. In turn, withsuch resources, the attacker will have the capabilityto launch massive attacks such as Distributed Denialof Service (DDoS) attack, or undertake spamming orphishing campaigns. The main characteristic whichdistinguishes bots from regularly infected machinesis their ability to coordinate their action through

*. Corresponding author. Tel.: +98 21 66166627E-mail addresses: [email protected] (E.Soltanaghaei); [email protected] (M. Kharrazi)

a Command and Control (C&C) channel under thecontrol of a bot master (i.e., the attacker).

There are two main approaches in detectingbotnets. One is based on monitoring and analyz-ing network tra�c, looking for signs that indicatea computer is connected to a C&C server or otherbehavior indicative of the presence of a bot in thenetwork. The other approach proposed by researchersis based on analyzing DNS query/responses in thenetwork. The importance of such an approach wouldbe better understood considering the fact that botsneed to locate the C&C server by resolving its domain.Hence, by monitoring this channel, abnormal DNSquery patterns could be detected, potentially beforethe bot communicates with the C&C server. Eachapproach has its advantages and disadvantages. Themain advantage of network tra�c analysis is that it

2390 E. Soltanaghaei and M. Kharrazi/Scientia Iranica, Transactions D: Computer Science & ... 22 (2015) 2389{2400

enables the detection of some types of botnets, such aspeer to peer botnets, which cannot be detected by DNStra�c analysis. Nevertheless, analyzing large volumesof network tra�c has its own di�culties. Furthermore,if bots are encrypting their communication to the C&Cserver, then techniques based on analyzing the contentof packets sent over the network would not work. Thisis while DNS tra�c cannot be encrypted. Hence, inthis work we concentrate on a detection scheme basedon monitoring DNS tra�c.

Moreover, attackers take di�erent defensive mea-sures to make their botnets more resilient to detectionand, in turn, take-down by network providers. Onewidely used approach is fast- uxing, in which thedomain name for the C&C server resolves multiple IPaddresses, leading to di�erent copies of the C&C server.Hence, the tra�c of the botnets is distributed amongdi�erent C&C servers, in turn, making their detectionmore di�cult. The main challenge in detecting fast- ux botnets is that their behavior is very much similarto that of the legitimate networks using the Round-Robin DNS technique (RRDNS) [1]. As shown in [2],due to the importance of availability in internet ser-vices, large websites use the RRDNS method to dis-tribute the incoming tra�c between di�erent servers.This method is also used in Content DistributionNetworks (CDNs) and makes their behavior similarto that of the fast- ux botnets. Therefore, using theRRDNS method by the large CDNs causes di�culty inaccurate detection of fast- ux botnets.

In this paper, we introduce a new system todetect malicious fast- ux domains by passive DNSmonitoring. The proposed system has two mainmodules for analyzing the behavior of fast- ux botnets.First, it removes a high percentage of benign domains,including those of the CDNs, with the aid of historicaltra�c. The main idea employed in this module is theshort lifespan of botnet domains, as noted in [3], incomparison with the benign CDN domains. Thus, byeliminating the domains present in historical tra�c, theCDN domains will be removed (because of their longlifespan) and malicious domains with short lifespanswill remain for further analyses. Furthermore, thispaper provides an e�cient implementation for theusage of historical tra�c as white-list. The secondmodule uses two fundamental attributes of fast- uxdomains (high ux of resolved IPs and small TTLvalues) in the form of probability functions to deter-mine the ux rate for each domain. In summary, thecontributions/highlights of this work include:

� This paper presents a novel approach for detectionof malicious fast- ux domains. The proposed systemcalculates the ux rates of a domain by determin-ing the ux probability functions which are basedon fundamental characteristics of fast- ux domains

(i.e., high uxing of IPs). As this approach relies onbasic features, it can detect di�erent kinds of fast- ux domains without relying on information from aspeci�c family of botnets.

� This paper employs historical tra�c as a white-list. Using historical tra�c not only reduces thesize of tra�c being analyzed, but also eliminatesthe CDN domains which are the main cause of mis-classi�cation when looking for malicious fast- uxdomains. The Bloom �lter data structure is used tostore historical tra�c with minimal storage space.

� The proposed system adopts Sequential Probabil-ity ratio Testing (SPRT) as a statistical methodto provide online detection. The SPRT methodaggregates the results of ux probability functionsof each domain in sequential time windows.

� The proposed system is evaluated by employingDNS traces from our campus network. Furthermore,it is shown that the proposed system can achieve a94.44% detection rate and 0.001% false positive rate.

The remainder of this paper is organized asfollows, Section 2 reviews related works. In Section 3,the proposed system is explained in detail. Section4 evaluates the performance of the proposed system.Also, it provides a short explanation about the speci-�cities of the system implementation. Results arediscussed in Section 5, and the manuscript is concludedin Section 6.

2. Related works

There has been much prior work in the �eld of botnetdetection. A number of approaches monitor andanalyze network tra�c [4-7], whereas techniques suchas those in [8-12] concentrate on DNS tra�c. In thispaper, we focus on DNS-based detection techniques,which analyze fast- ux botnets, and consider proposalson analyzing domain- uxing techniques as out of scope.

A group of studies has analyzed the behavior ofbotnets and their characteristics. Konte et al. [13] showthat despite the similarity of the TTL values in fast- ux networks and CDNs, the operations of the twonetworks are di�erent. For instance, the IP distributionof malicious fast- ux networks has a larger variancethan that of the content distribution networks. In 2009,Caglayan et al. [3] observed that fast- ux domainsusually have a short lifespan of about 2 weeks to 2months.

The �rst fast- ux detection system among activeapproaches was proposed by Holz et al. [8]. This systemcalculates a ux score for each domain based on thenumber of distinct A records and the number of NSrecords for a domain. One of the weaknesses of thisapproach is that the features used cannot distinguish

E. Soltanaghaei and M. Kharrazi/Scientia Iranica, Transactions D: Computer Science & ... 22 (2015) 2389{2400 2391

fast- ux domains from CDN domains properly, whichresults in false positives. A number of approaches, suchas those in [14,15], have been built on the work of Holzet al., wherein initially suspicious domain names aregathered from spam emails and then those domainsare monitored by active probing. The main drawbackof such approaches is that they rely on active DNSprobing; furthermore, the domains they monitor arelimited to those observed in spam emails. Moreover,active probing DNS-based techniques, such as thosenoted above, create excessive network tra�c and maybe detected by an attacker.

Perdisci et al. [9] proposed a detection techniquebased on passive monitoring of DNS tra�c withoutlimiting the detection technique to domain namesextracted from spam emails and black-lists. Theycalculated the similarity between the resolved IP sets ofdi�erent domains in order to cluster domains belongingto the same service, and then, based on a set of activeand passive features, classi�ed the clusters as eitherbenign or malicious. Later, Perdisci et al. [10] proposedFluxBuster, which was built on their earlier work, and,in essence, removed the dependency on active featuresand relied solely on passive features.

Antonakakis et al. [11] proposed Notos, withwhich reputation scores are assigned dynamically todomain names based on their level of malicious ac-tivities. Notos uses the passive DNS analysis andit assigns a low reputation score to the maliciousdomains according to the network- and zone-basedfeatures. The network-based features determine the uxy behavior of each domain and zone-based featuresdistinguish benign CDN domains from malicious ones.As a limitation, Notos is not able to assign a correctreputation score to the domain names with very littlehistorical information and it requires a large passiveDNS collection. In [12], a system called Exposureis introduced which employs a passive DNS analysistechnique to detect malicious domains. Exposure triesto detect abrupt changes in the number of requests tothe domains which show abnormal behaviors.

Recently, Choi and Lee [16] have proposed amechanism called botGAD by monitoring maliciousgroup activities in the DNS tra�c. To �nd the groupactivities, botGAD measures similarity of di�erent do-mains by building a matrix based on features extractedfrom DNS traces. However, as the analysis is performedperiodically and in independent time windows, botnetscan circumvent detection by delaying communicationand spreading their tra�c over multiple time windows.

Most of the previously proposed techniques couldbe circumvented by some changes in the activities ofbots, like producing noisy tra�c (i.e. generating fakeDNS queries), sub-grouping members of a botnet, etc.improving the methods of IP assignments. Given theshortcomings present in previous approaches in fast-

ux botnet detection, we propose a new detectionmechanism based on the fundamental and inherentcharacteristics of fast- ux botnets. We should notethat as the proposed approach analyzes domains seenin requests, independently, it is not a�ected by thebehavior of hosts in the network or other domainsqueried, as are some of the previously proposed tech-niques.

3. System overview

One of the main challenges in detecting malicious fast- ux domains is the similarity between such domainsand domains employed in CDNs. These legitimate net-works distribute the tra�c loads of a website betweenseveral servers by using the round-robin DNS tech-nique, where a group of IPs are assigned to a domainname and a permutation is returned in response to eachrequest. The used IP sets have small TTL values inorder to control the load on each server, while changingtheir permutation in sequential requests. Similarly,these two features (the large number of resolved IPsand small TTL values) are used in malicious fast- uxnetworks. Hence, distinguishing malicious domainsfrom benign domains becomes a challenge.

Nevertheless, with a closer examination in [8],it is shown that IP allocation is done di�erently inCDNs vs. fast- ux networks. In the CDNs, the listof resolved IPs is generally �xed and their permutationis changed as requests come in, while in the fast- uxnetworks, the list of resolved IPs can be di�erent ineach response. The reason is that compromised serversare used for hosting malicious domains and they couldget turned o� by their owners or cleaned up. Hence,alternate compromised servers are used. In addition,these compromised servers are distributed in di�erentsubnets. Therefore, there is no relationship betweenIPs returned by the fast- ux networks.

In the rest of this section, the proposed systemis discussed in details. The main objective of theproposed method is to detect fast- ux botnet domains.Section 3.1 describes the overall architecture of theproposed system. In Sections 3.2 and 3.3, two featuresof the system are presented including Bloom �lterapplication in history storage and de�ning probabilityfunctions for showing malicious rate of a domain.Then, Section 3.4 describes the application of SPRTalgorithm and how it is employed in the proposeddetection technique.

3.1. System architectureThe proposed system uses three basic attributes ofmalicious fast- ux domains: (1) large number of re-solved IPs, (2) small TTL values, and (3) di�erentIP allocation methodology in malicious fast- ux andCDN domains: di�erent IP sets in sequential responses


vs. repeating IP sets. The �rst and third attributesare used in de�ning ux probability functions and theTTL value is checked as the last �lter in the detectionmodule.

More speci�cally, there are two main modules of(1) storing history, and (2) online tra�c analysis withthe aim of botnet detection. The system architectureis shown in Figure 1. Figure 1(a) includes two sub-processes of \Tra�c Parser" and \History Saver". Inthe \Tra�c Parser", the DNS history tra�c is collectedand the required information such as domain names areextracted. Afterwards, the \History Saver" stores theextracted domain names within the Bloom �lter datastructure. The utilization of this data structure willreduce the volume of data stored as well as the search

time overhead. (Further details on Bloom �lters aredescribed in Section 3.2.)

Figure 1(b) illustrates the detection service com-ponents. In this step, the collected DNS traces areparsed and the required information is extracted, i.e.\Tra�c Parser". Then the extracted domains arecompared with those stored in the Bloom �lter, i.e.\Bloom Filter Checker". If a domain name exists inthe history, then it does not belong to a maliciousIP ux network and can be discarded. The domainnames, that do not exist in the Bloom �lter, areinserted into the hash map data structure built on topof the data structure proposed in [16]. As shown inFigure 2, the hash map includes a domain map thathas a domain name as a key and a requesting IP map

Figure 1. The overall architecture of the proposed system

Figure 2. The hashmap data structure.


as a value. Requesting IP maps have an IP address ofthe requesting host as a key and a resolved IP map asa value. This structure is repeated in the next map,in which the key is the resolved IP address and theinformation list consists of the TTL and time stampsas values. So, each DNS query constitutes a branch ofthis tree data structure. In the next step, the \FeatureGenerator" block extracts the required features. Theprobability functions measure the amount of resolvedIP growth for the domains which is in fact the valueof the ux rate. Then, the SPRT analyzer performsa hypothesis test with the aim of online detection ofmalicious domains. Detected domains are stored in thedatabase and suspicious domains and their value of IPgrowth are stored in an alternate database to be usedin later analysis.

The owchart of the proposed system is shown inFigure 3. In the �rst step, the history tra�c is stored inthe proper data structure. Then, the analyzed tra�c isgathered in a preset time window and compared withthe white-list. In the next step, the remaining domainsare compared with the history. Hence, both domainsfound in the white-list and history are removed. Atthe end of a time window, the proper features areextracted and two probabilistic functions are calculatedto determine ux rates of a domain name. In the�nal step, a sequential probability testing is used formaking a decision about each domain according to thevalue of probabilistic functions. This statistical methodcan produce three di�erent outputs of \malicious",\benign", and \Inadequacy of information" for every

Figure 3. The proposed system owchart.

analyzed domain. If the information is not su�cient,the domain name will be stored for examination in thenext time window. Furthermore, if the output of thesequential testing is malicious, the TTL condition willbe checked. Hence, a domain will be recognized asmalicious if it has three characteristics of a maliciousdomain explained earlier in this section.

3.2. Historical data storageThe proposed system uses historical tra�c to removebenign domain names, including CDNs. But, thereare two challenges in using historical tra�c for CDN�ltering. First, the volume of historical tra�c wouldbe quite large and a proper storage methodology wouldbe required. Second, the comparison of current tra�cwith the large volume of history needs an optimizedsearch algorithm in order to check the existence of eachrequested domain with minimal time overhead. Tosolve these challenges, we employed the Bloom �lterdata structure [17], which is able to provide e�cientstorage and searching with an adjustable false positiveand a false negative error rate of zero.

The Bloom �lter data structure works by storingeach domain name into a number of bits in a bit arrayinitialized to all zeros. This is done by using one ormore hash functions to indicate the related indicesin the array which will be set to one. After storingmultiple domains, multiple locations in the bit arraywould have a value of one. In the search phase, thedomain being searched is hashed and it is checked if allthe corresponding locations in the bit array are set toone; if that is the case, then it can be concluded thatthe domain in question has been previously stored inthe bloom �lter.

Furthermore, the Bloom �lter data structureprovides the ability to determine the size of storagearray according to the value of false positive errorrate. Where false positives may occur when a setof indices corresponding to a domain are changed toones, when other domains cause the change at thoseindices. Moreover, Bloom �lters do not have a falsenegative error rate. In other words, if a domain nameis inserted into the storage array, the search resultsfor that domain name will always be positive. Thisattribute is essential in the performance of the proposedsystem, because it could guarantee that all the domainspresent in the history �les, including CDNs, will be�ltered accurately. This would reduce the overall falsepositive error rate of the detection system.

3.3. Flux rate of a domainAs noted earlier, after comparison of the domain nameswith history and storing the required information in thehash map, the desired features are extracted in order tocalculate the ux rate of the domains. Generally, the ux rate quanti�es the change/growth in the resolved


Table 1. Features used.

Row Feature Description

1 #SReq Number of single-IP requests for a domain2 #MReq Number of multiple-IP requests for a domain3 #SResIP Number of distinct resolved IPs in single-IP type for a domain4 #MResIP Number of distinct resolved IPs in multiple-IP type for a domain5 MReqSize The average number of resolved-IP in multiple-IP requests of a domain6 #FirstMResIP The number of resolved IPs in the �rst packet of a domain name in a time window7 TTL value The average value of time to live of a resolved IP for a domain

Table 2. Examples of DNS single and multiple response.

1 Single domain www.hamunshop.ir 76.164.198.3

2 Multiple domain

forthworth.biz 216.239.34.21




IPs for each domain. The list of the required featuresare shown in Table 1. The two main features are thenumber of requests and the number of distinct resolvedIPs for a domain.

The de�nition of the IP growth of a domain di�ersbased on the number of resolved IPs in each DNSresponse. As shown in Table 2, DNS response packetscan include one or more resolved IPs. If the packetconsists of one resolved IP, IP growth can be calculatedby the division of the number of distinct resolved IPs tothe number of requests. But if the packet includes morethan one resolved IP, this de�nition does not correctlycalculate the IP growth rate, because the multitude ofIPs in one packet implies false growth. For example, ifeach DNS response related to a domain contains fourdi�erent IPs and the same IPs are replied in response tothe next DNS request, this domain does not have anyIP growth; while based on the previous de�nition, theIP growth equals to 4 resolved IPs divided by 2 queriesor, in other words, 200%, which is not correct. To solvethis confusion and as shown in Table 1, we de�ne twoIP growth functions for the two groups of Single-IPDNS responses and Multiple-IP DNS responses.

The ux rate of single-IP (noted as \single")responses is de�ned using Eq. (1). The �rst factorcalculates IP growth based on the number of distinctresolved IPs among all responses and it implicitly showsthe ux rate of single responses in a time window.The second factor, as introduced in [9], is a sigmoidalweight that measures the con�dence of the IP growthvalue. Because, while the values of \SResIP" and\SReq" might be small, the IP growth value may stillbe high. Therefore, this factor reduces the con�dence ofthe IP growth when the resolved IP sets or request setsare small. For example, if #SResIP=3 and #SReq=5

or #SResIP=30 and #SReq=50, the SIP value is 0.6 inboth cases, but the con�dence in the ux rate value inthe second case is higher. The value of is consideredto be 3 by using the same logic proposed in [9].

SIP =#SResIP

#SReq� 1

1 + e( �min(#SResIP;#SReq)) :(1)

Similarly, the IP growth rate of multiple-IP (noted as\multiple") responses of a domain is calculated usingEq. (2). The main purpose of this equation is tocalculate the average number of new IPs of a domainresolved by each DNS response. To achieve this aim,the resolved IPs of the �rst DNS response for a domainis stored in a time window and by comparison withnext IPs, the number of new IPs in the next responsesare measured. This value constructs the numerator ofIP growth function. Then, by dividing this value to thenumber of DNS responses during the time window, theaverage number of new IPs in each DNS response willbe calculated. Therefore, Eq. (3) de�nes the multiple ux rate of a domain. The �rst factor calculates the ux rate by dividing the IP growth value to the averagesize of multiple DNS responses and the second factoris the con�dence weight. The best value for isconsidered to be 2 based on the same logic proposedin [9].

IP Growth =#MResIP �#FirstMResIP

#MReq � 1; (2)

MIP =IP GrowthMReqSize

�1

1 + e( �min(MReqSize;#MReq)) : (3)

3.4. Online detection by sequential testingMost of the previous detection techniques noted earlierwould require long intervals of observed tra�c in orderto detect a botnet in the network. Our goal in theproposed technique was to limit this time delay, and,when possible, report the existence of a bot in thenetwork. As such, the tra�c is analyzed in shorttime windows and the IP growth is calculated for


the requested domains independent from other timewindows. Then, the calculated values are combinedwith those of the previous time windows for eachdomain in order to improve the con�dence of thesystem output. To that aim, the Sequential ProbabilityRation Testing (SPRT) is used as a statistical method,with which the current IP growth value of a domain isadded to those observed in previous time windows.

The de�nition of this statistical method, basedon the International encyclopedia of statistical sci-ence [18], is as follows. There are two hypotheses,denoted as H0 and H1, which we consider for eachdomain under consideration, where they correspond tobenign and malicious domains, respectively. For eachhypothesis, a Probability Density Function (PDF) isde�ned like Eq. (4) in which S1 is a random variable instate 1.

f1(S1) = K1; f0(S1) = K0; (4)

K0 and K1 denote the probability of a domain beingbenign or malicious, respectively. In the proposedsystem, the PDFs of f0 and f1 are considered as follows:

f0(Si) = 1� SIP (or MIP );

f1(Si) = SIP (or MIP ): (5)

By considering S1; S2; :::; Sn as the samples of di�erenttime windows, in which the SIP and MIP are calcu-lated, the likelihood ratio will be de�ned as:

�n =f1(S1; S2; :::; Sn)f0(S1; S2; :::; Sn)

= �ni=1

�f1(Si)f0(Si)

�: (6)

The equation can be converted to an accumulated formof likelihood ratio for independent analysis of each timewindow. Then, given two thresholds of T0 and T1 whereT0 < T1, at each time window, the likelihood ratio iscalculated and it is compared with these thresholds.Therefore:

� If �n � T0, then H0 is accepted;� If �n � T1, then H1 is accepted;� If T0 � �n � T1, then the current information is not

enough to make a decision and the analyses shouldbe continued.

We should note that the above noted thresholds,T0 and T1, can be calculated based on:

T0 =�

1� �; T1 =1� ��

; (7)

where, � and � are the desired false positive and falsenegative rates, respectively.

Based on the SPRT function, MIP and SIP valuesare calculated at the end of each time window for each

domain. Then, the likelihood ratios are calculated.If the ratio exceeds one of the thresholds, it meansthat enough information is provided to decide on adomain and it is assumed as a malicious or benigndomain. Domains with likelihood ratios between thetwo thresholds are saved to be examined in later timewindows.

4. Implementation and evaluation

The proposed system was implemented by integratinga set of modules implemented in C, Java, and Shellscripts on a Linux operating system. Additionally, aset of open source tools, such as Rapidminer [19] andTshark [20], were employed. Furthermore, the systemwas deployed at the Sharif University of Technologycampus, gathering tra�c from the primary DNS serversof the campus. In what follows, we will �rst introducethe dataset used and the evaluation methodology em-ployed in Section 4.1 and then evaluate e�ectiveness ofthe bloom �lters in Section 4.2. Detection accuracy ofthe proposed system is studied in Section 4.3.

4.1. Datasets and evaluation methodologyWe have used two sets of traces in order to evaluate theproposed botnet detection technique. These traces are:

� Trace #1: Multiple DNS traces obtained from theSharif University of Technology primary DNS server(i) from March 1st through 30th, 2012 and (ii) fromJanuary 1st through 31st, 2013.

� Trace #2: DNS traces were collected while execut-ing three botnet samples on a VM for 3 days within acontrolled environment; at the same time, a numberof benign applications were also executed on thesame VM. Afterwards, the collected DNS tra�c wasmerged with the DNS traces collected at the campuslevel between January 1st through 31st, 2013.

For our historical dataset, we employed DNStraces collected from January 1st through December31st, 2011. Furthermore, we used a white-list to removewell known domains (i.e. google, yahoo, etc.) fromthe dataset. The white-list was created based onthe Alexa's ranking [21] and by selecting the top 100popular domains visited by clients. Furthermore, someof domains, we thought could be potentially employedby a botnet (e.g. domains, related to weblog services),were removed from the white-list.

To evaluate accuracy of the proposed technique,we needed to label the domains found in the DNS tracesas either benign or malicious. To that end, a black-listwas created from several sources, which includes theMalwareDomainList [22], DNS Blackhole [23], and At-las [24]. Furthermore, domains generated by botnets,such as Zeus, Spyey, and Palevo [25], where includedin the black-list.


Given the above noted datasets, the proposed bot-net detection system was evaluated based on the falsepositive and false negative rates obtained. However, inissues such as botnet detection, there is not balancebetween the sizes of two classes of data (maliciousand benign domains). Moreover, correct detection ofthe malicious group is more important than that ofthe benign domains. Hence, additionally, the F-scorecriterion is also considered. This criterion is suitablefor the cases in which the two groups of data are notbalanced [26]. F-score is obtained from the de�nitionsof Precision and Recall according to Eqs. (8)-(10).TP, FP, and FN are true positive, false positive, andfalse negative values, respectively.

Precision =TP

TP + FP; (8)

Recall =TP

TP + FN; (9)

F = 2� Precision�RecallPrecision+Recall

: (10)

4.2. Bloom �lter performanceAs noted earlier, the proposed system stores a yearof historical DNS tra�c into the Bloom �lter in orderto �lter long-lived domains, which are benign withhigh probability, as argued before. Therefore, we �rsteliminate popular and well-known benign domains withthe help of the white-list noted earlier. After this initial�ltering, about 12,800,018 domains remain which arethen inserted into the bloom �lter data structure. Theproper size of the storage array for the Bloom �ltercan be calculated considering the number if inserteddomains and the maximum value of 0.001% for the falsepositive error rate, based on Eq. (11) [17]:

m = � n ln p(ln 2)2 ; (11)

where p is the given false positive probability, n is thenumber of elements being inserted, and m is the lengthof the Bloom �lter.

As shown in Table 3, and given the numberof inserted domains, a Bloom �lter with the size of

Table 3. Statistical attributes of the Bloom �lter.

Number of inserted records 12,900,018Bloom �lter array size 382602725 bit = 45MBFalse positive error rate 7.5135E-7

45 MB is required and the practical false positiverate equals 7:5 � 10�7%. Therefore, by using theBloom �lter data structure, we require a relativelysmall memory footprint to store the large number ofhistorical domains.

In the second module of the proposed system,DNS domains observed would be checked as to whetherthey are inserted into the Bloom �lter previously (i.e.,were seen in historical traces), and domains not foundin the historical tra�c will be passed to the detectionprocess. To calculate the performance of the Bloom�lter in reducing the analyzed data, the number ofrequested domains is measured during the analyzedmonth and compared with the number of remainingdomains after Bloom �lter checking. The number ofdomains before �ltering is 4,262,872 and the remainingdistinct domains after comparison with Bloom �lter arejust 773,244 domains. Consequently, history checkingreduces the test tra�c by 82% (see Table 4).

4.3. Detection performanceThe proposed system analyzes the DNS tra�c ina short time window and aggregates the results inconsecutive windows to decide about the status of eachdomain. In the implemented prototype, the size oftime window is considered as a day and the status ofeach domain is determined by moving the time windowand aggregating the values of the likelihood ratios insequential windows.

The proposed mechanism needs some input pa-rameters. First, we should select a threshold for themaximum TTL value of fast- ux domains. Accordingto the characteristics of the current fast- ux botnets,this parameter is considered 3600 seconds [3]. Secondly,we used the traces from March 1st through 30th, 2012(i.e., from Trace #1), as the training dataset withwhich the required thresholds in the SPRT algorithmwere con�gured to provide a FP = 0:15% and FN =0:2%.

Afterwards, we examine the system performanceby using Trace #1 from January 1st through 31th,2013. Although the real tra�c does not include enoughfast- ux domains, the proposed mechanism detects7 malicious domains from the available 8 domainscorrectly, as shown in Table 5. Hence, we obtain adetection rate of 87.5%. Furthermore, three domainsare false positives of the system and the others aredetected correctly.

Examples of detected domains, false positives,and false negatives are shown in Table 6. The false

Table 4. The performance of Bloom �lter on one-month tra�c.

Number of distinct domains before comparison with Bloom �lter 4,262,872

Number of distinct remaining domains after comparison with Bloom �lter 773,244

Reduction percent 82%


Table 5. Result summary of two traces.

Result summary Trace #1 Trace #2Number of distinct domains 180141 180159Number of benign domains 180130 180138Detected domains 7 17False positive 3 3False negatives 1 1Detection rate (recall) 87.5% 94.44%False positive rate 0.002% 0.001%False negative rate 12.5% 5.55%Precision 70% 85%F-score 77.78% 89.47%

Table 6. Detected malicious domains, false positives, andfalse negatives in Trace #1.

Type Domain name SIP/MIP

Malicious domains

gasosvaz.ru. 0.99fetolbus.ru. 0.94fawsilom.ru. 0.91ecrihgep.ru. 0.73ikbyznod.ru. 0.72fyzsicat.ru. 0.71

False positivescm.adgrx.com 0.87oparle.com 0.73api.crtinv.com 0.72

False negative gehxehib.ru. 0.34

Inadequate info epejanhi.ru. 0.52

positives are related to the benign domains with largepools of resolved IPs. Therefore, the proposed systemmis-classi�es them. On the other hand, among theexisting malicious domains in the test tra�c, onedomain is not detected (a false negative). This domainis \gehxehib.ru." which had few requests during theanalyzed month and its ux behavior was not repre-sented during that period. In addition, the requeststo this domain were distributed over several days andjust one or two requests existed in each day. Hence, theproposed system was unable to detect this domain. Theother malicious domain named "epejanhi.ru." receivesan "inadequate information" label as the output of thesystem, which indicates that the SPRT method wasnot able to determine the state of this domain giveninsu�cient activity in the observed period.

Afterwards, we repeated the experiment usingTrace #2, while using the same SPRT threshold valuesas that in the previous experiment. Such an approachwas essential, as there were few malicious domains inTrace #1 dataset and the evaluation metrics employed

Table 7. The detected malicious domains, false positives,and false negatives in Trace #2.

Type Domain name SIP/MIP

Malicious domains

brylanehome.com 0.99

ikbyznod.ru 0.94

carsales.com.au 0.88

stjohnhos.co.uk 0.8

new irtingdates.info 0.78

False positivescm.adgrx.com 0.87

oparle.com 0.73

api.crtinv.com 0.72

False negative gehxehib.ru. 0.34

Inadequate info epejanhi.ru. 0.52

did not properly indicate the accuracy of the system.Overall, we observed 180159 distinct domains in Trace#2. The accuracy and detection rates of the proposedsystem are shown in Table 5. The detection rateequals 94.4% and the false positive rate equals 0.001%.Examples of malicious domains detected and falsepositives are listed in Table 7. The domains whichresulted in false positives and false negatives are thesame as in those of Trace #1.

5. Discussions and future works

In the previous section, we evaluated the accuracyof the proposed detection scheme. One importantquestion which one could raise is that how dependentis the proposed detection scheme on the botnet's levelof activity. In order to analyze this issue, we spreadDNS query/responses collected over a period of threedays from the three botnet samples, over a longerperiod of 5 days and re-executed the experiment.Figure 4 illustrates the number of days needed to detectmalicious domains before and after the time spread.Importantly, the same malicious domains are detectedin both cases and the only di�erence is the detectiontime. Hence, one could conclude that when botnetsreduce their activity, by generating fewer requests perday, the detection system would require more time todetect the malicious domain. This is mainly due to thefact that the proposed detection system aggregates thetra�c information over sequential time windows.

Table 8 compares the proposed system withtwo other existing botnet detection approaches, Bot-GAD [16] and Fluxbuster [10]. BotGAD analyzes timewindows, separately, and it calculates the similarity ofhost activities in independent time windows. Hence,if the activity of the botnet is spread over time,then as each time window is analyzed independently,


Table 8. Comparison between fast- ux botnet detection approaches.

Approach FluxBuster [10] BotGAD [16] Proposed technique

Robustness against minimized activity Low Medium High

History usage Yes No Yes

Capability of data reduction Low N/A High

Calculation overhead High Medium Low

Figure 4. Comparison between the numbers of daysrequired to detect the botnet domains, when usingcollected traces in 3 days, and when spreading the DNSqueries over a period of 5 days.

the botnet will not be detected. Whereas in theproposed technique, information from previous timewindows is also used in order to obtain more accurateresults. On the other hand, Fluxbuster [10] requiresa minimal number of resolved IPs for each domainin order to create a domain cluster and perform therequired classi�cation procedures. Hence, a botnetcould circumvent this mechanism by sub-grouping thebots into distinct groups and avoiding the thresholdrequired by Fluxbuster.

The other parameter of the comparison is theamount of data reduction, which in our case equals82%. We are able to reduce the amount of analyzedDNS tra�c by considering domains visited in a his-torical period, and still being visited today as non-malicious. This is because usually malicious domainshave a short lifetime. In contrast, it seems thatBotGAD performs no pre-�ltering, and Fluxbuster hassimple �ltering rules with low e�ect. Finally, theproposed system has a much lower computational costthan that of the other two approaches. The proposedsystem only requires calculating two probability scoresfor each domain, while FluxBuster and BotGAD arebased on machine learning and similarity-temporal cor-relation, respectively, which require more complicated

processing. It must be noted that we were unableto compare detection accuracy results as they weredependent on the tra�c used for evaluation to whichwe were unable to obtain access.

Nevertheless, the proposed system has its ownlimitations. For example, as the probabilistic functionsare calculated independently in di�erent time windowsfor each domain, an attacker can avoid detectionby reducing the activities of the bots during eachtime window to a very small number of DNS queries(e.g., 3 queries, dependent on the parameters set inSection 3.3). Nevertheless, by doing so, the bots dolimit their own availability. Perhaps such shortcomingcould be alleviated by considering the historical statesof each domain.

6. Conclusion and future works

In this paper, we propose a new system to detectfast- ux botnet domains by passive DNS monitoring.The proposed system �rst �lters out domains seen inhistorical DNS traces, assuming that they are benign.We believe this assumption to be valid as legitimatedomains usually have a longer lifetime, where onthe other hand, botnet domains are usually short-lived. Hence, CDN domains, which are the maincause of mis-classi�cation when looking for maliciousfast- ux domains, are removed. Afterwards, a fewfeatures are calculated to help in properly categorizingthe domain in question as either benign or botnetrelated. Our mechanism needs a small amount ofdata and it analyzes DNS tra�c in a short andsequential time windows and calculates the �nal re-sult by aggregating the output of independent timewindows. The proposed system can detect maliciousdomains with the minimal amount of tra�c by us-ing SPRT and it provides over 94% detection ratewhile generating 0.001% false positive rates basedon experiments using DNS traces from our campusnetwork.

As part of the future works, we are currentlyconsidering how the proposed system could be deployedin a larger-scale network, so that it would be ableto observe and detect further malicious activities.Furthermore, and in order to expand our work, we areinvestigating if grouping domains, resolving to the sameset of IP addresses, could help in detecting domain


uxing in addition to the proposed fast- ux detectiontechnique.

References

1. Brisco, T., DNS Support for Load Balancing, RFC1794 (April 1995).

2. Cardellini, V., Colajanni, M. and Philip, S.Y. \Dy-namic load balancing on web-server systems", IEEEInternet Computing, 3, pp. 28-39 (1999).

3. Caglayan, A., Toothaker, M., Drapaeau, D., Burke, D.and Eaton, G. \Behavioral analysis of fast ux servicenetworks", in Proceedings of the 5th Annual Work-shop on Cyber Security and Information IntelligenceResearch, p. 48, ACM (2009).

4. Gu, G., Perdisci, R., Zhang, J. and Lee, W. \BotMiner:Clustering analysis of network tra�c for protocol-and structure-independent botnet detection", in Pro-ceedings of the 17th USENIX Security Symposium(Security'08) (2008).

5. Gu, G., Porras, P., Yegneswaran, V., Fong, M. andLee, W. \BotHunter: Detecting malware infectionthrough ids-driven dialog correlation", in Proceedingsof the 16th USENIX Security Symposium (Security'07)(August 2007).

6. Gu, G., Zhang, J. and Lee, W. \BotSni�er: Detectingbotnet command and control channels in networktra�c", in Proceedings of the 15th Annual Network andDistributed System Security Symposium (NDSS'08)(February 2008).

7. Wurzinger, P., Bilge, L., Holz, T., Goebel, J., Kruegel,C. and Kirda, E. \Automatically generating models forbotnet detection", in Proceedings of the 14th EuropeanConference on Research in Computer Security (ES-ORICS'09), Berlin, Heidelberg, pp. 232-249, Springer-Verlag (2009).

8. Holz, T., Gorecki, C., Rieck, K. and Freiling, F.\Measuring and detecting fast- ux service networks",in Proceedings of Annual Network and DistributedSystem Security Symposium (NDSS'08) (2008).

9. Perdisci, R., Corona, I., Dagon, D. and Lee, W. \De-tecting malicious ux service networks through passiveanalysis of recursive DNS traces", in Annual ComputerSecurity Applications Conference (ACSAC'09), pp.311-320 IEEE (2009).

10. Perdisci, R., Corona, I. and Giacinto, G. \Earlydetection of malicious ux networks via large-scalepassive DNS tra�c analysis", IEEE Transactions onDependable and Secure Computing, 9(5), pp. 714-726(2012).

11. Antonakakis, M., Perdisci, R., Dagon, D., Lee, W. andFeamster, N. \Building a dynamic reputation systemfor dns", in Proceedings of the 19th Usenix SecuritySymposium (2010).

12. Bilge, L., Kirda, E., Kruegel, C. and Balduzzi, M.\Exposure: Finding malicious domains using passive

DNS analysis", in Proceedings of Annual Network andDistributed System Security Symposium (NDSS'11)(2011).

13. Konte, M., Feamster, N. and Jung, J. \Dynamicsof online scam hosting infrastructure", Passive andActive Network Measurement, pp. 219-228 (2009).

14. Passerini, E., Paleari, R., Martignoni, L. and Bruschi,D. \Fluxor: Detecting and monitoring fast- ux servicenetworks", Detection of Intrusions and Malware, andVulnerability Assessment, pp. 186-206 (2008).

15. Nazario, J. and Holz, T. \As the net churns: Fast- ux botnet observations", in 3rd International Con-ference on Malicious and Unwanted Software (MAL-WARE'08), pp. 24-31, IEEE (2008).

16. Choi, H. and Lee, H. \Identifying botnets by capturinggroup activities in DNS tra�c", Computer Networks,56(1), pp. 20-33 (2012).

17. Bloom, B.H. \Space/time trade-o�s in hash codingwith allowable errors", Communications of the ACM,13(7), pp. 422-426 (1970).

18. Lovric, M., International Encyclopedia of StatisticalScience, Springer London (2011).

19. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M.and Euler, T. \Yale: Rapid prototyping for complexdata mining tasks", in Proceedings of the 12th ACMSIGKDD International Conference on Knowledge Dis-covery and Data Mining, pp. 935-940, ACM, (2006).

20. Tan P.N., Steinbach M. and Kumar V., Introduction toData Mining, (First Edition). Addison-Wesley Long-man Publishing Co., Inc., Boston, MA, USA (2005).

21. \Alexa top sites" (2014).http://www.alexa.com/topsites.

22. \Malware domain list" (2014).http://www.malwaredomainlist.com/.

23. \Dns-bh - malware domain blocklist" (2014).http://www.malwaredomains.com/.

24. \Atlas" (2014).http://atlas.arbor.net/summary/fast ux/.

25. \The Swiss security blog" (2014).https://abuse.ch.

26. Pang-Ning, T., Steinbach, M., Kumar, V. et al.,Introduction to Data Mining, in Library of Congress,p. 74 (2006).

Biographies

Elahe Soltanaghaei received her BS degree in Com-puter Engineering, with honors, from Amirkabir Uni-versity of Technology, Tehran, Iran in 2011 and her MSdegree in Computer Engineering from Sharif Universityof Technology, Tehran, Iran, in 2013. She is cur-rently a researcher with the Compuco International Co.(International Banking, Finance, and IT Consulting-


Iran/Switzerland). Her main research interests arecomputer networks, network security, and informationsecurity.

Mehdi Kharrazi received his BE degree in ElectricalEngineering from the City College of New York andhis MS and PhD degrees in Electrical Engineering

from the Department of Electrical and Computer Engi-neering, Polytechnic University, Brooklyn, New York,in 2002 and 2006, respectively. He is currently anAssistant Professor with the Department of ComputerEngineering, Sharif University of Technology, Iran.His current research interests include network andmultimedia security.

Reproduced with permission of the copyright owner. Further reproduction prohibited withoutpermission.

Detection of fast- ux botnets through DNS tra c analysises3ce/Elahe...

Documents

Transcript of Detection of fast- ux botnets through DNS tra c analysises3ce/Elahe...