Shady Paths: Leveraging Surfing Crowds to Detect Malicious Web Pages

Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna

University of California, Santa Barbara

{gianluca, chris, vigna}@cs.ucsb.edu

ABSTRACT

The web is one of the most popular vectors to spread malware. Attackers lure victims to visit compromised web pages or entice them to click on malicious links. These victims are redirected to sites that exploit their browsers or trick them into installing malicious software using social engineering.

In this paper, we tackle the problem of detecting malicious web pages from a novel angle. Instead of looking at particular features of a (malicious) web page, we analyze how a large and diverse set of web browsers reach these pages. That is, we use the browsers of a collection of web users to record their interactions with websites, as well as the redirections they go through to reach their final destinations. We then aggregate the different redirection chains that lead to a specific web page and analyze the characteristics of the resulting redirection graph. As we will show, these characteristics can be used to detect malicious pages.

We argue that our approach is less prone to evasion than previous systems, allows us to also detect scam pages that rely on social engineering rather than only those that exploit browser vulnerabilities, and can be implemented efficiently. We developed a system, called SpiderWeb, which implements our proposed approach. We show that this system works well in detecting web pages that deliver malware.

Categories and Subject Descriptors

K.6.5 [Security and Protection]: Invasive software; H.3.5 [Online Information Services]: Web-based services

General Terms

Security

Keywords

HTTP redirections, detection, malware

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CCS’13, November 4–8, 2013, Berlin, Germany.
Copyright 2013 ACM 978-1-4503-2477-9/13/11 ...$15.00.
http://dx.doi.org/10.1145/2508859.2516682.

1. INTRODUCTION

The world-wide web is one of the main vectors used to spread malware. The most common way of infecting victims with malware is to present them with a web page that performs a drive-by download attack [18]. In a drive-by download attack, the malicious web page tries to exploit a vulnerability in the victim’s browser or in a browser plugin. If successful, the injected code instructs the victim’s computer to download and install malicious software. In other cases, attackers rely on simple social engineering scams to infect victims. That is, the web page tricks users into downloading and installing malicious software without exploiting any client-side vulnerability; an example of this is fake antivirus scams [23]. In this paper, we refer to the pages used to deliver both types of attacks as malicious web pages.

Existing systems to identify malicious web pages fall into two main categories. The first category uses techniques to statically analyze web pages and embedded code (such as JavaScript or Flash), looking for features that are typical of malicious activity. Such features can be patterns in the Uniform Resource Locators (URLs) of web pages [14,31], the presence of words or other information that are associated with malicious content [16,25,26], or the presence of malicious JavaScript snippets that are typical for exploits [2,3]. The second category leverages dynamic techniques; these approaches rely on visiting websites with an instrumented browser (often referred to as a honeyclient), and monitoring the activity of the machine for signs that are typical of a successful exploitation (e.g., the creation of a new process) [15,20,21,28].

Although existing systems help in making users safer, they suffer from a number of limitations. First, attackers can obfuscate their pages to make detection harder and, in many cases, they are able to successfully evade feature-based systems [8]. Second, attackers frequently hide their exploits by leveraging a technique called cloaking [27,30]. This technique works by fingerprinting the victim’s web browser, and revealing the malicious content only in case the victim is using a specific version of the browser and a vulnerable plugin. Additionally, miscreants check the IP address of their visitors, and display a benign page to those IP addresses that belong to security companies or research institutions. Cloaking makes the work of dynamic analysis approaches much harder, because defenders need to run every combination of web browsers and plugins to ensure complete coverage (or use special techniques to work around this requirement [9]). Also, defenders require elaborate routing infrastructures to hide the source of their traffic. A third limitation is that

many techniques, in particular honeyclients, can only detect malicious pages that exploit vulnerabilities — they cannot identify scams that target users with social engineering. Fourth, most dynamic techniques introduce a considerable amount of overhead into the instrumented browser, making these approaches difficult to deploy as online detectors.

In this paper, we propose a novel approach to detect malicious web pages. To this end, we leverage a large population of web surfers as diverse, distributed “sensors” that collect their interactions with the web pages that they visit. The interactions we are interested in are HTTP redirection chains that are followed by clients before reaching web pages. Redirections are widely used by attackers, because they make detection harder, and they give the flexibility of changing any intermediate step in the chain, without having to update all the entry points (i.e., the pages containing the links that start the redirection process). Redirection chains are an important part of the malware delivery infrastructure and, unlike the web page contents, they are not easy to obfuscate. Of course, HTTP redirections have many legitimate uses too. For example, advertising networks make extensive use of redirections to make sure that a suitable ad is displayed to the right user. The key difficulty is distinguishing between chains that correspond to malicious activity and those that are legitimate.

A few previous systems, SURF and WarningBird [10,13], have incorporated redirection chain information into their detection approaches. To make the distinction between malicious and benign redirects, however, these systems all require additional information for their analysis. For example, WarningBird [10] analyzes features of the Twitter profile that posted a malicious URL. Unfortunately, attackers can easily modify the characteristics of those Twitter profiles, and make their detection ineffective.

In our approach, we aggregate the redirection chains from a collection of different users, and we build redirection graphs, where each redirection graph shows how a number of users have reached a specific target web page. By analyzing both the properties of the set of visitors of a page and its corresponding redirection graph, we can accurately distinguish between legitimate and malicious pages. In particular, we show that malicious web pages have redirection graphs with very different characteristics than the ones shown by legitimate ones. No information about the destination page is required. It is important to note that our approach can only detect malicious web pages that make use of redirection chains. However, this is not a severe limitation, because, as previous work pointed out, the use of redirection chains is pervasive in the malware delivery process [10,13,23,29].

Our approach overcomes the limitations listed previously. We do not look at the content of web pages at all, and therefore, we are not affected by attempts to obfuscate the elements on a page or the scripts that are executed. Moreover, we do not check for exploits, so the system can cover a broader class of malicious pages. Also, the effect of cloaking is mitigated, because of the diverse population of web surfers involved in the data collection. As a last aspect, the data collection from the user browsers is lightweight, because data is collected as users browse the web, without any additional computation. Therefore, the performance penalty experienced by the users is negligible.

We implemented our approach in a tool called SpiderWeb. We ran our system on a dataset representative of the browsing history of a large population of users, and we show that SpiderWeb is able to detect malicious web pages that evade state-of-the-art systems.

In summary, this paper makes the following contributions:

• We propose a new way of detecting malicious web pages by collecting HTTP redirection data from a large and diverse collection of web users. The individual redirection chains are combined into redirection graphs that allow for the distinction between legitimate and malicious pages.

• We developed a system, called SpiderWeb, which is able to detect malicious web pages by looking at their redirection graphs, and we show that the system works well in practice, by running it on a dataset collected from the users of a popular anti-virus tool.

• We show that SpiderWeb is a useful tool to improve the detection provided by state-of-the-art systems, and that a redirection-graph based detection is hard to evade by attackers.

2. HTTP REDIRECTIONS

HTTP redirections are used by web developers to redirect the user to a different URL than the one originally accessed. Typically, this happens by sending a specific HTTP response to the user’s browser, followed by the new page that the client should access. The URL of this new page is sent in the Location header field of the HTTP response. When the browser receives a redirection response, it automatically redirects the user to the target page. Modern browsers transparently redirect their users when they receive a 301, 302, 303, 304, or 307 HTTP response [4]. Although each code has a different meaning, browsers behave in the same way upon receiving any of them, so we do not treat them differently either. Other types of redirections include HTML-based redirections (triggered by the META tag in a web page), JavaScript-based redirections, or Flash-based redirections. In our analysis, we do not differentiate between these classes of redirections.
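The transparent redirect handling described above can be sketched as follows (a minimal illustration, not part of the paper’s system; the function name and header representation are our own):

```python
# Status codes the text lists as triggering a transparent browser
# redirect. (Browsers treat 304 as a cache revalidation rather than a
# true redirect, but we follow the paper's list here.)
REDIRECT_CODES = {301, 302, 303, 304, 307}

def next_hop(status, headers):
    """Return the URL from the Location header if the response is a
    redirect a browser would follow automatically, else None."""
    if status not in REDIRECT_CODES:
        return None
    for name, value in headers.items():
        if name.lower() == "location":  # header names are case-insensitive
            return value
    return None
```

Following a full chain then amounts to calling next_hop on each response until it returns None.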

2.1 HTTP Redirection Graphs

In this section, we describe how we build redirection graphs. We start with introducing HTTP redirection chains, as they represent the building blocks of redirection graphs.

Formalizing HTTP redirection chains. To define HTTP redirection chains, we start by defining two types of entities: users and web pages. A user is represented by a tuple

U = <I, O, B, C>,

where I is the IP address of the user, O is a string that identifies her operating system, B is a string that identifies her browser’s configuration (the browser type, version, and the different plugins that are installed), and C is the country corresponding to I. A web page is represented by a tuple

W = <URL, D, L, TLD, Pg, Pa, I, C>,

where URL is the URL to the web page, D is the domain of the URL (including the Top-Level Domain), L is the length of the domain name (in characters), TLD is the Top-Level Domain of the URL (notice that we take care of special cases such as .co.uk), Pg is the name of the page in the URL (more precisely, the sub-string of the URL that goes from the last “/” character to either the “?” character or the end of the URL, whichever comes first), Pa is a set of strings that holds the names of the parameters in the URL, I is the IP address that hosts the web page, and C is the country in which the web page is hosted. For example, consider the URL http://evil.com/a/search.php?q=foo&id=bar. For this page, the full URL is URL, D is evil.com, L is 4, TLD is .com, Pg is search.php, Pa is (q,id), and the IP address returned by the DNS server for evil.com is I. C is calculated by performing geolocation on I.

Given our notion of users and web pages, we define a redirection chain as a tuple
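The field extraction for W can be sketched in Python. This is an illustrative helper (not the paper’s implementation); note the TLD split is naive, taking only the last label, so special cases such as .co.uk, which the paper handles, are not covered:

```python
from urllib.parse import urlparse, parse_qs

def page_tuple(url):
    """Extract the (D, L, TLD, Pg, Pa) fields of a web page tuple W
    from its URL (naive sketch, last-label TLD split)."""
    p = urlparse(url)
    domain = p.hostname                       # D, including the TLD
    name, _, tld = domain.rpartition(".")
    name = name.rsplit(".", 1)[-1]            # label right before the TLD
    pg = p.path.rsplit("/", 1)[-1]            # Pg: after the last "/"
    pa = set(parse_qs(p.query, keep_blank_values=True))  # Pa: names only
    return domain, len(name), "." + tld, pg, pa

# For the example URL in the text:
# page_tuple("http://evil.com/a/search.php?q=foo&id=bar")
# -> ("evil.com", 4, ".com", "search.php", {"q", "id"})
```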

RC = <Urc, G, Ref, Lan, Fin>,

where Urc is the user visiting the redirection chain and G is a graph <V, E>, where V is the set of web pages that are accessed in the redirection chain and E is a set of edges that specifies the order in which the web pages in V are accessed. Each edge is a tuple <v1, v2> containing the starting web page instance v1 and the destination web page instance v2. Ref, Lan, and Fin are three “special” web pages in the redirection chain:

• Ref represents the referrer, which is the URL the user was coming from when she entered the redirection chain. The referrer could be the URL of the web page that contains the link that the user clicked on. Alternatively, it could be the URL of the web page that hosts an iframe. If the user inserted the URL manually into the browser address bar, the referrer will be empty. In this paper, we consider the referrer to be the first node of a redirection chain.

• Lan is the landing page, which is the web page on which the first redirection happens.

• Fin is the final page, which is the last web page in the redirection chain. The final page is the web page in G that has no outgoing edges, except for self-loops.
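The rule for identifying Fin can be written out directly (a hypothetical helper with our own graph encoding, not code from the paper):

```python
def final_page(vertices, edges):
    """Return the single vertex with no outgoing edges other than
    self-loops, or None if no unique such vertex exists."""
    has_real_out = {v1 for (v1, v2) in edges if v1 != v2}
    finals = [v for v in vertices if v not in has_real_out]
    return finals[0] if len(finals) == 1 else None

# A chain ref -> lan -> fin where fin redirects to itself:
# final_page(["ref", "lan", "fin"],
#            [("ref", "lan"), ("lan", "fin"), ("fin", "fin")]) -> "fin"
```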

As we will see later, referrer, landing, and final page carry important information about the redirection chain that is useful to assess the maliciousness of a web page.

Building a redirection graph. As mentioned previously, we want to combine the redirection chains directed to the same destination into a redirection graph. A redirection graph is defined as:

RedGraph = <R, Crg, Urg, Grg, Refrg, Lanrg, Finrg>,

where R is the target node where all the chains in the graph end. In the basic case, R will be a single web page with a specific URL. However, in other cases, it might be useful to group together web pages that have different URLs, but share common characteristics. For example, in the case of a malicious campaign that infects websites always using the same web page name, the domain names of the infected pages will be different for different URLs. However, other parts of these URLs will be the same.

To support graphs that point to pages with different URLs, we allow groups in which R has certain fields that are marked with a wildcard. We consider a web page W to match R if, for each field of R that is not a wildcard, W and R contain the same value. In Section 3.2 we describe different ways for specifying R and, hence, for aggregating redirection chains into a redirection graph.
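The wildcard matching rule can be sketched as follows (an illustration with our own conventions: pages are dicts over field names and "*" marks a wildcarded field of R):

```python
WILDCARD = "*"  # our marker for "any value" in a field of R

def matches(page, target):
    """True if, for every non-wildcard field of the target R, the web
    page W carries the same value."""
    return all(v == WILDCARD or page.get(k) == v
               for k, v in target.items())
```

For example, a target with a wildcarded domain matches every page that agrees on the remaining fields.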

In our definition of redirection graphs, Crg is a set that contains all the individual redirection chains that make up the graph RedGraph. Urg is a set that contains all users in the redirection chains Crg. Grg = <Vrg, Erg> is a graph composed of the union of all the redirection chains in Crg. Refrg, Lanrg, and Finrg are sets that contain all the referrers, the landing pages, and the final pages, respectively, of the redirection chains in Crg.

Given a set of redirection chains, and an instance of R, we build the redirection graph RedGraph as follows:

Step 1: Initially, Urg, Grg, Refrg, Lanrg, and Finrg are empty.

Step 2: We compare each redirection chain to R. For each redirection chain, we compare its final page Fin to R. If Fin matches R, we will use the chain to build the redirection graph. Otherwise, the redirection chain is discarded.

Step 3: For each redirection chain that matches R, we add the user U to Urg, the referrer Ref to Refrg, the landing page Lan to Lanrg, and the final page Fin to Finrg.

Step 4: For each redirection chain that matches R, we compare the graph G = <V, E> to the graph Grg = <Vrg, Erg>. For each vertex v in V, if Vrg does not have a vertex with the same exact values as v, we add v to Vrg. For each edge e = <v1, v2> in E, if Erg does not contain an edge starting at v1 and ending at v2, we add the edge e to Erg.
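Steps 1–4 can be sketched as follows (a minimal illustration, not the paper’s code: each chain is a dict with user/ref/lan/fin/vertices/edges fields, and matches_R stands in for the wildcard comparison against R):

```python
def build_redgraph(chains, matches_R):
    """Aggregate the redirection chains whose final page matches R
    into one redirection graph (Steps 1-4)."""
    # Step 1: start with empty sets.
    g = {"users": set(), "refs": set(), "lans": set(), "fins": set(),
         "vertices": set(), "edges": set()}
    for c in chains:
        # Step 2: discard chains whose final page does not match R.
        if not matches_R(c["fin"]):
            continue
        # Step 3: collect user, referrer, landing page, and final page.
        g["users"].add(c["user"]); g["refs"].add(c["ref"])
        g["lans"].add(c["lan"]); g["fins"].add(c["fin"])
        # Step 4: merge vertices and edges, deduplicating via sets.
        g["vertices"] |= set(c["vertices"])
        g["edges"] |= set(c["edges"])
    return g
```

Two chains ending at the same final page thus share the deduplicated edge into that page, while chains with a non-matching final page are dropped.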

2.2 Legitimate Uses of HTTP Redirections

HTTP redirections have many legitimate uses. The most straightforward one is to inform a visitor that a certain website has moved. For example, if a user accesses a web page under an old domain, she will be automatically redirected to the new one. As another example, HTTP redirections are used to perform logins on websites. If a user accesses a page without being logged in, she is redirected to a login page. After providing her credentials, the web page will check them and, in case they are correct, redirect the user back to the page she originally tried to access. Advertising networks (ad networks) are another example of legitimate uses of HTTP redirections. Typically, advertisements undergo a series of redirections, to allow the advertiser to find the ad that is the right match for a user [24].

Since redirection chains are so pervasive, we cannot simply flag all of them as malicious. In certain ways, malicious and benign redirection graphs look very similar. Fortunately, there are some features that are characteristic of malicious redirection graphs, and these are the ones that we can use for detection. We describe them in detail in the next section.

2.3 Malicious Redirection Graphs

To build our system, we are interested in the characteristics that distinguish malicious redirection graphs from the ones generated by legitimate uses.

Uniform software configuration. The exploits and/or malicious code used by cyber-criminals target a specific (set of) browser, plugin, or operating system versions. To avoid giving away their exploit code, or triggering content-aware detection systems when not needed, attackers typically use cloaking, and serve the malicious JavaScript only when the exploit has a chance of successfully compromising the user’s machine. Cloaking can be performed by the malicious page itself, by fingerprinting the user’s browser on the client side. Alternatively, it can be performed during an intermediate step of the redirection chain on the server side [29]. In the latter case, the victim will be redirected to the exploit page only in case her operating system and/or browser are vulnerable to the exploits that the malicious web page is serving.

From the point of view of our system, cloaking performed during the redirection process will lead to the situation where only victims with a uniform software configuration visit the malicious web page. Users who do not run a browser that matches a vulnerable software configuration will be redirected to a different page, possibly on a different domain. Note that modern exploit kits target multiple browser and plugin versions. For this reason, in some cases, we do not observe a single browser version accessing a malicious web page, but a few of them. However, the diversity of browser configurations visiting the attack page will still be lower than the one visiting a popular legitimate web page.

Cross-domain redirections. Malicious redirection chains typically span over multiple domains. This happens mostly because it makes the malicious infrastructure resilient to take-down efforts. It also makes it modular, in the sense that a redirection step can be replaced without having to modify all the web pages in the redirection chain. On the other hand, benign redirection chains typically contain one or more pages on the same domain. An example of this behavior is the login process on a website; a user who is not authenticated is first redirected to a login page, and, after providing valid credentials, back to the original page.

Hubs to aggregate and distribute traffic. It is common for victims who click on malicious links to be first redirected to an aggregating page, which forwards them to one of several backend pages [28]. This behavior is similar to the one seen with legitimate advertising networks, where the user is first redirected to a central distribution point. From this point, the user will then be forwarded to one of the many different ads. Attackers use this type of infrastructure for robustness. By having a hub, it is easy for them to redirect their victims to a new, malicious web page in case the current one is taken down. Since the aggregation point does not serve the actual attack, it is more difficult to shut it down.

Commodity exploit kits. The underground economy has evolved to the point that exploit kits are sold to customers like any other software, and even offered as software as a service [5]. An exploit kit is a software package that takes care of exploiting the victims’ browsers and infecting their computers with malware. After buying an exploit kit, a cyber-criminal can customize it to suit his needs, and install it on a number of servers. Typically, this customization includes the selection of a specific template that is used to generate one-time URLs on the fly. Previous work showed that detecting such templates is important, and can be leveraged for malicious URL detection [31]. We assume that different groups of attackers might use the same URL-generation templates for each server they control.

Geographically distributed victims. Cyber-criminals often try to reach as many victims as possible, irrespective of the country they live in. As a result, the victims of a campaign will come from a diverse set of countries and regions. On the other hand, many legitimate sites have a less diverse user population, since a large fraction of their users come from the country in which the website is located.

3. SYSTEM DESIGN

We leveraged the observations described in the previous section to build a system, called SpiderWeb, which determines the maliciousness of a web page by looking at its redirection graph. Our approach operates in three steps: First, we need to obtain input data for SpiderWeb. To this end, we collect redirection chains from a large set of diverse web users. In the second step, SpiderWeb aggregates the redirection chains leading to the same (or similar) page(s) into redirection graphs. As a last step, SpiderWeb analyzes each redirection graph, and decides whether it is malicious.

3.1 Collecting Input Data

SpiderWeb requires a set of redirection chains as input. For this work, we obtained a dataset from a large anti-virus (AV) vendor. This dataset contains redirection chain information from a large number of their global users. The data was generated by the users who installed a browser security product, and voluntarily shared their data with the company for research purposes. This data was collected according to the company’s privacy policy. In particular, the data did not contain any personally identifiable information: The source IP of the clients was replaced with a unique identifier and the country of origin. Moreover, the parameter values in URLs along the redirection chains were obfuscated. Thus, the data contained only the domain, path, and parameter name information of the URLs, which is sufficient for our approach. We will discuss the dataset in more detail in Section 4.

Note that SpiderWeb is not limited to this particular dataset. Any collection of redirection data from a crowd of diverse users would represent a suitable input.

3.2 Creating Redirection Graphs

Given a set of redirection chains, the second step is to aggregate them into redirection graphs. To this end, the system leverages the algorithm described in Section 2.1.

A key challenge is to determine which pages should be considered “the same” by this algorithm. The most straightforward approach is to consider pages to be the same only when their URLs (domain, path, parameter names) match exactly. However, certain malware campaigns change their URLs frequently to avoid detection. This can be done by changing the domain or by varying the path and parameter values. If SpiderWeb aggregated only redirection chains that lead to the exact same URL, it might miss campaigns that obfuscate their URLs. To overcome this problem, we propose five different ways to recognize variations of destination pages, where all chains that lead to one of these pages should be aggregated. In other words, there are five different ways to specify page similarity via R.

Domain + page name + parameters: This grouping is the most specific one. Pages are considered the same only when all parts of the URL (domain, path and name of the page, and parameter names) match.

Top-level domain + page name: With this grouping, two pages will be considered the same when they share the same top-level domain and page name. Cyber-criminals frequently register their domains in bulk, using less popular and less vigilant domain registrars (e.g., .co.cc or .am) [7]. However, they use an automated setup for their exploit infrastructure, and hence, the page names remain unchanged.

Top-level domain + page name + parameters: This grouping is similar to the previous one, but also takes into account the names of the parameters in the URL. This grouping works well for campaigns that use specific URL parameters. It is also useful to detect malicious campaigns in cases where the previous grouping would be too general (according to our popularity filter, described below).

Domain name length + Top-level domain + page name + parameters: For this grouping, we consider the length of the domain name instead of the concrete name itself. This grouping is useful to detect campaigns that create domains based on a specific template, such as random strings of a specific length.

IP address: This grouping leverages the observation that even if attackers completely vary their URLs (and hence, could bypass the previous groupings), they only use a small set of servers to host their malicious web pages (typically, on bulletproof hosting providers). In such cases, the IP address grouping will allow SpiderWeb to aggregate all chains that end at the same IP address.

Domain + Page + Parameters:           evil.com, index.php, id
TLD + Page:                           com, index.php
TLD + Page + Parameters:              com, index.php, id
Domain length + TLD + Page + Param.:  4, com, index.php, id
IP:                                   IP address for evil.com

Table 1: Different fields used for grouping for the URL http://evil.com/dir/index.php?id=victim.
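The five grouping keys can be sketched as follows (an illustrative helper, not the paper’s code: the TLD split naively takes the last label, and the host IP is passed in rather than resolved via DNS):

```python
from urllib.parse import urlparse, parse_qs

def grouping_keys(url, ip):
    """Compute the five grouping keys used to aggregate redirection
    chains into redirection graphs (naive TLD handling)."""
    p = urlparse(url)
    domain = p.hostname
    name, _, tld = domain.rpartition(".")
    name = name.rsplit(".", 1)[-1]           # label before the TLD
    page = p.path.rsplit("/", 1)[-1]         # page name
    params = tuple(sorted(parse_qs(p.query, keep_blank_values=True)))
    return {
        "domain+page+params": (domain, page, params),
        "tld+page": (tld, page),
        "tld+page+params": (tld, page, params),
        "len+tld+page+params": (len(name), tld, page, params),
        "ip": ip,
    }
```

For http://evil.com/dir/index.php?id=victim this reproduces the rows of Table 1.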

To see an example of the different parts of the URL that are considered for each grouping, consider Table 1. In addition to groupings, SpiderWeb leverages two thresholds during its operation. The first threshold discards those groupings that are too general, while the second one discards redirection graphs that have low complexity and do not allow an accurate decision.

Popularity filter. Intuitively, certain ways of grouping similar pages might be too generic and result in the combination of chains that do not belong together. For example, grouping all chains that end in URLs that have .com as their top-level domain and index.html as their page name would not produce a meaningful redirection graph. The reason is that most of the landing pages in Finrg will not be controlled by the same entity and, thus, will not share similar properties. To exclude unwanted redirection graphs from further analysis, we introduce a popularity metric. More precisely, for each redirection graph, we look at its redirection chains and extract all domains that belong to the chains' final pages. If any domain has been accessed more than a threshold tp times in the entire dataset, the graph is discarded. The intuition is that one (or more) domain(s) included in the graph are so popular that it is very unlikely that this graph captures truly malicious activity. In Section 4.2, we describe how we determined a suitable value for threshold tp for our dataset.

Complexity filter. On the other side of the spectrum (opposite of too generic graphs) are graphs that are too small. That is, these graphs contain too few chains to extract meaningful features. Typically, these graphs belong to legitimate sites where one landing page redirects to one specific destination. Intuitively, this makes sense: a malicious website is likely to be composed of redirections that start at many different web sites (e.g., infected web pages). For this reason, SpiderWeb discards any redirection graph that contains fewer than tc distinct redirection chains. Again, we discuss our choice for the value of tc in Section 4.2.
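The two filters can be sketched as follows. The chain representation (a list of URLs ending at the final page) and the precomputed per-domain access counts are our assumptions; the threshold values 8 and 4 are the ones selected in Section 4.2.

```python
from collections import Counter
from urllib.parse import urlparse

def passes_filters(graph_chains, domain_visits, t_p=8, t_c=4):
    """Apply the popularity and complexity filters to one redirection graph.

    graph_chains  : list of redirection chains, each a list of URLs ending
                    at the final page (assumed representation).
    domain_visits : Counter of accesses per domain over the whole dataset.
    t_p, t_c      : popularity and complexity thresholds (Section 4.2).
    """
    # Complexity filter: too few distinct chains -> discard.
    if len({tuple(c) for c in graph_chains}) < t_c:
        return False
    # Popularity filter: any overly popular final-page domain -> discard.
    final_domains = {urlparse(c[-1]).hostname for c in graph_chains}
    if any(domain_visits[d] > t_p for d in final_domains):
        return False
    return True
```

A graph that fails either filter is treated as benign and excluded from classification.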

3.3 Classification Component

In the third step, SpiderWeb takes a redirection graph as input and produces a verdict on whether the final web page(s) of this graph are malicious or not. This step represents the core of our detection technique. We developed a set of 28 features that describe the characteristics of a redirection graph. These features fall into five main categories: client features, referrer features, landing page features, final page features, and redirection graph features. Their purpose is to capture the typical characteristics of malicious redirection graphs as reflected in the discussion in Section 2.3.

An overview of our features is given in Table 2. It can be seen that most features (21 out of 28) have not been used before. The reason is that many features rely on the fact that we collect data from a distributed crowd of diverse users. Specifically, these include client features and redirection graph features. In the following, we describe our features in more detail. Then, we discuss how the features are leveraged to build a classifier.

Client features. These features analyze characteristics of the user population Urg that followed the redirection chains in the graph RedGraph. The Country Diversity feature is computed by counting the number of different countries for the users in Urg and dividing this value by the size of Crg (which is the number of redirection chains in RedGraph). As discussed previously, many legitimate web pages are visited by users in a small number of countries. The Browser Diversity and Operating System Diversity features reflect the diversity in the software configurations of the clients that access a certain web page. Similarly to the Country Diversity feature, these two features are computed by taking the number of distinct browsers and operating systems in Urg and dividing it by the size of Crg. If all clients that visit a page share a similar software configuration, this is often the result of profiling their browsers to deliver effective exploits.

Referrer features. These features capture characteristics of the referrers that lead users into redirection chains. Referrers provide information about the locations of the links on which users click to reach the actual final web page, or information about the locations of iframes, in case iframes are used to trigger redirections.
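The three client features above reduce to distinct-value counts divided by |Crg|. A minimal sketch, assuming Urg is flattened into (country, browser, operating system) tuples:

```python
def client_features(users, n_chains):
    """Client features: diversity ratios over the user population U_rg.

    users    : list of (country, browser, os) tuples, one per observed
               user (an assumed flat representation of U_rg).
    n_chains : |C_rg|, the number of redirection chains in the graph.
    """
    countries = {u[0] for u in users}
    browsers  = {u[1] for u in users}
    systems   = {u[2] for u in users}
    return {
        "country_diversity": len(countries) / n_chains,
        "browser_diversity": len(browsers) / n_chains,
        "os_diversity":      len(systems) / n_chains,
    }
```

A population that shares one operating system across many chains yields a low os_diversity, which is exactly the browser-profiling signal described above.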

The Distinct Referrer URLs feature counts the number of different referrer URLs in Refrg, divided by the size of Crg. Malicious redirection chains are typically accessed by a smaller number of referrers than legitimate ones. For example, a malicious campaign that spreads through social networks will show only referrers from such sites (e.g., Twitter or Facebook), while a legitimate news site will be pointed to by a higher variety of web pages (e.g., blogs or other news sites). The Referrer Country Ratio feature is computed similarly to the Country Diversity described in the previous section. This ratio characterizes how geographically distributed the various starting points (referrers) that lead to a certain web page are. Certain types of legitimate web pages (e.g., news sites) are accessed mostly through links posted in specific countries. On the other hand, malicious web pages are often linked by websites that reside in different countries. The Referrer Parameter Ratio feature is computed by taking the number of distinct parameters Pa in Refrg, divided by the size of Crg. This feature provides information about the diversity of the pages that lead to the redirection chain. The Referrer Has Parameters feature measures how many of the web page entities in Refrg have a value for Pa, divided by the size of Crg. The last two features help in understanding how a web page is accessed. For example, in the case of a malicious web page that is promoted through Search Engine Optimization (SEO), one expects to see a high number of distinct search engine queries that lead to a malicious page. This is not the case for legitimate pages, which will be found with a limited number of search keywords.

Feature Category             Feature Name                                       Feature Domain   Novel
Client Features              Country Diversity                                  Real             ✓
                             Browser Diversity                                  Real             ✓
                             Operating System Diversity                         Real             ✓
Referrer Features            Distinct Referrer URLs                             Integer          ✓
                             Referrer Country Ratio                             Real             ✓
                             Referrer Parameter Ratio                           Real             ✓
                             Referrer Has Parameters                            Real             ✓
Landing Page Features        Distinct Landing Page URLs                         Integer          [10]
                             Landing Country Ratio                              Real             ✓
                             Landing Parameter Ratio                            Real             ✓
                             Landing Has Parameters                             Real             ✓
Final Page Features          Distinct Final Page URLs                           Integer          [10]
                             Final Parameter Ratio                              Real             ✓
                             Top-Level Domain                                   String           [25]
                             Page                                               String           [25]
                             Domain is an IP                                    Boolean          [13]
Redirection Graph Features   Maximum Chain Length                               Integer          [10,13]
                             Minimum Chain Length                               Integer          [10,13]
                             Maximum Same Country of IP and Referrer            Boolean          ✓
                             Maximum Same Country Referrer and Landing Page     Boolean          ✓
                             Maximum Same Country Referrer and Final Page       Boolean          ✓
                             Minimum Same Country of IP and Referrer            Boolean          ✓
                             Minimum Same Country Referrer and Landing Page     Boolean          ✓
                             Minimum Same Country Referrer and Final Page       Boolean          ✓
                             Intra-domain Step                                  Boolean          ✓
                             Graph has Hub 30%                                  Boolean          ✓
                             Graph has Hub 80%                                  Boolean          ✓
                             Self-loop on the Final Page                        Boolean          ✓

Table 2: Features used by SpiderWeb to detect malicious domains. It can be seen that most features used by our system are novel. For those that have been used in the past, we included a reference to the works that used such features.

Landing page features. Analyzing the landing pages of a redirection graph provides interesting insights into the diversity of the URLs that are used as landing pages. The four features in this category (Distinct Landing Page URLs, Landing Country Ratio, Landing Parameter Ratio, and Landing Has Parameters) are calculated in the same way as for the referrer features, by dividing the distinct values of URLs, countries, parameters, and the number of non-empty parameters in Lanrg by the size of Crg. Legitimate and malicious landing pages differ in a similar way to the referrers.

Final page features. Similar to the referrer and the landing page analysis, examining the characteristics of the final pages helps us in distinguishing between legitimate and malicious redirection graphs.
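The ratio-style features are computed identically for referrers, landing pages, and (in part) final pages. A minimal sketch, assuming each entity is a (url, country) pair; note that the prose divides the distinct-URL count by |Crg|, which is what we do here:

```python
from urllib.parse import urlparse, parse_qsl

def ratio_features(pages, n_chains):
    """Ratio features shared by the referrer and landing-page categories.

    pages    : list of (url, country) pairs for the referrer (or landing
               page) entities in the graph -- an assumed representation.
    n_chains : |C_rg|, the number of redirection chains in the graph.
    """
    urls = [p[0] for p in pages]
    param_names = set()
    with_params = 0
    for u in urls:
        kv = parse_qsl(urlparse(u).query)
        param_names.update(k for k, _ in kv)   # distinct parameters Pa
        with_params += bool(kv)                # entities with a value for Pa
    return {
        "distinct_url_ratio": len(set(urls)) / n_chains,
        "country_ratio":      len({p[1] for p in pages}) / n_chains,
        "parameter_ratio":    len(param_names) / n_chains,
        "has_parameters":     with_params / n_chains,
    }
```

The same helper can be called once on Refrg and once on Lanrg to obtain both feature groups.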

The Distinct Final Page URLs feature value is calculated in the same way as the value for Distinct Referrer URLs. Having a large number of distinct URLs for the final page is suspicious: it indicates that attackers are generating one-time URLs. The Final Parameter Ratio is calculated in the same way as for the referrer and the landing pages. Based on our observations, benign redirection graphs exhibit a higher variety of GET parameters than malicious ones. One reason for this is that legitimate web pages often use single pages that provide multiple services, and each different service is selected by a set of specific GET parameters. The Top-Level Domain feature identifies the set of top-level domain(s). Some top-level domains are more likely to host malicious web pages than others, because of the cost of registering a new domain under such a top-level domain [12]. The Page feature represents the page names, if any (e.g., index.html). As explained previously, many exploit kits use recognizable URL patterns. The Domain is an IP feature indicates if, for any destination web page, the domain value is the same as the IP. Many malicious web pages are accessed through their IP addresses directly, as previous work noted [13].

Redirection graph features. These features characterize the redirection graph as a whole. In Section 4.3, we show that this category of features is critical for malicious web page detection, because it allows us to flag as malicious a large number of malicious web pages that would otherwise look legitimate.

The Maximum and Minimum Chain Length features capture the maximum and minimum number of redirection steps observed for a single redirection chain in Crg. Previous work showed that a long redirection chain can be a sign of malicious activity [10,13]. We then introduce six Boolean features that model the country distributions for important nodes that are involved in the redirection process. The calculation of these six features works as follows: for all redirection chains in Crg, we examine pairs of important elements s1 and s2. The elements s1 and s2 can be either the user and the referrer, the referrer and the landing page, or the referrer and the final page. For each pair s1 and s2, we check whether the corresponding entities are located in the same country. If this is true, we set the value V for this chain to true; otherwise, we set it to false. In the next step, for each of the three pairs, we compute the logic AND as well as the logic OR of the values V. We expect legitimate redirection chains to have more geographic consistency across the different steps of the chain (i.e., having multiple steps located in the same country). On the other hand, malicious redirection chains, which are often made up of compromised servers, tend to have multiple steps located in different countries. A similar observation was made by Lu et al. [13]. Notice that, for the features in this category, we only considered the minimum and maximum values, and not the average. The reason for this is that the average values turned out not to be discriminative at all. We perform a more detailed analysis of the features that we decided to use in Section 4.3.
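The six country-consistency features reduce to an OR ("Maximum") and an AND ("Minimum") of the per-chain value V for each of the three element pairs. A minimal sketch, where the per-chain country dict is an assumed representation:

```python
def country_consistency_features(chains):
    """The six Boolean country-consistency features.

    chains : list of dicts giving, for each chain, the country of the
             client, referrer, landing page, and final page (assumed layout).
    For each pair, V is true when both ends are in the same country; the
    "Maximum" feature is the logic OR of V over all chains, and the
    "Minimum" feature is the logic AND.
    """
    pairs = [("client", "referrer"), ("referrer", "landing"),
             ("referrer", "final")]
    feats = {}
    for a, b in pairs:
        values = [c[a] == c[b] for c in chains]
        feats[f"max_same_country_{a}_{b}"] = any(values)  # logic OR
        feats[f"min_same_country_{a}_{b}"] = all(values)  # logic AND
    return feats
```

A graph whose "Minimum" features are all false has no chain that is geographically consistent, the pattern expected of compromised-server infrastructure.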

The Intra-domain Step feature captures whether, along any of the redirection chains in Crg, there is at least one step where a page redirects to another page on the same domain. If this holds, the feature is set to true; otherwise, it is set to false. Legitimate redirection chains often contain intra-domain steps, while malicious chains typically contain domains that are all different.
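A minimal sketch of the Intra-domain Step feature, together with the Self-loop on the Final Page feature described later in this section. The chain representation is our assumption, and we approximate a self-loop as two consecutive final URLs that differ at most in their query parameters:

```python
from urllib.parse import urlparse

def step_features(chains):
    """Intra-domain Step and Self-loop on the Final Page features.

    chains : list of redirection chains, each a list of URLs ending at
             the final page (assumed representation).
    """
    def dom(u):
        return urlparse(u).hostname

    # Any consecutive step that stays on the same domain?
    intra = any(dom(a) == dom(b)
                for c in chains for a, b in zip(c, c[1:]))
    # Self-loop: the last step returns to the same page, possibly with
    # different parameters (our approximation: equal URL up to the query).
    self_loop = any(len(c) >= 2 and c[-1].split("?")[0] == c[-2].split("?")[0]
                    for c in chains)
    return {"intra_domain_step": intra, "self_loop_final": self_loop}
```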

Two Graph has Hub features (30% and 80%) characterize the shape of the redirection graph. More precisely, they capture whether a hub is present in a redirection graph. To locate a hub, we look for a page (other than any final page in Finrg) that is present in at least 30% (or 80%, respectively) of the redirection chains. These features are based on the assumption that malicious redirection graphs are much more likely to contain hubs.
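Hub detection as described above can be sketched as follows, again assuming chains are lists of URLs:

```python
from collections import Counter

def hub_features(chains, final_pages):
    """Graph has Hub 30% / 80%: is some non-final page present in at
    least that fraction of the redirection chains?

    chains      : list of redirection chains (lists of URLs).
    final_pages : set of final-page URLs (Fin_rg), excluded as hub candidates.
    """
    counts = Counter()
    for c in chains:
        for url in set(c) - final_pages:   # count each page once per chain
            counts[url] += 1
    top = max(counts.values(), default=0)  # most frequent hub candidate
    n = len(chains)
    return {"hub_30": top >= 0.3 * n, "hub_80": top >= 0.8 * n}
```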

The Self-loop on Final Page feature is true when any of the redirection chains in Crg has a self-loop for the final page. A self-loop is a redirection that forwards a user to the same page from where she started. This behavior is often present in benign redirection graphs, where users are redirected to the same page but with different parameters. On the other hand, malicious redirection graphs typically send the victim from a compromised (legitimate) page to a page that has been prepared by the attacker. This is the page that serves the actual exploit, and there is no self-loop present.

Building and using a classifier. We use the features described previously (and a labeled training set) to build a classifier. This classifier takes a redirection graph as input. If the graph does not contain at least tc distinct redirection chains, we discard it as benign (because of the complexity filter). If any of its domains has been visited by more than tp distinct IP addresses, we also mark the graph as benign (because of the popularity filter). Otherwise, we compute values for our features and submit the resulting feature vector to a classifier based on Support Vector Machines (SVM) trained with Sequential Minimal Optimization (SMO) [17]. We used the implementation provided by the Weka machine-learning tool to this end [6].1 When the SVM determines that the redirection graph should be considered malicious, SpiderWeb marks all final page(s) that belong to this graph as malicious.
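Footnote 1 describes how mixed-type features are encoded before being handed to the SVM: strings become integer indices and Booleans become 0/1. A minimal sketch of that conversion (the dict-based feature layout is our assumption):

```python
def vectorize(feature_dicts):
    """Encode mixed-type feature dicts as numeric vectors, following the
    paper's footnote: strings -> integer indices, Booleans -> 0/1.

    feature_dicts : list of {feature_name: value} with identical keys.
    """
    keys = sorted(feature_dicts[0])
    string_index = {}                      # one index table per feature

    def encode(key, value):
        if isinstance(value, bool):        # Booleans become 0.0 / 1.0
            return float(value)
        if isinstance(value, str):         # strings become running indices
            table = string_index.setdefault(key, {})
            return float(table.setdefault(value, len(table)))
        return float(value)                # numeric features pass through

    return [[encode(k, d[k]) for k in keys] for d in feature_dicts]
```

The resulting vectors can then be fed to any SVM implementation; the paper uses Weka's SMO trainer.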

4. EVALUATION

In this section, we describe how we evaluated the ability of SpiderWeb to detect malicious web pages. For our evaluation, we leveraged a dataset S that contains redirection chains collected over a period of two months, from February 1, 2012 to March 31, 2012, from the customers of a browser security tool developed by a large AV vendor.

1 We deal with features of different types as follows: we convert strings to integers by substituting each string with an index, and we treat Boolean values as 0 and 1.

The users that generated this traffic are representative of the users of this security product: they are distributed across the globe, and they use a variety of different software configurations. For each HTTP redirection chain, the company logged a time stamp, the obfuscated IP address and country of origin of the client, the initial URL that is requested, the referrer of this request, the final destination URL, all the steps along the chain, and the operating system and browser of the user. In addition, some records are labeled as malicious or benign, a verdict made by the AV tool running on the client. This was done by applying heuristics to identify malicious content in the final pages. If the final page triggered a heuristic, the record is flagged as malicious. On the other hand, the AV vendor employs a whitelist of known, good websites to flag records as benign. If a page is detected as neither bad nor good, the record remains unlabeled.

The entire dataset S was composed of 388,098 redirection chains, which led to 34,011 distinct URLs. To estimate the total number of users in the dataset, we calculated the average number of distinct (obfuscated) IP addresses observed during periods of one week (over all weeks in our dataset). Although this does not give us a precise number of users, it gives us an idea of their magnitude. The average number of distinct IP addresses per week is 13,780. The users that generated the dataset were located in 145 countries, and therefore the produced dataset is representative of the surfing behavior of different people across the globe.

4.1 Building a Labeled Dataset

To train the classifier needed by SpiderWeb to detect malicious web pages, we need a labeled dataset of benign and malicious redirection graphs. To this end, we look at redirection graphs built from the data collected during the first month by the AV vendor (February 2012). As a preliminary filter, we set tc to 2, which means that we considered only redirection graphs that were composed of at least two distinct redirection chains.

For labeling purposes, if the AV vendor flagged the domain of Fin as malicious, we manually assessed the maliciousness of the URL. For example, we considered domains that no longer existed at the time of our experiment as clearly malicious. Also, we checked the verdict given by the AV vendor for all the redirection chains leading to the same URL. If not all the verdicts agreed in considering the page as malicious, we did not consider that redirection graph for our ground truth. The detection techniques used by the AV vendor are conservative. That is, when they flag a page as malicious, it is very likely that the page is indeed bad. The opposite, however, is not true and, as we will show later, they miss a non-trivial fraction of malicious pages. In total, we picked 2,533 redirection chains flagged as malicious by the AV vendor, leading to 1,854 unique URLs.

We then selected, among the redirection chains in S whose domain was flagged explicitly as benign by the AV vendor, those redirection chains relative to domains that we consider reputable and legitimate. In particular, we picked both popular websites, such as Google and Facebook, as well as less popular sites that appeared legitimate after careful manual inspection (e.g., local news sites). In the end, we picked 2,466 redirection chains, redirecting to 510 legitimate URLs. More precisely, the final pages of the legitimate redirection chains include all of the top 10 Alexa domains [1], and 48 of the top 100. In the remainder of the paper, we refer to the ground truth set, composed of the manually labeled malicious and legitimate redirection chains, as Tr.

4.2 Selecting Thresholds

As discussed in Section 3, SpiderWeb relies on two thresholds: a complexity threshold tc, to discard any redirection graph that is not complex enough to guarantee an accurate classification, and a popularity threshold tp, to discard those groupings that are too general. In this section, we discuss how we selected the optimal values for these thresholds.

Calculation of the complexity threshold. We needed to understand what "complexity" allows for a reliable classification of a redirection graph. By complexity, we mean the number of distinct redirection chains in Crg. Ideally, having graphs of a certain complexity would allow SpiderWeb to detect all malicious redirection graphs with no false positives.

We set up our experiment as follows. First, we built the Domain + page + parameters groupings from Tr. Then, for each threshold value i between 2 and 10, we removed any redirection graph that had complexity lower than i. We call the remaining set of groupings Gri. We then performed a ten-fold cross-validation on each of the Gri sets, by using Support Vector Machines with SMO, and analyzed the performance of the classifier for each threshold value. As we increase the threshold, the number of false positives decreases. With a threshold of 6, the number of false positives flagged by SpiderWeb is zero. At this threshold, the number of false negatives also reaches zero. This gives us confidence that, by being able to observe enough people surfing the web, SpiderWeb could reliably detect any malicious redirection graph.

Unfortunately, as the threshold value increases, the number of possible detections decreases. This happens because the dataset S is limited, and the number of graphs that have high complexity is low. For example, with a threshold tc of 6, Gr6 is composed of only 131 redirection graphs, of which only 3 are malicious. Thus, to detect a reasonable number of malicious web pages in our dataset, we have to use a lower tc and accept a certain rate of false positives. We decided to set tc to 4. At this threshold, the number of redirection graphs in Gr4 is 196, of which 25 are malicious. A 10-fold cross-validation on Gr4 results in a 1.2% false positive rate, and in a true positive rate of 83%.

Calculation of the popularity threshold. To derive a suitable value for tp, we analyzed the Top-Level Domain + page groupings calculated from Tr. We then ran a precision versus recall analysis. The reason we did not use the same grouping as for the evaluation of tc is that, as we said previously, tp aims at detecting those groupings that are too general for our approach to perform an accurate detection. For this reason, we used Top-Level Domain + page groupings, which are the most general among the ones used by SpiderWeb. For each value of tp, we discarded any grouping that had at least one domain that had been visited by more than tp distinct IP addresses. For the remaining groupings, we analyzed the URLs contained in them. If all the domains that the URLs belonged to ever got marked as malicious by the security company, we flagged the grouping as a true positive. Otherwise, we flagged the grouping as too general (i.e., a false positive). Note that, in this experiment, a single domain flagged as benign by the AV vendor is sufficient to flag the whole grouping as too general.

[Figure 1: Precision vs. Recall analysis of the popularity threshold value.]

Figure 1 shows the outcome of the precision versus recall analysis. In the case of our dataset, the optimal value for tp turned out to be 8. Note that this analysis is specific to the dataset S. Should SpiderWeb be run on a different dataset, an additional analysis of tc should be performed, and the optimal value of the threshold might be different.
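The complexity-threshold sweep described in this section can be sketched as follows. The graph representation and the evaluate callback (standing in for the 10-fold cross-validation with SMO) are our assumptions:

```python
def sweep_complexity_threshold(graphs, evaluate, thresholds=range(2, 11)):
    """Sweep the complexity threshold t_c over the candidate values 2..10.

    graphs   : list of records whose first element is the graph's
               complexity, i.e. its number of distinct chains (assumed).
    evaluate : callback that runs the 10-fold cross-validation on the
               filtered set and returns (false_pos_rate, true_pos_rate).
    Returns {t_c: (number_of_kept_graphs, fp_rate, tp_rate)}.
    """
    results = {}
    for t in thresholds:
        kept = [g for g in graphs if g[0] >= t]   # drop low-complexity graphs
        fp, tp = evaluate(kept)
        results[t] = (len(kept), fp, tp)
    return results
```

The trade-off described above then amounts to reading this table: higher t_c lowers the error rates but shrinks the number of graphs left to classify.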

4.3 Analysis of the Classifier

Given the labeled ground truth set Tr and suitable threshold values for tc and tp, we wanted to understand how well our classifier works in finding malicious pages. We performed this analysis for different choices of R (which determines how redirection chains are grouped into graphs). More precisely, using the dataset Tr, we first generated the redirection graphs leading to the Top-Level Domain + Page, the Top-Level Domain + Page + Parameters, the Length + Top-Level Domain + Page + Parameters, and the Domain + Page + Parameters groupings, respectively.2 We then removed any redirection graph that exceeded the popularity threshold tp, or that did not have a complexity of at least tc. Each set of redirection graphs (one set of graphs for each choice of R) was used in a 10-fold cross-validation experiment. The results of these experiments are shown in Table 3. As can be seen, the false positive rate for the different groupings is generally low. On the other hand, the number of false negatives could appear high. However, as we discussed previously, being able to observe a higher number of connections would allow SpiderWeb to use a higher value for tc, and lower both false positives and false negatives. Table 3 also shows how our system compares to two classifiers based on previous work (SURF and WarningBird). We describe these experiments in Section 4.5.

We then analyzed the expressiveness of our features. From our experiments, the most effective features that characterize a malicious redirection graph are, in order, the "Referrer Country Ratio," the "Graph has Hub 80%," and the "Final Page Domain is an IP" features. On the other hand, the most expressive features that characterize benign redirection graphs are the "Landing Page Country Ratio," the "Intra-domain Step," and the "Top-Level Domain" features.

2 Note that we did not use the IP address grouping. The reason is that when we obtained the dataset, it was already a few weeks old. Unfortunately, many of the malicious domains no longer resolved. In those cases, it was impossible for us to determine the IP addresses that corresponded to malicious domains.

System                                   # of chains   # of URLs   False Positive rate   False Negative rate   F-Measure
TLD + Page                               112           1,945       2.5%                  16.1%                 0.881
TLD + Page + Parameters                  112           1,642       0%                    21.4%                 0.88
Length + TLD + Page + Parameters         139           1,233       3.4%                  20.4%                 0.857
Domain + Page + Parameters               196           815         1.2%                  17%                   0.863
No graph-specific features               196           815         1.2%                  67%                   0.43
No client-specific features              196           815         1.2%                  21%                   0.844
No client and graph-specific features    196           815         1.5%                  84.78%                0.113
SURF-based classifier                    196           815         0%                    81%                   0.319
WarningBird-based classifier             196           815         2.9%                  62%                   0.492

Table 3: Summary of the classification performance for different groupings and systems.

Interestingly, features that would intuitively make sense turned out not to be the most discriminating. For example, it is commonly believed that a long redirection chain is indicative of a malicious web page [10,13]. As it turns out, the redirection chain length is not among the most important features: having a short redirection chain is only the 8th most characterizing feature for a malicious web page, while having a long redirection chain is the 9th most characterizing feature for a benign chain.

As an additional experiment, we also wanted to evaluate how much the features that can only be collected by a crowd of users, and not by a centralized collector, contribute to the detection of malicious web pages. First, we disabled the twelve graph-specific features in the classifier, and we performed another 10-fold cross-validation using this feature set. We used the Domain + Page + Parameters groupings for this experiment. Surprisingly, the ten-fold cross-validation returned the same false positive rate as the full classifier (1.2%). However, the true positive rate dropped to only 33%, compared to the 83% of the full classifier. Then, we ran the 10-fold cross-validation with the three client-specific features disabled. Again, we obtained the same false positive rate as for the full classifier, but the true positive rate dropped to 79%. By disabling both the graph-specific and the client-specific features, the true positive rate drops to 15.2%. This means that the features that can only be collected by observing a crowd of users are important to achieve a successful classification. Comparative results with the full classifier are shown in Table 3. These results give us confidence that SpiderWeb indeed improves the state of the art of malicious web page detection, allowing us to perform detection based on features that are specific to the malicious web page ecosystem's structure, rather than on the web page content. Since the redirection-specific features are harder to obfuscate than the content-specific ones, we believe that our technique can help in performing more reliable malicious web page detection. To underline this, we performed an experiment that shows that SpiderWeb outperforms previous systems (assuming that attackers perfectly succeed in evading the context-specific features). We discuss this experiment in Section 4.5.

4.4 Detection Results on Entire Dataset

In the next experiment, we evaluated SpiderWeb on the redirection graphs built from the entire dataset S collected by the security company. After checking for complexity and popularity, SpiderWeb generated 3,549 redirection graphs from this dataset, obtained with the Top-Level Domain + page (700 graphs), the Top-Level Domain + page + parameters (825 graphs), the Domain Length + Top-Level Domain + page + parameters (957 graphs), and the Domain + page + parameters (1,067 graphs) groupings.

Of the 3,549 redirection graphs built from S (using the four groupings), 562 graphs were flagged as malicious by our detection algorithm. In total, these graphs led to 3,368 URLs. We manually identified seven large-scale campaigns, composed of URLs that were sharing common traits.

To check how accurate the results produced by SpiderWeb are, we compared our detections with the ones performed by the AV vendor on the same dataset. To this end, we built a set Fina of analyzed URLs. For each redirection graph in S, we added the final URLs in Finrg to Fina. In total, the AV vendor flagged 3,156 URLs in Fina as malicious. For 2,590 URLs, SpiderWeb and the defenses deployed by the AV vendor agreed. We then manually analyzed the remaining 556 URLs that the AV vendor detected, but SpiderWeb did not. 59 of them turned out to be legitimate web sites that hosted malicious JavaScript, probably because of a cross-site scripting (XSS) vulnerability. SpiderWeb was not designed to detect this kind of web page, whose redirection graphs, as it turns out, do not show the characteristics that are typical of malicious redirection graphs. The redirection graphs leading to the remaining 497 URLs had low complexity (on average, they were accessed by five distinct redirection chains). As we showed in Section 4.2, SpiderWeb needs to observe six or more distinct redirection chains to form a graph that leads to no false positives or false negatives. We are confident that, if more data were available to us, SpiderWeb would have been able to also detect these URLs as malicious.

On the other hand, SpiderWeb detected 778 URLs as malicious that went undetected by the AV vendor. We manually analyzed them to assess false positives. The vast majority of URLs returned non-existing domain errors when we tried to resolve them. Others showed clear patterns in their URLs typical of exploit kits. For 51 (out of 778) URLs, we could not confirm the maliciousness of the page. We consider all of them as false positives of SpiderWeb. This accounts for 1.5% of the total.

Even if our system has false positives, we think that they are few enough that one could whitelist such domains and use SpiderWeb to detect and blacklist previously-unknown malicious web pages. In addition, SpiderWeb is able to perform more comprehensive detections than existing techniques. Interestingly, in several cases, the software deployed by the AV vendor identified only a few of the malicious web pages that belong to a certain campaign, because only those pages triggered its signature-based detection. The detection of the other pages was successfully evaded by the attackers. On the other hand, once SpiderWeb detects a redirection graph as malicious, it flags each URL in Finrg as malicious. Because of the nature of the redirection graphs, and their importance for the attacker, evading detection by SpiderWeb is more difficult.

4.5 Comparison with Other Systems

To the best of our knowledge, SpiderWeb is the first system that is able to detect whether a web page is malicious or not just by looking at the redirection chains that lead to it, and at the characteristics of the clients that access it.

However, there are a few recent systems that do make use of redirection information to detect malicious web pages. Unlike SpiderWeb, these systems also leverage "context" (domain-specific) information associated with the redirection chain itself, such as information about the page on which links are posted. In this section, we show that previous systems depend on such contextual information, and that their detection capability suffers significantly if this information is not present. The first problem with these approaches is that attackers can easily modify such context information, for example by changing the way in which they disseminate malicious links. A second problem is that the context information leveraged by previous systems is only available in specific situations, such as a URL posted on Twitter or an SEO campaign. SpiderWeb, on the other hand, can operate regardless of where the initial URL has been posted. To show this, we compared how SpiderWeb performs against two classifiers based on SURF [13] and WarningBird [10].

WarningBird is a system that aims at detecting malicious web pages, such as spam and malware pages, whose links have been posted on Twitter. It takes advantage of redirection graphs, as well as information about the accounts that posted the URLs on Twitter. SURF, on the other hand, is a system that analyzes redirection chains to detect malicious pages that use Search Engine Optimization (SEO) to attract victims. To aid its detection, SURF looks at the search words that attackers poisoned to make their malicious pages appear as results on search engines. Although both systems, to some extent, leverage redirection graphs (SURF visits links twice to evaluate the consistency of a redirection, and WarningBird has a notion of hubs), the analysis performed by SpiderWeb is more comprehensive, and allows one to make decisions just by looking at the graph. Both SURF and WarningBird work very well in detecting malicious web pages that use redirection chains, in the contexts they were designed to operate in. For instance, SURF is very effective in detecting malicious web pages advertised through SEO, while WarningBird is effective in detecting malicious URLs on Twitter. However, attackers can evade detection by these systems by modifying the way in which they post their links, or they can leverage different methods to spread their malicious content (other than Twitter and SEO). We show that, in the presence of an attacker evading the context information leveraged by previous systems, SpiderWeb performs better detection.

We developed two classifiers based on the redirection-chain features of WarningBird [10] and SURF [13]. In the case of SURF, we used the Total redirection hops, the Cross-site redirection hops, the Landing to terminal distance, and the IP-to-name ratio. In particular, we re-implemented the Landing to terminal distance as follows: for each redirection chain, we considered the country of the landing and the terminal domain. If the countries were the same, we set this feature to 1. Otherwise, we set it to 0. Note that we could not re-implement the Redirection consistency and the Page load/render errors features, since this information was not available in S. In fairness, this could have impacted the performance of our SURF-based classifier in our experiment, since the authors mention that these are among the most discriminative features in their classifier.
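As an illustration, the per-chain SURF-derived features described above could be computed along these lines. This is a minimal sketch under stated assumptions: the function name is ours, `country_of` is a hypothetical stand-in for a GeoIP lookup, and treating the chain's first URL as the landing page and its last as the terminal page is our reading, not SURF's actual code.

```python
from urllib.parse import urlparse


def surf_features(chain, country_of):
    """Compute the four SURF-style features used in our experiment for one
    redirection chain (a list of URLs, first = landing, last = terminal).
    `country_of` maps a hostname to its hosting country (GeoIP stand-in)."""
    hosts = [urlparse(u).hostname or "" for u in chain]
    total_hops = len(chain) - 1
    # Cross-site redirection hops: consecutive URLs whose hostnames differ.
    cross_site = sum(1 for a, b in zip(hosts, hosts[1:]) if a != b)
    # IP-to-name ratio: fraction of hosts that are raw IP addresses.
    is_ip = lambda h: h.replace(".", "").isdigit()
    ip_ratio = sum(1 for h in hosts if is_ip(h)) / len(hosts)
    # Binary landing-to-terminal distance: same hosting country or not.
    same_country = 1 if country_of(hosts[0]) == country_of(hosts[-1]) else 0
    return total_hops, cross_site, ip_ratio, same_country
```

For example, a chain that lands on a US-hosted page but terminates at a foreign IP-addressed host would yield a high IP-to-name ratio and a country feature of 0.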

In the case of WarningBird, we used the Redirection chain length, the Frequency of entry point URL, the Number of different initial URLs, and the Number of different landing URLs features. Note that, in WarningBird's terminology, an entry point is what we call a hub. Also, notice that we did not use the Position of entry point URL feature. The reason is that WarningBird had a uniform set of referrers (twitter.com pages), while we do not. For this reason, it was impossible for us to calculate this feature.
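The WarningBird-derived features above are properties of a whole redirection graph rather than a single chain. A hedged sketch of how they could be computed over the set of chains forming one graph follows; the function name is ours, and aggregating chain length as an average is our assumption (WarningBird's exact aggregation may differ).

```python
def warningbird_features(chains, hub):
    """Compute the four WarningBird-style features used in our experiment
    over the redirection chains (lists of URLs) that form one graph.
    `hub` is the entry-point URL (what we call a hub)."""
    # Redirection chain length, aggregated here as the mean over chains.
    avg_chain_length = sum(len(c) for c in chains) / len(chains)
    # Frequency of entry point URL: how many chains pass through the hub.
    hub_frequency = sum(1 for c in chains if hub in c)
    n_initial = len({c[0] for c in chains})   # distinct initial URLs
    n_landing = len({c[-1] for c in chains})  # distinct landing URLs
    return avg_chain_length, hub_frequency, n_initial, n_landing
```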

We then applied ten-fold cross validation using SMO for the two classifiers, using Domain + Page + Parameter groupings built from Tr as a training dataset. The results are reported in Table 3. As can be seen, SpiderWeb has by far better recall than both other classifiers, and precision comparable to the SURF-based classifier, which is the system that generated the fewest false positives. To be fair, both SURF and WarningBird achieve better detection rates by leveraging contextual information, which our system does not consider by design. Of course, our system could include contextual information to enhance its detection capabilities, but this is out of the scope of this paper.
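The evaluation procedure above can be sketched in two pieces: the ten-fold partitioning of the training data, and the precision/recall metrics reported in Table 3. The SMO classifier itself (Platt's sequential minimal optimization for SVMs [17], as implemented in Weka [6]) is out of scope here; the helper names are ours.

```python
def ten_fold_splits(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) index lists for k-fold cross validation:
    each sample appears in exactly one test fold, and the classifier would
    be trained on the remaining nine folds each round."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test


def precision_recall(tp, fp, fn):
    """Precision and recall from confusion counts, the metrics compared
    across the three classifiers in Table 3."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```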

Interestingly, the SURF-based and WarningBird-based classifiers detect redirection graphs as malicious that SpiderWeb misses, and vice versa. In particular, SpiderWeb fails to detect five redirection graphs in Tr4 as malicious. However, the SURF-based classifier can detect three of those as malicious, and the WarningBird-based classifier can detect the other two. This means that, by using the three systems together, one could reach complete recall.

4.6 Possible Use Cases for SpiderWeb

We envision two possible deployments for our tool: as an offline detection system, and as an online protection system.

Offline detection. SpiderWeb can be used offline, to analyze a dataset of historic redirection chains (similar to the one we used for our evaluation). This allows the system to identify a set of malicious URLs. The knowledge of these URLs is useful in several scenarios. For example, it can help to reveal blind spots in alternative detection tools.

Online prevention. The problem with the offline approach is that it introduces a possibly significant delay between the appearance of a malicious page and its detection. A better way is to deploy SpiderWeb online, so that it receives a real-time feed of redirection chains from users. Note that the AV vendor already receives such a feed from its users; hence, deployment would be relatively straightforward. Whenever a new redirection chain is received, SpiderWeb adds it to the current set of redirection graphs. Once a graph has reached a sufficient complexity, the classifier is executed. When a page is detected as malicious, this information can be immediately pushed out to the clients, blocking any subsequent access to this page.
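The online-prevention loop described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: chains are grouped here by their final URL (one of several possible groupings), the complexity threshold of 6 mirrors Section 4.2, and `classify` is a placeholder that flags every sufficiently complex graph so the blocking path is visible.

```python
from collections import defaultdict

MIN_COMPLEXITY = 6          # distinct chains needed before classifying
graphs = defaultdict(set)   # grouping -> set of distinct chains seen so far
blocklist = set()           # groupings already flagged and pushed to clients


def classify(chains):
    """Placeholder for the SpiderWeb redirection-graph classifier; here it
    flags every graph it is asked about, purely for illustration."""
    return True


def on_chain_received(chain):
    """Handle one redirection chain from the real-time user feed."""
    grouping = chain[-1]                 # group by the chain's final page
    if grouping in blocklist:
        return "blocked"                 # access denied before the page loads
    graphs[grouping].add(tuple(chain))
    if len(graphs[grouping]) >= MIN_COMPLEXITY and classify(graphs[grouping]):
        blocklist.add(grouping)          # push the verdict out to all clients
    return "allowed"
```

Note the window this design implies: the first visitors of a page are allowed through until the graph crosses the complexity threshold, which motivates the experiment below.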

Of course, this approach has the limitation that users are potentially exploited until the system observes enough distinct redirection chains to identify a malicious page. To evaluate the extent of this problem, we set up the following experiment: First, we took the 562 redirection graphs in our dataset that SpiderWeb detected as malicious (as discussed in Section 4.4). We then looked at each redirection chain in S, in chronological order (based on the time stamp when the redirection chain was entered by the user). The goal was to simulate an online deployment, checking how many users would get compromised before SpiderWeb recognized and blocked a malicious page. We proceeded as follows:

Step 1: We keep a list of occurrences O. Each element in O is a tuple <Ri, Ci>, where Ri is a grouping that identifies a malicious redirection graph, and Ci is a set that contains the distinct redirection chains that previously led to URLs in Ri. At the beginning, we have an entry in O for each of the 562 malicious redirection graphs. The set Ci for each of these graphs is empty. We also keep a value uc of users who have possibly been compromised, and a value up of users who have been protected by SpiderWeb. At the beginning, both these values are set to zero.

Step 2: For each redirection graph G in S, we check its final page Fin against each grouping Ri in O. If G matches Ri, we add the redirection chain C to Ci.

Step 3: If the size of Ci is lower than 6, it means that the user visiting the page could have potentially been compromised3. Therefore, we increase uc by one. Otherwise, we consider the user as safe, because SpiderWeb would have already blocked the page, and we increase up by one.
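The three steps above can be expressed compactly in code. This is a hedged sketch under our own naming: `stream` stands in for the chronological feed of (grouping, chain) pairs derived from S, `malicious_groupings` for the 562 detected graphs, and the threshold of 6 mirrors Section 4.2.

```python
THRESHOLD = 6


def simulate(stream, malicious_groupings):
    """Return (uc, up): visits that could have compromised a user vs.
    visits that SpiderWeb would already have blocked."""
    # Step 1: one entry per malicious grouping Ri, each with an initially
    # empty set Ci of distinct chains seen so far; counters start at zero.
    occurrences = {ri: set() for ri in malicious_groupings}
    uc = up = 0
    for grouping, chain in stream:  # chains arrive in chronological order
        # Step 2: match the chain's final page against the groupings in O.
        if grouping not in occurrences:
            continue
        occurrences[grouping].add(tuple(chain))
        # Step 3: fewer than 6 distinct chains observed means the visitor
        # could have been compromised; afterwards the page is blocked.
        if len(occurrences[grouping]) < THRESHOLD:
            uc += 1
        else:
            up += 1
    return uc, up
```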

Running this simulation, we find that the number of compromised users uc is 723, while the number of protected users up is 10,191. This means that out of 10,914 visits to malicious web pages, SpiderWeb would have successfully blocked 93% of them. Note that this is a lower bound, because in many cases fewer than 6 distinct redirection chains are enough to make a successful detection. Clearly, SpiderWeb cannot provide perfect protection, but its coverage compares favorably to existing blacklists [5]. Also, after identifying a page as malicious, the system knows which users have been potentially infected. These observations could be used to inform the host-based components (e.g., anti-virus scanners) that are already running on those machines.

4.7 Limitations and Evasion

Although SpiderWeb is a useful tool for detecting malicious web pages, it has some limitations. The first limitation is that SpiderWeb requires redirection graphs of a certain complexity to perform an accurate classification. We acknowledge that this might be an issue in some cases. The second limitation is that the groupings currently used by SpiderWeb might miss some advanced ways of generating URLs, such as creating a different domain, as well as a different web page name, for each request. As a result, SpiderWeb might fail to group such URLs together. Although this is a problem, as we discussed in Section 3.2, we picked such groupings because they were the only ones available to us, given the dataset S. In a real scenario, other groupings, such as the IP address at which a domain is hosted, could be used. A grouping by IP addresses would be non-trivial to evade. A third limitation is that, in principle, the attacker might redirect a victim to a popular, legitimate page after having infected her. By doing this, this particular redirection chain would be merged with the others going to the popular site, and SpiderWeb would likely not detect the threat. This is a potential concern, although we did not observe such a practice in the wild. A solution to this problem is that SpiderWeb could be modified to build redirection graphs ending at pages other than the final page.

3 As shown in Section 4.2, with redirection graphs of complexity 6 or higher, SpiderWeb reaches complete coverage.

Attackers could also evade detection by making their redirection graphs look similar to legitimate ones. However, if miscreants successfully evaded detection by our system, this would make them easier to detect by previous work. For example, attackers could stop using cloaking. By doing this, their websites would look like regular web pages, accessed by a diverse population of browsers and plugins. However, this evasion would leak their malicious JavaScript in all cases, making it easier for researchers to analyze and block the malicious code. As another example, attackers could stop using hubs. By doing this, they would need to redirect their victims to multiple servers under their control (or compromised servers). The effort of setting up this type of infrastructure raises the bar for attackers. As an alternative, miscreants could redirect their users to the final page directly. However, this would make their redirection chains less complicated, and easier for previous work to flag.

5. RELATED WORK

Detecting malicious web pages is an important problem, and a wealth of research has been conducted on this topic. Typically, proposed systems aim at detecting a particular type of "malicious" page (e.g., spam or malware). Existing approaches fall into three main categories: honeypots, intra-page feature-based, and inter-page feature-based detection systems.

Honeypots. Systems based on honeypots typically visit web pages with a virtual machine, and monitor the system for changes that are typical of a compromise (e.g., a registry change). HoneyC, Capture-HPC, and PhoneyC are examples of such systems [15, 20, 21]. Wang et al. presented a system that runs virtual machines with multiple combinations of browsers and plugins [28]. Although this approach has a higher chance of detecting malicious web pages than the ones using a single browser version, it does not scale, as the number of configurations is constantly increasing.

Intra-page features. The second category of systems looks at features that are typical of malicious web pages in order to detect them. Some systems leverage the observation that the URLs used by certain cyber-criminals to point to malicious web pages have recognizable patterns [14, 31]. Other systems analyze the actual web page content, looking for words that are indicative of malicious content (e.g., typical of spam or SEO pages) [16, 25, 26]. Another approach is to analyze the JavaScript code in the web page, looking for features that are typical of malicious code. This can be done statically [3], or dynamically, by either visiting the page with an emulated browser [2] or a real one [19]. In general, a handful of commercial products leverage data collected in a distributed fashion to detect malicious web pages. However, such products look at the reputation of the pages, or at their content. Systems based on intra-page features are prone to evasion by obfuscation. For example, attackers might obfuscate their malicious JavaScript, making it impossible for an analyzer to detect it.

Inter-page features. The third category of systems looks at features of the different pages that are involved in the malicious process, rather than just of the final malicious page. Zhang et al. developed a system to generate signatures that identify URLs that deliver malware, by looking at the redirection chains followed by honeypots as they are directed to the malicious domains [32]. Stokes et al. proposed WebCop, a system that, starting from known malicious domains, traverses the web graph to find malware landing pages [22].

Similarly to SpiderWeb, other systems leverage HTTP redirection chains to detect malicious domains. SURF is designed to detect malicious SEO sites [13], while WarningBird detects malicious web pages posted on Twitter [10]. However, both systems use contextual information to make their detection accurate. In particular, SURF looks at the search engine results that attackers poison, while WarningBird looks at the characteristics of the social network account that posted the link. While they make detection easier, these features are relatively easy for cyber-criminals to evade. MadTracer [11] is a system that aims at detecting malicious web advertisements by looking at the redirection chains that users follow. Unlike SpiderWeb, this system uses contextual information about the web advertisement process. Since SpiderWeb does not use any contextual information about the purpose of malicious web pages, it is more generic, and it can detect multiple types of malicious campaigns. To the best of our knowledge, we are the first to develop a system that is able to make a decision about the maliciousness of a domain by looking at HTTP redirections only.

6. CONCLUSIONS

We presented SpiderWeb, a system that is able to detect malicious web pages by looking at the redirection chains that lead to them. We showed that SpiderWeb can detect malicious web pages in a reliable way. We believe that SpiderWeb nicely complements previous work on detecting malicious web pages, since it leverages features that are harder to evade.

7. ACKNOWLEDGMENTS

This work was supported by the Office of Naval Research (ONR) under grant N000140911042, the Army Research Office (ARO) under grant W911NF0910553, the National Science Foundation (NSF) under grants CNS-0845559 and CNS-0905537, and Secure Business Austria. We would like to thank TrendMicro for their support in the project, as well as the anonymous reviewers for their insightful comments.

8. REFERENCES

[1] Alexa, the Web Information Company. http://www.alexa.com.
[2] M. Cova, C. Kruegel, and G. Vigna. Detection and Analysis of Drive-by-Download Attacks and Malicious JavaScript Code. In International Conference on World Wide Web, 2010.
[3] C. Curtsinger, B. Livshits, B. Zorn, and C. Seifert. Zozzle: Low-overhead Mostly Static JavaScript Malware Detection. In USENIX Security Symposium, 2011.
[4] A. Doupe, B. Boe, C. Kruegel, and G. Vigna. Fear the EAR: Discovering and Mitigating Execution After Redirect Vulnerabilities. In ACM Conference on Computer and Communications Security (CCS), 2011.
[5] C. Grier, L. Ballard, J. Caballero, N. Chachra, C. J. Dietrich, K. Levchenko, P. Mavrommatis, D. McCoy, A. Nappa, A. Pitsillidis, et al. Manufacturing Compromise: The Emergence of Exploit-as-a-Service. In ACM Conference on Computer and Communications Security (CCS), 2012.
[6] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA Data Mining Software: An Update. In SIGKDD Explorations, 2009.
[7] McAfee Inc. Mapping the Mal Web. Technical report, 2010.
[8] A. Kapravelos, Y. Shoshitaishvili, M. Cova, C. Kruegel, and G. Vigna. Revolver: An Automated Approach to the Detection of Evasive Web-based Malware. In USENIX Security Symposium, 2013.
[9] C. Kolbitsch, B. Livshits, B. Zorn, and C. Seifert. Rozzle: De-Cloaking Internet Malware. In IEEE Symposium on Security and Privacy, 2012.
[10] S. Lee and J. Kim. WarningBird: Detecting Suspicious URLs in Twitter Stream. In Symposium on Network and Distributed System Security (NDSS), 2012.
[11] Z. Li, K. Zhang, Y. Xie, F. Yu, and X. F. Wang. Knowing Your Enemy: Understanding and Detecting Malicious Web Advertising. In ACM Conference on Computer and Communications Security (CCS), 2012.
[12] H. Liu, K. Levchenko, M. Felegyhazi, C. Kreibich, G. Maier, G. M. Voelker, and S. Savage. On the Effects of Registrar-level Intervention. In USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), 2011.
[13] L. Lu, R. Perdisci, and W. Lee. SURF: Detecting and Measuring Search Poisoning. In ACM Conference on Computer and Communications Security (CCS), 2011.
[14] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[15] J. Nazario. PhoneyC: a Virtual Client Honeypot. In USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), 2009.
[16] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting Spam Web Pages Through Content Analysis. In International Conference on World Wide Web, 2006.
[17] J. Platt et al. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Technical report, 1998.
[18] N. Provos, P. Mavrommatis, M. A. Rajab, and F. Monrose. All Your iFrames Point to Us. In USENIX Security Symposium, 2008.
[19] P. Ratanaworabhan, B. Livshits, and B. Zorn. Nozzle: A Defense Against Heap-spraying Code Injection Attacks. In USENIX Security Symposium, 2009.
[20] C. Seifert and R. Steenson. Capture-HPC Client Honeypot, 2006.
[21] C. Seifert, I. Welch, P. Komisarczuk, et al. HoneyC: the Low-interaction Client Honeypot. In Proceedings of the 2007 NZCSRCS, 2007.
[22] J. W. Stokes, R. Andersen, C. Seifert, and K. Chellapilla. WebCop: Locating Neighborhoods of Malware on the Web. In USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), 2010.
[23] B. Stone-Gross, R. Abman, R. Kemmerer, C. Kruegel, D. Steigerwald, and G. Vigna. The Underground Economy of Fake Antivirus Software. In Workshop on the Economics of Information Security (WEIS), 2011.
[24] B. Stone-Gross, R. Stevens, A. Zarras, R. Kemmerer, C. Kruegel, and G. Vigna. Understanding Fraudulent Activities in Online Ad Exchanges. In ACM SIGCOMM Conference on Internet Measurement, 2011.
[25] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Design and Evaluation of a Real-time URL Spam Filtering Service. In IEEE Symposium on Security and Privacy, 2011.
[26] T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne. Tracking Web Spam with HTML Style Similarities. ACM Transactions on the Web (TWEB), 2008.
[27] D. Y. Wang, S. Savage, and G. M. Voelker. Cloak and Dagger: Dynamics of Web Search Cloaking. In ACM Conference on Computer and Communications Security (CCS), 2011.
[28] Y.-M. Wang, D. Beck, X. Jiang, R. Roussev, C. Verbowski, S. Chen, and S. King. Automated Web Patrol with Strider HoneyMonkeys. In Symposium on Network and Distributed System Security (NDSS), 2006.
[29] Y.-M. Wang and M. Ma. Detecting Stealth Web Pages That Use Click-Through Cloaking. Technical Report MSR-TR-2006-178, Microsoft Research, 2006.
[30] B. Wu and B. D. Davison. Detecting Semantic Cloaking on the Web. In International Conference on World Wide Web, 2006.
[31] Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten, and I. Osipkov. Spamming Botnets: Signatures and Characteristics. In ACM SIGCOMM Computer Communication Review, 2008.
[32] J. Zhang, C. Seifert, J. W. Stokes, and W. Lee. ARROW: Generating Signatures to Detect Drive-By Downloads. In International Conference on World Wide Web, 2011.