A Trend Micro Research Paper

Targeted Attacks Detection with SPuNge

Dr. Marco Balduzzi, Vincenzo Ciangaglini, and Robert McArdle (Trend Micro Forward-Looking Threat Research Team)

TREND MICRO LEGAL DISCLAIMER

The information provided herein is for general information and educational purposes only. It is not intended and should not be construed to constitute legal advice. The information contained herein may not be applicable to all situations and may not reflect the most current situation. Nothing contained herein should be relied on or acted upon without the benefit of legal advice based on the particular facts and circumstances presented and nothing herein should be construed otherwise. Trend Micro reserves the right to modify the contents of this document at any time without prior notice.

Translations of any material into other languages are intended solely as a convenience. Translation accuracy is not guaranteed nor implied. If any questions arise related to the accuracy of a translation, please refer to the original language official version of the document. Any discrepancies or differences created in the translation are not binding and have no legal effect for compliance or enforcement purposes.

Although Trend Micro uses reasonable efforts to include accurate and up-to-date information herein, Trend Micro makes no warranties or representations of any kind as to its accuracy, currency, or completeness. You agree that access to and use of and reliance on this document and the content thereof is at your own risk. Trend Micro disclaims all warranties of any kind, express or implied. Neither Trend Micro nor any party involved in creating, producing, or delivering this document shall be liable for any consequence, loss, or damage, including direct, indirect, special, consequential, loss of business profits, or special damages, whatsoever arising out of access to, use of, or inability to use, or in connection with the use of this document, or any errors or omissions in the content thereof. Use of this information constitutes acceptance for use in an “as is” condition.

Contents

Abstract
Introduction
Targeted Attack Detection with SPuNge
    Preprocessing
    Clustering in Targeted Attack Detection
    Clustering in SPuNge
    Labeling and Data Reduction
    Machine Mapping
    Grouping
    Analysis Framework
Implementation
    Duplicate Identification and Optimization
    Distributed Distance Computation
Experiments
    Data Set Optimization
    Findings
    Ethical Considerations
Related Works
Conclusion
References



Abstract

Over the past several years, we have seen a noticeable rise in the number of reported targeted attacks and advanced persistent threats (APTs). Security experts are seeing a landscape shift from widespread malware attacks that indiscriminately affect systems to those that take a more selective and targeted approach to pursue higher gains. One thing is clear, however: targeted attacks are difficult to detect, and little research has been conducted so far on these types of attacks. In this research paper, we propose a novel system we call “SPuNge” that processes threat information collected from actual users to detect potential targeted attacks for further investigation. We used a combination of clustering and correlation techniques to identify groups of machines that share a similar behavior with respect to the malicious resources they access and the industry in which they operate (e.g., oil and gas). We evaluated our system against actual Trend Micro data collected from over 20 million customer installations worldwide. The results show that our approach works well in practice and can assist security analysts in cybercriminal investigations.

Introduction

Over the past several years, we have seen a noticeable rise in the number of reported targeted attacks and APTs. These attacks are carried out by attackers with different motivations but primarily financial gain and espionage. Even though financial gain is a factor for widespread attacks, espionage is more limited to attacks of a targeted nature. Overall, security experts worldwide are seeing a landscape shift from widespread malware attacks that indiscriminately affect systems to those that take a more selective and targeted approach to pursue higher gains.

While it is unlikely that widespread malware attacks will completely vanish or even noticeably decrease in number, almost all industry commentators agree that targeted attacks will continue to increase in volume. This view is also echoed by the media and security company customers who are very concerned about attacks targeting their organizations. Notable examples of recent targeted attacks include Red October and IXESHE.1

One difficulty when discussing targeted attacks is that everyone has a different understanding of what they are. For the purposes of this paper, we will use the following definition: “A targeted attack refers to an electronic attack carried out by a group of attackers against a specific organization, country, or industry with the goal of stealing data or gaining control of company resources.”

1 Softpedia. (January 21, 2013). “AlienVault and Kaspersky Help Organizations Neutralize Red October Attack.” Last accessed June 24, 2013, http://news.softpedia.com/news/AlienVault-and-Kaspersky-Help-Organizations-Neutralize-Red-October-Attack-322919.shtml; David Sancho, Jessa dela Torre, Matsukawa Bakuei, Nart Villeneuve, and Robert McArdle. (2012). “IXESHE: An APT Campaign.” Last accessed June 24, 2013, http://www.trendmicro.com/cloud-content/us/pdfs/security-intelligence/white-papers/wp_ixeshe.pdf.


In fact, what sets a targeted attack apart from a widespread attack is purely the motivation behind them and their victims (i.e., targets), while the actual malware or technology adopted is largely irrelevant. For example, a banking Trojan (e.g., ZeuS) infection across 50 countries would be considered a widespread attack while the same attack against two nuclear power plants—and no one else—is an example of a targeted attack. The tool is identical but the motivation of the attackers and the target victims set them apart.

One thing is clear: targeted attacks are difficult to detect, and little research has been conducted so far on these types of attacks. In this paper, we propose a novel system that processes threat information collected from actual users (i.e., consumer and server machines) to detect potential targeted attacks. Often, these attacks have generic detections that do not call them out as targeted in an obvious way. However, using our approach, we were able to reduce millions of normal malicious events down to a more manageable number for further in-depth analysis by:

• Using a combination of clustering techniques to identify groups of machines that share a similar behavior with respect to the malicious resources they request or access (e.g., exploit kits, drive-by downloads, or command-and-control [C&C] servers)

• Correlating the location and industry in which infected machines operate (e.g., oil and gas or government) to discover interesting attack operational details

• Implementing a working prototype of our system, SPuNge

• Evaluating our system by analyzing one week’s worth of Trend Micro threat data from over 20 million user installations worldwide

This paper also describes our approach to detecting potential targeted attacks, how our system is designed, how it uses clustering, how it is implemented, and what solutions we introduced to efficiently analyze data. We also addressed ethical concerns, described how we conducted experiments, and presented our findings.

Targeted Attack Detection with SPuNge

We define a targeted attack as “an electronic attack carried out by a group of attackers against a specific organization, country, or industry with the goal of stealing data or gaining control of a company’s resources (i.e., the victims are often located within one or a few geographic locations, or all operate in the same industry).”



Our approach consists of two phases. In the first phase, we analyze the malicious URLs that regular user machines access over HTTP or HTTPS with an Internet browser or any other HTTP client, for instance, because they are infected by malware. We identify machines that present a similar network behavior (e.g., accessing web pages used in the same phishing campaign or malware attack). We then apply a combination of clustering techniques to group together similar malicious URLs and “organize” the machines based on the URL clusters they requested. SPuNge comprises six main components that carry out preprocessing, distance matrix computation, clustering, data reduction, machine mapping, and grouping.

Figure 1: SPuNge architecture

In the second phase, we correlate machine clusters that present a similar behavior and identify machines, networks, or organizations that are more likely to be involved in a targeted attack like those that operate in the same industry (e.g., oil and gas). We developed a framework to analyze the results obtained from processing and automatically generate a report for security analysts, which will be discussed in greater detail later.

Preprocessing

The preprocessing stages involve loading and parsing threat data that requires analysis, discarding information irrelevant to detecting targeted attacks and identifying duplicates and redundancies among the URLs.



SPuNge processes collections of threat events (i.e., malicious URLs requested over HTTP or HTTPS, via an Internet browser or any other HTTP application, by regular users whose network-connected systems are infected by malware). These URLs (e.g., web pages that drop malware, rogue antivirus solutions, or remote access Trojans [RATs]; distribute drive-by download code; are part of phishing campaigns; or point to C&C servers) are classified as “malicious” and blocked at the client side by a security program.

We processed threat events, retaining only events that are relevant in detecting possible targeted attacks. Note, however, that an event consists of the URL and information on the machine that accessed it. We carried out the following to filter data:

1. Classification: We disregarded “parental-controlled” URLs (e.g., sites that host pornographic or violent content) because they are not relevant in detecting targeted attacks.

2. Network sampling: We kept a single infected candidate per network and URL (i.e., one IP address per class). We stored information on each infected network but not for each infected machine.

3. Event sampling: If a single machine accessed a URL several times, we only logged this information once (i.e., the first request). A typical scenario involves a bot that regularly pulls commands from a C&C server.

4. Duplicate removal: If a large number of machines accessed a single URL, we disregarded the event if the number is greater than a preset threshold because this is not relevant in detecting targeted attacks.

5. Whitelisting: We disregarded clusters of popular URLs (e.g., URLs that are massively accessed on a regular basis). More details on this are given in Data Set Optimization.
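The first four filtering steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SPuNge implementation: the `Event` fields, the category value `"parental_control"`, the use of an IPv4 /24 prefix as the "network," and the `max_machines` threshold are all hypothetical choices made for the example. Whitelisting is left out because it operates on clusters rather than on individual events.

```python
from collections import defaultdict
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Event:
    """One threat event: a blocked URL plus the requesting machine (hypothetical fields)."""
    url: str
    machine_ip: str
    category: str
    timestamp: int


def network_of(ip: str) -> str:
    # Assumption: a "network" is approximated by the /24 prefix of an IPv4 address.
    return str(ip_network(f"{ip}/24", strict=False))


def preprocess(events, max_machines=1000):
    # Step 1, classification: drop parental-control detections.
    kept = [e for e in events if e.category != "parental_control"]

    # Steps 2 and 3, network and event sampling: keep only the first
    # request seen for each (network, URL) pair.
    first_seen = {}
    for e in sorted(kept, key=lambda e: e.timestamp):
        first_seen.setdefault((network_of(e.machine_ip), e.url), e)

    # Step 4, duplicate removal: drop URLs requested from more networks
    # than the preset threshold (counting networks is an assumption).
    networks_per_url = defaultdict(set)
    for net, url in first_seen:
        networks_per_url[url].add(net)
    return [
        e for (net, url), e in first_seen.items()
        if len(networks_per_url[url]) <= max_machines
    ]
```

Keying the dictionary on the (network, URL) pair handles both sampling steps at once, since repeated requests from the same machine necessarily share the same network.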

Clustering in Targeted Attack Detection

Given a set of predefined arbitrary criteria, we grouped similar elements into clusters. Clustering is important to identify patterns among the data we collected without prior knowledge, thus reducing the number of elements that should be examined. This allows us to have an aggregate view of possible attacks. More specifically, clustering allows us to identify groups of malicious events that share similarities like URL hostname or request path.



Attackers are known to register and exploit groups of similar domains (e.g., using typosquatting techniques) to instigate attacks like drive-by downloads, phishing campaigns, scams, and click-fraud schemes by leveraging the popularity of the domains they are trying to spoof.

Avast! reported, for instance, that craigslist has been heavily typosquatted with hundreds of domains to lure visitors into taking a fake “quiz” in return for which they would receive prizes like an iPhone®.2 These fraudulent sites typically make money through premium phone calls, selling ads, and reselling email addresses collected from visitors. The following are just some samples of spoofed craigslist sites:

• cr5aigslist.com

• crageslist.com

• craigaslist.org

• creagslist.com

• craigsli8st.com

• craigslisg.com

• craigdlist.com

• craiclist.com

• craigsliost.com

• crauglist.com

• craigslisrt.com

• craighlist.org

• craigslit.ca

• cragslists.com

• craigslistnc.com

• craigslistny.com

• craigilist.com

• craigllist.org

Another report claimed that 10 of the top 50 financial institutions that offer online banking services have been phished with hundreds of fake domains that seemingly looked like the original sites.3 Attackers, for instance, have registered more than 50 spoofed chase.com domains and deployed malicious variants of Chase Bank’s online portal to steal the online banking credentials of unaware victims.

2 Lyle Frink. (March 23, 2012). Avast! Blog. “Misspelling Goes Criminal with Typosquatting.” Last accessed June 25, 2013, https://blog.avast.com/2012/03/23/misspelling-goes-criminal-with-typosquatting/.

3 Joey Hernandez. “Typo Squatting: The Threat Network Defense Teams Overlook.” Last accessed June 25, 2013, http://www.academia.edu/818541/TypoSquatting_-_Malicious_Domains_Malware_Domains_Part_Of_The_Miasmatic_Threat_.


While domain clustering is an effective approach to detect malicious URLs that share the same characteristics, a complementary approach is to disregard the domain part of the URL and cluster according to the HTTP request string (i.e., path and query string). This is especially useful because botnet operators and exploit kit authors normally do not rely on a single domain for their malicious schemes, as this is generally seen as a single point of failure. Instead, they prefer to organize their infrastructure to span several often-compromised machines and locations and regularly switch from one to another.

Two notable examples of exploit kits are the Blackhole Exploit Kit and the Nuclear Exploit Kit, which are currently among the most popular methods of delivering malicious payloads, like rogue antivirus solutions, banking Trojans (e.g., ZeuS), or ransomware, to a victim’s computer. The attack starts with the installation of an exploit kit into a web page that acts as a landing server while attracting as many victims as possible (e.g., via spam). The landing page normally has an obfuscated JavaScript that detects the user machine’s configuration to serve the appropriate exploit. As shown in the table below, a URL’s request can characterize it as a threat related to a particular exploit kit (e.g., Blackhole or Nuclear).

Sample URLs Used in Blackhole and Nuclear Exploit Kit Attacks Detected by SPuNge

Exploit Kit   URL Host                   URL Request
Blackhole     http://77.79.13.88         /content/w.php?f=52&e=4
Blackhole     http://188.127.249.241     /image/l.php?f=553&e=2
Blackhole     http://brown.mydomxd.org   /root/w.php?f=2293&e=6
Nuclear       http://zeak.rghil.info     /a456gh/9493af39692e[...].jar
Nuclear       http://163.1.32.2          /1rg54e/55c2b44e0c8a[...].jar
Nuclear       http://31.184.244.9        /6ju9a2/bb136b125774[...].jar



SPuNge uses a clustering algorithm that allows us to identify groups of malicious URLs that share either similar hostnames (i.e., domains) or requests. As specified in RFC 3986, a uniform resource identifier (URI) schema is defined as “a hostname (e.g., a domain name or an IP address) and a request path as a sum of the path itself and the query string (i.e., a series of parameter-value pairs).”4 As such, clustering URL information involves computing two different sets of distances, namely:

• Distances measuring the similarity between URL hostnames, also known as “host distances”

• Distances measuring the similarity between URL requests intended as path and query strings, also known as “request distances”

It is, therefore, important to choose an appropriate distance (i.e., an appropriate similarity criterion). The criterion depends on the type of data to cluster and usually takes one of two forms: a distance function, which estimates how “close” two elements of a data set are to one another, or a metric (i.e., a function that calculates the unique coordinate of each element in a multidimensional algebraic space). Whether a data set can be placed in a metric space or only supports pairwise distance computation is also one of the factors that determine which clustering algorithm to use.
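Both sets of distances start from the same decomposition of a URL into a hostname and a request. A minimal sketch of that split, using Python's standard `urllib.parse` module, is shown here; the helper name `split_url` is an assumption made for the example.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> tuple[str, str]:
    """Return (hostname, request), where request is the path plus any query string."""
    parts = urlsplit(url)
    request = parts.path or "/"
    if parts.query:
        request += "?" + parts.query
    return parts.hostname or "", request


host, request = split_url("http://brown.mydomxd.org/root/w.php?f=2293&e=6")
print(host)     # brown.mydomxd.org
print(request)  # /root/w.php?f=2293&e=6
```

The hostname feeds the host distance, and the request feeds the request distance, so the two clusterings never see each other's part of the URL.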

Clustering in SPuNge

Clustering has been widely covered in literature. The most renowned and widely adopted algorithms include k-means, used in aggregating traffic to detect malware; x-means, used in detecting Domain Generation Algorithm (DGA)-based malware or protocol- and structure-independent botnets; and hierarchical clustering, used in detecting HTTP-based malware.5 One of our system requirements is to be able to rapidly and efficiently process data. Due to the predominantly textual nature of the data we analyzed, we used a hierarchical single-linkage clustering algorithm, specifically because:

4 Network Working Group. (January 2005). “Uniform Resource Identifier (URI): Generic Syntax.” Last accessed June 25, 2013, http://www.ietf.org/rfc/rfc3986.txt.

5 A.K. Jain, M.N. Murty, and P.J. Flynn. (September 1999). “Data Clustering: A Review.” Last accessed June 25, 2013, http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf; Ting-Fang Yen and Michael K. Reiter. (2008). “Traffic Aggregation for Malware Detection.” Last accessed June 25, 2013, http://www.rsa.com/rsalabs/staff/bios/tfyen/publications/DIMVA08-TAMD.pdf; Manos Antonakakis. (August 10, 2012). “From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware.” Last accessed June 25, 2013, http://www.usenix.org/sites/default/files/conference/protected-files/antonakakis_sec12_slides.pdf; Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. “BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection.” Last accessed June 25, 2013, https://www.damballa.com/downloads/a_pubs/Usenix08.pdf; Roberto Perdisci, Wenke Lee, and Nick Feamster. “Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces.” Last accessed June 25, 2013, https://215a1886-a-62cb3a1a-s-sites.googlegroups.com/site/robertoperdisci/publications/publication-files/nsdi10-final190.pdf?attachauth=ANoY7coAd9uKqooy2xG8HK_2DIFSSsu1-zZAqqSbogbSF6RYYhZNCGEm3x8mfshPEF_V4AsSQpTgrbdH_ROECsCUpWEUAg01-gFv_qXp0_TliEKOTwDRhho_zYHYHJP_aMF8KNqx2p8JwszBE2MU28iNe3O-b8iFsGXj9JtG1V73AY-YTjuZ47IiYDQYSebhlDmC7KwkLhsqeLq039KKegGEmX_ZTNPzvJtiE8lsBHaSJbDCV4IrQ41TJTbqcZQtAtMO1ut6H3Z-hS8ITt5qaX72nZWRCwNPZQ%3D%3D&attredirects=0.


• Algorithms like k-means usually require initially setting the number of clusters, which was unknown in our scenario and had to be computed. Even though variants like x-means do not require initially knowing the number of clusters, they still involve several iterations over the same data set and additional cluster validation at the end of each iteration, thus adding computational costs to the processing.

• Algorithms like k-means require a Euclidean distance to compute the cluster centroids while hierarchical clustering can cope with nonmetric distances like the ones used on our data set.

• Hierarchical clustering, in its single-linkage variant, does not require recomputing new distances when clusters are created, thus saving time and resources.

In the first phase, two distinct distance matrices are computed: one that measures how similar each URL pair is based on each URL’s hostname (i.e., host distance) and another that measures how similar their requests are (i.e., request distance). The similarity is determined using the distance functions described later. The algorithm processes both distance matrices and groups similar URLs into two distinct cluster sets: C_host (i.e., clusters of URLs that share similar hostnames, or host clusters) and C_req (i.e., clusters of URLs that share similar requests, or request clusters). To create each set, the clustering algorithm runs as follows:

1. Each distance (d) is considered in ascending order.

2. The list of all URL pairs (e1, e2) at distance d is parsed.

3. Each pair is assigned to a new cluster (C_new).

4. If e1 or e2 has already been assigned to a previous cluster (C_old), C_new assimilates all of the elements in C_old, then C_old is discarded.

5. The operation is repeated until a given threshold (T) is reached (i.e., until there are no more pairs at a distance lower than T).

As such, T is the maximum distance allowed to cluster URLs together. The threshold serves two purposes: it acts as a termination condition for clustering, which defines the size and quality of each cluster, and it limits the amount of processing resources needed (i.e., every pair with a distance higher than the threshold can be freed from memory).
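The procedure above can be sketched in a few lines. This is an illustration rather than the SPuNge code: it uses a union-find structure in place of the explicit new-cluster/old-cluster bookkeeping, treats pairs exactly at the threshold as still mergeable, and drops single-element groups. The function name and its parameters are assumptions made for the example.

```python
from itertools import combinations


def single_linkage(items, distance, threshold):
    """Group items that are linked by chains of pairwise distances within the threshold."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Consider every pair in ascending order of distance.
    pairs = sorted(
        ((distance(a, b), a, b) for a, b in combinations(items, 2)),
        key=lambda triple: triple[0],
    )
    for d, a, b in pairs:
        if d > threshold:
            break  # all remaining pairs are farther apart
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    # Keep only groups with more than one member (a choice made for this sketch).
    return [members for members in clusters.values() if len(members) > 1]
```

Because single linkage only ever needs the original pairwise distances, no distance is recomputed after a merge, which is the property the text cites as a reason for choosing it.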



• Host distance: Because hostnames are strings, similarities between them can be determined using one of several well-known functions that quantify how similar two strings of text are. The Hamming distance, for instance, counts the number of positions at which two strings of equal length hold different symbols; the Jaccard distance treats text as a set of characters and counts how many characters two sets do not have in common; and the Levenshtein distance is probably the most common.

The Levenshtein distance determines how similar two strings are by obtaining the minimum number of edit operations needed to change one string into the other. An edit operation can be an insertion, a deletion, or a modification of a character in a string. As an example, two edit operations are needed to change the word “Robert” to “Roger”: “Robert” → “Rogert” and “Rogert” → “Roger.” The Levenshtein distance also allows comparing strings of different lengths, unlike the Hamming distance, and keeps information on duplicate characters and their positions, unlike the Jaccard distance.
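For reference, the textbook dynamic-programming formulation of the Levenshtein distance is sketched here. It is the standard algorithm, not code taken from SPuNge, and it reproduces the “Robert” to “Roger” example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Robert", "Roger"))  # 2
```

Only two rows of the table are kept at any time, so memory use grows with the length of the shorter dimension rather than with the product of both lengths.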

SPuNge uses the Levenshtein function normalized in the interval [0, 1] to compute the distance d_host between two hostnames. Several desirable properties made it a suitable candidate, as the example in the table below shows. The table lists the distances computed over some of the typosquatting domains for craigslist plus an additional domain (i.e., google.com), which is very different from the others. These properties include:

1. The distance is 0 for the same domain.

2. The distance is symmetric.

3. A clearly quantifiable difference between similar and nonsimilar domains exists—different domains show a distance much higher than similar ones, making it relatively simple to differentiate two domains (i.e., boldfaced text).



Sample Distance Matrix for Hostnames (Normalized Levenshtein)

                  cr5aigslist.com  craigsli8st.com  crauglist.com  craeglist.com  google.com
cr5aigslist.com   0                0.0666           0.1428         0.1428         0.520
craigsli8st.com   0.0666           0                0.1428         0.1428         0.520
crauglist.com     0.1428           0.1428           0              0.0769         0.478
craeglist.com     0.1428           0.1428           0.0769         0              0.478
google.com        0.520            0.520            0.478          0.478          0
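The text does not spell out how the Levenshtein distance is normalized. As an observation only, the sample values in the matrix are reproduced by a variant in which a substitution counts as two operations (one deletion plus one insertion) and the result is divided by the combined length of the two strings. Whether SPuNge computes the distance this way is an assumption; the function names are hypothetical.

```python
def indel_distance(a: str, b: str) -> int:
    """Edit distance using insertions and deletions only (a substitution costs 2)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1])  # matching characters cost nothing
            else:
                current.append(1 + min(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Scale the distance into [0, 1] by the combined length of both strings."""
    total = len(a) + len(b)
    return indel_distance(a, b) / total if total else 0.0


print(round(normalized_distance("cr5aigslist.com", "craigsli8st.com"), 4))  # 0.0667
print(round(normalized_distance("crauglist.com", "craeglist.com"), 4))      # 0.0769
print(round(normalized_distance("google.com", "crauglist.com"), 3))         # 0.478
```

The matrix lists 0.0666 and 0.1428 where this sketch prints 0.0667 and 0.1429, which suggests the published values were truncated rather than rounded.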

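• Request distance: As previously mentioned, a URI consists of a hostname and a request, the latter being the sum of the path itself and the query string (i.e., a sequence of parameters and values). We measure the two parts of the request separately, with a path distance d_path and a query-string distance d_qsl.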

As a consequence, the request distance d_req between two URL requests is more complex than d_host because it is composed of two distinct metrics. We use the normalized Levenshtein function to determine similarities between paths and the Jaccard function to determine similarities between query strings. Note that we apply the Jaccard function only to the query string’s parameters and not their values by counting how many parameters the two requests have in common. We disregard the values because they often change and poorly characterize a request.
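The query-string comparison can be sketched as a Jaccard distance over parameter names. This is a minimal example using the standard `urllib.parse.parse_qsl` function; the helper names and the convention that two empty query strings have distance 0 are assumptions.

```python
from urllib.parse import parse_qsl


def parameter_names(query: str) -> set[str]:
    """Names of the parameters in a query string; the values are ignored."""
    return {name for name, _ in parse_qsl(query, keep_blank_values=True)}


def jaccard_distance(query_a: str, query_b: str) -> float:
    """1 - |A & B| / |A | B| over the two parameter-name sets."""
    a, b = parameter_names(query_a), parameter_names(query_b)
    if not a and not b:
        return 0.0  # assumption: two requests without parameters are identical
    return 1 - len(a & b) / len(a | b)


# Two Blackhole-style query strings share both parameter names.
print(jaccard_distance("f=52&e=4", "f=553&e=2"))  # 0.0
# Only one of three distinct names is shared here, so the distance is about 0.667.
print(jaccard_distance("f=52&e=4", "f=52&id=7"))
```

Ignoring the values is what lets `f=52&e=4` and `f=553&e=2` collapse to a distance of zero even though every value differs.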



Figure 2: The request distance d_req is a result of a comparison of the similarities of d_path and d_qsl

The request distance shown in Figure 2 is the result of two subdistances, obtained using the formula:

$$d_{req} = \sqrt{d_{path}^{2} + (\mathit{WeightFactor} \cdot d_{qsl})^{2}} \qquad (1)$$

WeightFactor is a numerical factor that rescales d_qsl so that d_qsl and d_path bear an equal contribution to d_req. The rule of thumb to calculate WeightFactor is that the thresholds T_path and T_qsl for both components should be the same. d_path uses the same Levenshtein distance as d_host, and the two thresholds, T_path and T_host, have the same value (i.e., 0.15). Two paths with a distance lower than 0.15 are considered similar. However, we consider two query strings similar if at least half of their parameters are common. Hence, the threshold T_qsl is 0.5. To normalize T_qsl to 0.15, we set WeightFactor to 0.333. Applying equation 1 to T_path and T_qsl gives a compound threshold T_req of 0.15√2 = 0.212.
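A short numerical check of the compound threshold is sketched here. It assumes the request distance is the Euclidean norm of the path distance and the rescaled query-string distance, which is the form consistent with the stated value 0.15√2. The weight factor is left as a parameter: the text quotes 0.333, while rescaling 0.5 exactly onto 0.15 corresponds to 0.15 / 0.5 = 0.3, and the example uses the latter so that the stated threshold is reproduced.

```python
import math

T_PATH = 0.15  # path threshold stated in the text
T_QSL = 0.5    # query-string threshold stated in the text


def request_distance(d_path: float, d_qsl: float, weight_factor: float) -> float:
    """Assumed Euclidean combination of the path and rescaled query-string distances."""
    return math.hypot(d_path, weight_factor * d_qsl)


# Weight factor that maps T_QSL exactly onto T_PATH.
exact_weight = T_PATH / T_QSL
t_req = request_distance(T_PATH, T_QSL, exact_weight)
print(round(exact_weight, 3))  # 0.3
print(round(t_req, 3))         # 0.212
```

With both components rescaled to the same threshold, neither the path nor the query string dominates the decision on whether two requests belong to the same cluster.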



Labeling and Data Reduction

When clustering is carried out, the processed events are “organized” into two orthogonal cluster sets that contain URLs grouped according to our two similarity criteria (i.e., hostname and request).

Sample Host Clusters

Cluster  Cluster Label                Event  Hostname
C1       H zfmudav4aaq33r5.com        e1     zfmudav4aaq33r5.com
                                      e2     zfmudav4aaq35r5.com
                                      e3     zfmudav3aap36r5.com
                                      e4     zfmudav2acq35r4.com
C2       H facebookc.com              e5     facebookc.com
                                      e6     facaebook.com
                                      e7     faceboook.com
                                      e8     facebopok.com
C3       H h-aelameftzgj4vxient.com   e9     h-aelameftzgj4vxient.com
                                      e10    h-aelameftxcd5vxient.com
                                      e11    h-aelameftssd6vxient.com
                                      e12    h-aelanfftzgj1vxient.com

Sample Request Clusters

Cluster  Cluster Label                Event  Request
C4       R/get2.php?c=BLMEUGUBd=266   e1     /get2.php?c=BLMEUGUBd=266
                                      e2     /get.php?c=ZLXULJNRd=266
C5       R/9MzImdHA9MCZmbD0w0         e3     /9MzImdHA9MCZmbD0w0
                                      e4     /9MzImdHB9MCZmbD0w1
C6       R/qKA0rO4d8I7qBhS7Y2xrPTQu   e9     /qKA0rO4d8I7qBhS7Y2xrPTQu
                                      e10    /IkG1yP3L8q5YPtU7Y2xrPTQu
                                      e11    /BAq3T78d8l5Q7bs0Y2xrPTQu
                                      e12    /pA71gKND6P5MTls9Y2xrPTQu



Cluster labeling involves assigning “human-friendly” labels to clusters. Data reduction, meanwhile, involves reducing the number of clusters by identifying redundancies and merging clusters.

We introduced labels to rapidly visualize the content of a cluster (i.e., without the need to inspect the URLs contained within). We labeled each cluster using the following conventions:

• The host cluster labels are prefixed with an “H,” followed by the hostname of the cluster’s first event.

• The request cluster labels are prefixed with an “R,” followed by the request of the cluster’s first event.

Data reduction involves going through both cluster sets, C_host and C_req, and performing cluster merging when certain conditions are met. Merging two clusters of different types involves discarding one of the two and updating the label of the “survivor” to reflect the merger. The following are some possible results:

1. If C_host and C_req contain the same information, C_req is discarded and C_host’s label is updated to CHOST-LABEL =: CREQ-LABEL.6

2. If C_host is a subset of C_req, C_host is discarded and C_req’s label is updated to CREQ-LABEL >: CHOST-LABEL.

3. If C_req is a subset of C_host, C_req is discarded and C_host’s label is updated to CHOST-LABEL >: CREQ-LABEL.

6 The choice of which cluster to discard is purely arbitrary since the clusters contain the same events.
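The three merging rules can be sketched as follows. This is a simplified illustration that assumes each cluster is a set of event identifiers keyed by its label; the function name, the data layout, and the handling of clusters that match none of the rules (they are kept unchanged) are assumptions. It covers only the simple cases described above.

```python
def merge_clusters(host_clusters, request_clusters):
    """Merge request clusters with host clusters when one contains the other.

    Both arguments map a label to the set of event ids in that cluster.
    Returns a new label -> events mapping.
    """
    merged = {label: set(events) for label, events in host_clusters.items()}
    for r_label, r_events in request_clusters.items():
        for h_label, h_events in list(merged.items()):
            if r_events == h_events:
                new_label = f"{h_label} =: {r_label}"
            elif r_events < h_events:    # request cluster is a proper subset
                new_label = f"{h_label} >: {r_label}"
            elif h_events < r_events:    # host cluster is a proper subset
                new_label = f"{r_label} >: {h_label}"
                h_events = r_events      # the request cluster survives
            else:
                continue
            del merged[h_label]
            merged[new_label] = set(h_events)
            break
        else:
            merged[r_label] = set(r_events)  # unrelated cluster, kept as is
    return merged
```

Because a surviving cluster is reinserted under its updated label, a host cluster that contains several request clusters accumulates them in sequence, giving labels of the form `H ... >: R... >: R...`.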



Sample Merged Clusters

C1  Cluster Label: H zfmudav4aaq33r5.com >: R/get2.php?c=BLMEUGUBd=266 >: R/9MzImdHA9MCZmbD0w0
    Event  URL
    e1     zfmudav4aaq33r5.com/get2.php?c=BLMEUGUBd=266
    e2     zfmudav4aaq35r5.com/get.php?c=ZLXULJNRd=266
    e3     zfmudav3aap36r5.com/9MzImdHA9MCZmbD0w0
    e4     zfmudav2acq35r4.com/9MzImdHB9MCZmbD0w1

C2  Cluster Label: H facebookc.com
    Event  URL
    e5     facebookc.com
    e6     facaebook.com
    e7     faceboook.com
    e8     facebopok.com

C3  Cluster Label: H h-aelameftzgj4vxient.com =: R/qKA0rO4d8I7qBhS7Y2xrPTQu
    Event  URL
    e9     h-aelameftzgj4vxient.com/qKA0rO4d8I7qBhS7Y2xrPTQu
    e10    h-aelameftxcd5vxient.com/IkG1yP3L8q5YPtU7Y2xrPTQu
    e11    h-aelameftssd6vxient.com/BAq3T78d8l5Q7bs0Y2xrPTQu
    e12    h-aelanfftzgj1vxient.com/pA71gKND6P5MTls9Y2xrPTQu

The table above shows the results of the merging process performed for the sample host and request clusters. The labeling convention we adopted provides a quick understanding of the clusters’ content and the relationships between the merged clusters. C1’s label, for instance, tells us that C1 is a cluster of URLs that have hostnames similar to zfmudav4aaq33r5.com and requests that can be grouped into those similar to /get2.php?c=BLMEUGUBd=266 and those akin to /9MzImdHA9MCZmbD0w0. A second cluster, C3, contains URLs with hostnames similar to h-aelameftzgj4vxient.com and requests akin to /qKA0rO4d8I7qBhS7Y2xrPTQu.

Machine Mapping

Up to this point, the processed results contain information on similar malicious URLs, which we clustered together. The machine mapping component aims to identify and correlate the malicious request sources (i.e., by finding out which groups of machines sent out requests to the same cluster and which clusters machines accessed). We wanted to know which machines exhibited the same malicious network behaviors, for instance, because they were targeted by the same phishing campaign or were part of the same botnet.

To do this, we carried out the following transformations on the merged clusters:

1. For each cluster, we extracted a user machine identifier from all cluster events (i.e., the machine’s IP address in anonymized form to address privacy concerns, described in more detail in Ethical Considerations).

2. We then produced an association table (see below) in the form, cluster → machine.



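Sample Cluster → Machine Associations

| Cluster | Cluster Label | Event | Source Machine |
|---|---|---|---|
| C1 | H zfmudav4aaq33r5.com >: R/get2.php?c=BLMEUGUBd=266 >: R/9MzImdHA9MCZmbD0w0 | e1 | M1 |
| | | e2 | M2 |
| | | e3 | M3 |
| | | e4 | M4 |
| C2 | H facebookc.com | e5 | M1 |
| | | e6 | M2 |
| | | e7 | M5 |
| | | e8 | M6 |
| C3 | H h-aelameftzgj4vxient.com =: R/qKA0rO4d8I7qBhS7Y2xrPTQu | e9 | M3 |
| | | e10 | M4 |
| | | e11 | M5 |
| | | e12 | M7 |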

3. We then built a second table in the form, machine → cluster (see below). Note that the two tables are the reverse format of each other.

Sample Machine → Cluster Associations

| Source Machine | Cluster |
|---|---|
| M1 | C1, C2 |
| M2 | C1, C2 |
| M3 | C1, C3 |
| M4 | C1, C3 |
| M5 | C2, C3 |
| M6 | C2 |
| M7 | C3 |
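The two association tables can be reproduced with a short Python sketch. The function names and the event-to-machine pairing are assumptions for illustration. The pairing follows the order in which events and machines are listed in the sample tables, and the machine identifiers stand in for the anonymized IP addresses.

```python
from collections import defaultdict

# Events per merged cluster, as in the sample tables.
cluster_events = {
    "C1": ["e1", "e2", "e3", "e4"],
    "C2": ["e5", "e6", "e7", "e8"],
    "C3": ["e9", "e10", "e11", "e12"],
}

# Anonymized source machine of each event (assumed positional pairing).
event_machine = {
    "e1": "M1", "e2": "M2", "e3": "M3", "e4": "M4",
    "e5": "M1", "e6": "M2", "e7": "M5", "e8": "M6",
    "e9": "M3", "e10": "M4", "e11": "M5", "e12": "M7",
}


def map_clusters_to_machines(cluster_events, event_machine):
    """Build the cluster -> machine table from the cluster events."""
    return {
        cluster: sorted({event_machine[event] for event in events})
        for cluster, events in cluster_events.items()
    }


def invert(cluster_machines):
    """Build the machine -> cluster table by reversing the first one."""
    machine_clusters = defaultdict(list)
    for cluster, machines in cluster_machines.items():
        for machine in machines:
            machine_clusters[machine].append(cluster)
    return dict(machine_clusters)


cluster_machines = map_clusters_to_machines(cluster_events, event_machine)
machine_clusters = invert(cluster_machines)
```

Here `machine_clusters` holds the same seven rows as the second table, for example `M5` mapped to `C2` and `C3`.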



Grouping

In grouping, we want to identify groups of machines that request the same set of clusters (i.e., more than one), as attacks often involve a multistep process (i.e., the attack is carried out in different phases). A well-known example is a drive-by download attack wherein the victim is first redirected to the malicious page then served the right exploit.

While so far the information we collected in the machine → cluster association tells us which individual machine accessed which destinations, we carried out one last operation to group clusters and machines together. Doing so allowed us to identify:

• If groups of uncorrelated resources (e.g., websites) were used in the same attack or campaign (e.g., malware or phishing campaign)

• If groups of machines requested the same malicious resources, for instance, because they were infected with the same malware variant

• If machines accessed a significantly high number of malicious resources, for instance, because they were heavily infected

The table below shows the results. As shown, two machines {M1, M2} access the same set of clusters {C1, C2} and form group G1, while a second pair of machines {M3, M4} shares the cluster set {C1, C3} and forms group G2.

Sample Groups (Machines and Clusters)

| Group | Machine Set | Cluster Set |
|---|---|---|
| G1 | M1, M2 | C1, C2 |
| G2 | M3, M4 | C1, C3 |
| G3 | M5 | C2, C3 |
| G4 | M6 | C2 |
| G5 | M7 | C3 |
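One simple way to obtain these groups is to key the machines on the exact set of clusters they accessed. The sketch below is a simplification for illustration and may differ from how SPuNge computes groups; `group_machines` is a hypothetical name.

```python
from collections import defaultdict

# machine -> cluster associations from the sample table.
machine_clusters = {
    "M1": ["C1", "C2"],
    "M2": ["C1", "C2"],
    "M3": ["C1", "C3"],
    "M4": ["C1", "C3"],
    "M5": ["C2", "C3"],
    "M6": ["C2"],
    "M7": ["C3"],
}


def group_machines(machine_clusters):
    """Group machines that requested exactly the same set of clusters."""
    by_cluster_set = defaultdict(list)
    for machine, clusters in machine_clusters.items():
        # frozenset makes the cluster set hashable and order-independent.
        by_cluster_set[frozenset(clusters)].append(machine)
    return [
        (sorted(machines), sorted(cluster_set))
        for cluster_set, machines in by_cluster_set.items()
    ]


groups = group_machines(machine_clusters)
```

The five entries of `groups` correspond to G1 through G5 in the table, in the same order.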



Analysis Framework

We discovered which groups of machines exhibited similar malicious network behaviors like accessing websites involved in the same phishing or malware campaign. We used a combination of clustering techniques to cluster malicious URLs that are similar to one another and “organized” the machines based on the URL clusters they accessed.

Next, we developed an automated analysis framework to analyze each group of machines and identify potential candidates of targeted attacks. We correlated their information (i.e., the industries wherein they operate and their geographic locations).

We ran two types of analysis—first on the cluster set and another on the groups of clusters (i.e., more than one). In the cluster analysis, we looked for N+ machines operating in the same industry or country, or a combination of both, and generating requests to URLs clustered together because they were similar. We made N vary between 2 and 5. In the group analysis, we searched for groups of N+ machines that shared C or more clusters. We used the results of the grouping previously described and made both N and C vary between 2 and 5. We verified that the identified machines did not share any behavior with other groups (i.e., a sign of a widespread attack).
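The two analyses can be illustrated with the following sketch. It is not the SPuNge framework itself: the function names, the `machine_info` records, and the industries and countries assigned to M1 through M7 are hypothetical, and the final check that machines share no behavior with other groups is omitted.

```python
# Hypothetical industry and country per machine (not from the data set).
machine_info = {
    "M1": {"industry": "Technology", "country": "Mexico"},
    "M2": {"industry": "Technology", "country": "Turkey"},
    "M3": {"industry": "Oil and Gas", "country": "Malaysia"},
    "M4": {"industry": "Oil and Gas", "country": "Malaysia"},
    "M5": {"industry": "Retail", "country": "Germany"},
    "M6": {"industry": "Education", "country": "Brazil"},
    "M7": {"industry": "Banking", "country": "Japan"},
}
cluster_machines = {
    "C1": ["M1", "M2", "M3", "M4"],
    "C2": ["M1", "M2", "M5", "M6"],
    "C3": ["M3", "M4", "M5", "M7"],
}
groups = [
    (["M1", "M2"], ["C1", "C2"]),
    (["M3", "M4"], ["C1", "C3"]),
    (["M5"], ["C2", "C3"]),
    (["M6"], ["C2"]),
    (["M7"], ["C3"]),
]


def cluster_candidates(cluster_machines, machine_info, n):
    """Cluster analysis: clusters where n+ machines share an attribute."""
    candidates = {}
    for cluster, machines in cluster_machines.items():
        hits = []
        for key in ("industry", "country"):
            buckets = {}
            for machine in machines:
                buckets.setdefault(machine_info[machine][key], []).append(machine)
            hits.extend(
                (key, value, members)
                for value, members in buckets.items()
                if len(members) >= n
            )
        if hits:
            candidates[cluster] = hits
    return candidates


def group_candidates(groups, n, c):
    """Group analysis: groups of n+ machines sharing c or more clusters."""
    return [
        (machines, clusters)
        for machines, clusters in groups
        if len(machines) >= n and len(clusters) >= c
    ]


by_cluster = cluster_candidates(cluster_machines, machine_info, n=2)
by_group = group_candidates(groups, n=2, c=2)
```

With N = 2 and C = 2, the group analysis keeps only the first two groups, and the cluster analysis flags, for instance, the two Malaysian oil and gas machines in C3.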

The machines that matched our criteria were included in an automatically generated report for security analysts. More detailed results from our system will be shown in Experiments.

Implementation

SPuNge is a Python 2.7 application, a prototype that analyzes threat information using clustering and correlation techniques to discover potential targeted attacks and victims.

Our implementation, however, encountered several challenges, including:

• The amount of threat data to process reached millions

• Some processing stages were highly complex (e.g., given N requests, we computed N(N − 1)/2 distances)

• To allow a continuous run on a single machine, the analysis over H hours of threat data must last less than H hours

• The processing has to be recoverable in the event of a crash (i.e., as modern file systems guarantee)
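To give a sense of scale for the pairwise stage, the sketch below counts the pairs under the assumption that every unordered pair of items is compared once. The helper name `pair_count` is hypothetical. The two input sizes are the clustered-event counts reported for Sunday in the processing results table.

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Sunday's hostname and request clustering inputs from the results table.
host_pairs = pair_count(20433)      # about 208 million pairs
request_pairs = pair_count(111073)  # about 6,168 million pairs
```

Both values agree with the "Computed pairs (M)" figures reported for that day, which supports the quadratic growth described above.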



Duplicate Identification and Optimization

On top of the preprocessing filtering operations, we also performed some additional data optimization operations in the initial stages. Note that the threat data can be redundant because the same URL can be requested several times by several infected machines. The strategy we adopted to handle this “special case” is to organize the data into two groups of duplicate and unique URLs in the preprocessing stage. We then restricted the computation of the distance matrices to a sample per duplicate group.

We particularly relied on a hash-based algorithm to identify duplicate elements. Upon parsing each URL, we computed two MD5 hashes (i.e., one per hostname and one per request) and grouped together URLs that had the same hostname or request. In the clustering stage, we randomly picked one “candidate” URL for each duplicate group to include in the computation of distance matrices.

Our empirical experiments over millions of requests showed that our approach reduced the clustering overhead time to 19%, resulting in around 10% fewer URLs to “distance-compute.”
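The hash-based duplicate identification can be sketched as follows. The helper names are hypothetical, and the URL splitting (everything after the hostname counts as the request) is an assumption about how the two parts are separated.

```python
import hashlib
import random
from collections import defaultdict
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into (hostname part, request part)."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    request = parts.path + ("?" + parts.query if parts.query else "")
    return parts.netloc, request


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def duplicate_groups(urls):
    """Group URLs by the MD5 of their hostname and of their request."""
    by_host = defaultdict(list)
    by_request = defaultdict(list)
    for url in urls:
        host, request = split_url(url)
        by_host[md5_hex(host)].append(url)
        by_request[md5_hex(request)].append(url)
    return by_host, by_request


def pick_candidates(groups, rng):
    """Randomly pick one candidate URL per duplicate group."""
    return [rng.choice(members) for members in groups.values()]


urls = [
    "a.example/x.php?c=1",
    "a.example/y.php",
    "b.example/x.php?c=1",
    "a.example/x.php?c=1",
]
by_host, by_request = duplicate_groups(urls)
host_candidates = pick_candidates(by_host, random.Random(0))
```

In this toy input, four URLs collapse into two hostname groups and two request groups, so each distance matrix would be computed over two candidates instead of four.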

Distributed Distance Computation

Distance computation is demanding on computing resources. Fortunately, it is also an operation over uncorrelated data that can run in distributed form. We designed SPuNge to use an algorithm that supports multiprocessing and so takes advantage of the multicore CPUs now commonly found in modern machines. Extending the system to distribute computing over several machines was relatively straightforward and did not involve making any change to the code, as this can be done with one of the many frameworks available for Python.7

We used this algorithm to compute the two distance matrices (i.e., host and request) and to find the machine → cluster groups in Grouping. The algorithm works this way:

• We built a linearized event matrix as a list of event pairs, also known as the “distances list.”

• To limit memory consumption, we split the distances list into N iteration segments for sequential processing. This way, we can run the process on memory-bound systems by configuring the number of iteration segments, N, accordingly.

7 Python. “Parallel Processing and Multiprocessing in Python.” Last accessed June 26, 2013, http://wiki.python.org/moin/ParallelProcessing.


• Each iteration was split into worker segments (i.e., sublists of request pairs) and a pool of worker processes was instantiated to process each segment (i.e., one process per segment). Workers had exclusive read access to the memory area containing the segments, which avoided the computational overhead of message passing and mutex locking as well as possible race conditions. The results were returned in a shared queue.

• A collector waited for the results, received the computed distances, and organized them by distance value, with each value associated with the list of pairs that measured that much. This made it easier to fetch distances for clustering. To optimize processing, the garbage collector was manually controlled. We suspended it during the intense distance computation stage and re-enabled it when we completed processing each iteration.

• After processing each iteration, each partial result is stored on disk to make the whole process fault-tolerant. Note that we performed this operation after every processing stage. We used an efficient C-based Python serialization library called “cPickle,” which we found worked very fast.8
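The steps above can be sketched with Python's standard library. This is an illustration under several assumptions and not the SPuNge code: it is written for Python 3, where `pickle` uses its C implementation automatically; it uses `multiprocessing.Pool.map` in place of the shared result queue; the distance function is a stand-in based on `difflib`; and all function names and the checkpoint file naming are hypothetical.

```python
import gc
import itertools
import os
import pickle
from collections import defaultdict
from difflib import SequenceMatcher
from multiprocessing import Pool


def process_segment(segment):
    """Worker: compute a stand-in distance for every pair in a segment."""
    return [
        (a, b, round(1.0 - SequenceMatcher(None, a, b).ratio(), 2))
        for a, b in segment
    ]


def chunks(seq, n):
    """Split seq into at most n contiguous chunks of similar size."""
    size = max(1, -(-len(seq) // n))  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def compute_distances(items, n_iterations, n_workers, checkpoint_dir):
    """Return {distance: [pairs]} and checkpoint each iteration on disk."""
    pairs = list(itertools.combinations(items, 2))  # the "distances list"
    by_distance = defaultdict(list)
    for number, iteration in enumerate(chunks(pairs, n_iterations)):
        segments = chunks(iteration, n_workers)
        gc.disable()  # suspend garbage collection during the heavy stage
        try:
            if n_workers > 1:
                with Pool(n_workers) as pool:
                    results = pool.map(process_segment, segments)
            else:
                results = [process_segment(s) for s in segments]
        finally:
            gc.enable()
        # Collector: organize the pairs by the distance they measured.
        partial = defaultdict(list)
        for segment_result in results:
            for a, b, distance in segment_result:
                partial[distance].append((a, b))
        # Store the partial result so a crash loses one iteration at most.
        path = os.path.join(checkpoint_dir, "iteration_%d.pkl" % number)
        with open(path, "wb") as handle:
            pickle.dump(dict(partial), handle, pickle.HIGHEST_PROTOCOL)
        for distance, found in partial.items():
            by_distance[distance].extend(found)
    return dict(by_distance)
```

A call such as `compute_distances(hostnames, n_iterations=4, n_workers=8, checkpoint_dir="/tmp/spunge")` would spread each iteration over eight processes. On platforms that start workers with the spawn method, the call should sit under an `if __name__ == "__main__":` guard.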

Experiments

We used SPuNge on actual threat data collected by Trend Micro from users in order to evaluate the possibility of detecting potential targeted attack campaigns. The Trend Micro™ Smart Protection Network™ cloud-based infrastructure collects about 6TB of threat data per day from over 20 million customer installations worldwide.9 The threat data is collected on an hourly basis and made available to analysts in the form of data feeds with an application programming interface (API) and a web application.

We based our analysis on a data feed that collects information on the malicious URLs users access over HTTP or HTTPS. When a user accesses a known malicious URL because it hosts a RAT or malware, or it is a C&C channel, for instance, the network security component of the antimalware solution generates an event for the Smart Protection Network. This event contains information on the requested URL and the user application that requested it, together with his/her machine’s configuration. The event has the following aggregate fields:

• The time when the event occurred (GMT converted)

• The URL requested and the IP address of the web server at the time the request was made

• The process that generated the HTTP or HTTPS request (i.e., name, size, and hash)

8 Python Software Foundation. (June 25, 2013). “11.1. pickle—Python Object Serialization.” Last accessed June 26, 2013, http://docs.python.org/2/library/pickle.html#module-cPickle.

9 Trend Micro Incorporated. (2013). “Smart Protection Network—Data Mining Framework.” Last accessed June 26, 2013, http://cloudsecurity.trendmicro.com/us/technology-innovation/our-technology/smart-protection-network/index.html.


• The IP address, geographic location (i.e., country), and OS version of the machine that generated the event

• The industry (i.e., banking, communications and media, education, energy, fast-moving consumer goods, financial, food and beverage, government, healthcare, insurance, manufacturing, materials, media, oil and gas, real estate, retail, technology, telecommunications, transportation, and utilities) wherein the entity attacked operates
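The aggregate fields listed above could be modeled as a simple record. The class and field names below are assumptions for illustration only, and the actual feed schema is not described in this paper. The sample values reuse the first event of Listing 1. The process size and hash are left empty because they are not given there.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ThreatEvent:
    """One malicious-URL event (hypothetical field names)."""
    timestamp: datetime            # time of the event, converted to GMT
    url: str                       # requested URL
    server_ip: str                 # web server IP at request time
    process_name: str              # process that generated the request
    process_size: Optional[int]    # size of that process, if known
    process_hash: Optional[str]    # hash of that process, if known
    machine_id: str                # anonymized source machine identifier
    country: str                   # geographic location of the machine
    os_version: str                # OS version of the machine
    industry: str                  # industry of the attacked entity


event = ThreatEvent(
    timestamp=datetime(2012, 11, 13, 9, 50, 35, tzinfo=timezone.utc),
    url="http://146.185.246.111/p98a.exe",
    server_ip="146.185.246.111",
    process_name="notepad.exe",
    process_size=None,
    process_hash=None,
    machine_id="NETWORK 1",
    country="Mexico",
    os_version="Windows 5.1",
    industry="Technology",
)
```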

We performed our experiments using one week’s worth of threat data spanning November 11–17, 2012 by deploying two physical machines (i.e., A and B) in our testing infrastructure. We used A to process the data via SPuNge and B to analyze the results, which we transferred from A to B over the network. B runs a PostgreSQL database. The machines were configured this way:

• A: 16-core Intel® Xeon 2.40GHz, 72GB RAM, 3.5TB hard disk

• B: 8-core Intel Xeon 2.83GHz, 16GB RAM, 4.0TB hard disk

Data Set Optimization

When we performed the experiments, we decided to analyze each day’s worth of data individually so that the results of day N could be used as input for the processing of day N + 1. In fact, we found that the data set we used to evaluate SPuNge contained “polluted” information like web pages that no longer appeared to be malicious or had been taken down. In addition, as a consequence of inaccurate signatures in URL-pattern matching, URLs that had, for instance, /icon.ico or /favicon.ico as path were considered malicious even when the file existed and was in fact benign. Finally, when the reputation was computed at the hostname level, we noticed that some web pages were filtered because they were hosted on a malicious IP address or domain, regardless of their nature. As a consequence, we observed several machines requesting the same group of URLs (i.e., generating events that are not necessarily associated with potential targeted attacks).

To cope with this limitation, after each day of analysis, we automatically extracted clusters of URLs that were requested by more than N machines because they were more likely to be used in widespread rather than targeted operations. Interesting results were revealed when we eliminated a big chunk of the clusters that had more than N = 25 machines.
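The daily exclusion step can be sketched as follows. The function names, the dictionary-based inputs, and the event layout are assumptions made for the example. Only the threshold of 25 machines comes from the text.

```python
def polluted_clusters(cluster_machines, threshold=25):
    """Clusters requested by more than `threshold` distinct machines."""
    return {
        cluster
        for cluster, machines in cluster_machines.items()
        if len(set(machines)) > threshold
    }


def update_exclusions(exclusions, cluster_urls, cluster_machines, threshold=25):
    """Add the URLs of today's polluted clusters to the exclusion list."""
    for cluster in polluted_clusters(cluster_machines, threshold):
        exclusions.update(cluster_urls[cluster])
    return exclusions


def filter_events(events, exclusions):
    """Drop the next day's events whose URL is already excluded."""
    return [event for event in events if event["url"] not in exclusions]


# Toy example with a threshold of 2 instead of 25.
cluster_machines = {"C1": ["M1", "M2", "M3"], "C2": ["M1"]}
cluster_urls = {
    "C1": ["bad.example/favicon.ico"],
    "C2": ["evil.example/get.php"],
}
exclusions = update_exclusions(set(), cluster_urls, cluster_machines, threshold=2)
next_day = filter_events(
    [{"url": "bad.example/favicon.ico"}, {"url": "evil.example/get.php"}],
    exclusions,
)
```

In the toy run, only the URL of the widely requested cluster C1 is excluded, so the next day's processing keeps just the event for `evil.example/get.php`.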



Our findings showed that our approach works. After the first two days, during which we learned how poor the quality of many URLs was, the number of events to process dropped from 300,000–500,000 to roughly 200,000. This halved the processing time and improved the quality of the results (i.e., we saw fewer clusters with a large number of machines; see the table below). The URLs that we identified daily as “polluted” were added to the exclusion list. These amounted to 252,998; 191,891; 112,627; 3,459; 2,255; and 2,413, respectively.

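Processing Results

| Number of... | Sunday (11) | Monday (12) | Tuesday (13) | Wednesday (14) | Thursday (15) | Friday (16) | Saturday (17) |
|---|---|---|---|---|---|---|---|
| Raw events (M) | 2.79 | 5.17 | 5.58 | 5.68 | 5.22 | 4.91 | 2.62 |
| Processed events | 387,339 | 536,524 | 256,270 | 221,954 | 230,758 | 269,103 | 329,458 |
| Processed machines | 10,866 | 15,581 | 15,413 | 15,391 | 14,165 | 14,364 | 8,406 |
| Detected clusters | 4,106 | 8,825 | 8,195 | 7,825 | 7,196 | 7,281 | 3,869 |
| Detected groups | 2,144 | 3,941 | 3,579 | 3,528 | 2,679 | 2,896 | 1,069 |
| Hostname clustering: clustered events | 20,433 | 46,361 | 44,362 | 38,663 | 37,232 | 41,349 | 24,596 |
| Hostname clustering: computed pairs (M) | 208 | 1,074 | 983 | 747 | 693 | 854 | 302 |
| Hostname clustering: iterations | 1 | 5 | 4 | 3 | 3 | 4 | 2 |
| Request clustering: clustered events | 111,073 | 192,075 | 159,294 | 117,271 | 146,279 | 136,137 | 138,335 |
| Request clustering: computed pairs (M) | 6,168 | 18,446 | 12,687 | 6,876 | 10,698 | 9,266 | 9,568 |
| Request clustering: iterations | 25 | 74 | 51 | 28 | 42 | 38 | 39 |
| Processing time (seconds): preprocessing | 268 | 769 | 665 | 667 | 604 | 527 | 302 |
| Processing time (seconds): matrices computation | 8,708 | 21,273 | 15,307 | 9,988 | 13,395 | 11,675 | 9,683 |
| Processing time (seconds): clustering | 3,198 | 10,665 | 4,906 | 2,500 | 4,983 | 4,739 | 12,987 |
| Processing time (seconds): merging | 2,089 | 31,370 | 23,040 | 9,632 | 18,614 | 13,581 | 5,082 |
| Processing time (seconds): machine mapping | 274 | 1,214 | 488 | 187 | 441 | 414 | 1,187 |
| Processing time (seconds): grouping | 51 | 93 | 74 | 62 | 67 | 61 | 48 |
| Total (hours:minutes) | 04:15 | 18:09 | 12:32 | 06:31 | 10:44 | 08:47 | 08:28 |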



Figure 3 shows the machine allocations in the different clusters (i.e., how many machines are clustered into groups of increasing size). The blue bars show the allocation during the first run on November 11 when no exclusions were made due to lack of previous information. The red bars, meanwhile, show the machine allocation on November 17 when an entire week’s worth of information coming from previous clusters was used to skim the raw data. The peak on the left was due to a large number of small clusters (i.e., 2–3 machines) while the tail on the right was made of a few clusters resulting from the merger of numerous events. The interesting result was due to the elimination of a big chunk of clusters with more than N = 25 machines.

Figure 3: Machine distribution across different clusters



In sum, we ran our automated processing on an entire week’s worth of threat data with SPuNge configured as follows: a clustering iteration size of 250 million, a grouping iteration size of 20 million, a duplicate and pollution threshold (N) of 25, and a clustering threshold (T) of 0.15 for the host clusters and 0.15√2 for the request clusters. URLs identified as duplicates or as polluted were then excluded from the processing. This allowed us, first, to avoid reprocessing the same URLs and focus on analyzing fresh data and, second, to keep only the more interesting information.

Findings

As shown in the table of processing results, our data set consists of about 5 million events on weekdays (i.e., from Monday to Friday) and 2.5 million on Sunday and Saturday (i.e., nonworking days in most of the users’ countries). After preprocessing filters were applied, we only analyzed an average of 200,000–300,000 events out of the total number.

We then further reduced this number to events that had unique URL hostnames or requests. We ran the distance matrices computation only on these URLs. We ended up creating two groups of data: the first had about 20,000–40,000 unique URL hostnames, which we then “host-clustered,” while the second had 100,000–200,000 unique URL requests, which we then “request-clustered.” The second group was much larger because malware authors often use variation patterns for the path or query string parts of a request. Typical examples of these are exploit kits like the Blackhole Exploit Kit and the Nuclear Exploit Kit, which use a variety of path parameters to identify the right exploit to serve to a victim based on his/her machine’s configuration.

We used clustering to achieve this result even though it is a slow approach. As previously mentioned, we carefully took this challenge into account, and designed and implemented appropriate solutions. Our empirical results on an actual data set showed that our system can process data online (i.e., the processing time was smaller than the data set interval). In fact, our timing measurements showed that SPuNge can handle the millions of raw threat events collected each day from over 20 million user installations in less than half a day on average, using a single 16-core machine.
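The online-processing condition can be checked against the totals reported in the processing results table. The sketch below assumes a data set interval of 24 hours, since each day was analyzed individually, and uses hypothetical variable names.

```python
# Total processing time per day (hours:minutes), from the results table.
totals = {
    "Sunday (11)": "04:15",
    "Monday (12)": "18:09",
    "Tuesday (13)": "12:32",
    "Wednesday (14)": "06:31",
    "Thursday (15)": "10:44",
    "Friday (16)": "08:47",
    "Saturday (17)": "08:28",
}


def to_minutes(hhmm):
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


minutes_per_day = {day: to_minutes(value) for day, value in totals.items()}

# Online condition: each day's data is processed in under 24 hours.
runs_online = all(m < 24 * 60 for m in minutes_per_day.values())

# Average daily processing time, in minutes.
mean_minutes = sum(minutes_per_day.values()) / len(minutes_per_day)
```

Every day stays below the 24-hour interval, with Monday as the slowest at 18 hours and 9 minutes, and the mean is just under 10 hours.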

Each day, our system builds aggregate clustering information out of millions of raw events collected at the machine level. In particular, we detected an average of 7,882 clusters and 3,306 groups per day. The table below shows one example, a group of machines infected by a newly discovered bot that disguises itself by injecting code into the standard Windows® application, ping.exe. This cluster, which we labeled “R/sVv4VmLE8Z5Mdzc9Y2xrPTQuNyZia” (i.e., a request-based cluster), was identified as a consequence of infected machines that communicated with a C&C server using a similar URL request.



Ping.exe Malware Details

| URL Host | URL Request | Host Machine | Country | Process Name |
|---|---|---|---|---|
| http://83.133.124.191 | HVM2wppE5M7mnPC7Y2xrPTQuOCZ | M1 | United States | ping.exe |
| http://83.133.124.191 | vA21k6yD7N5XAKC8Y2xrPTQuOCZ | M2 | United States | ping.exe |
| http://46.249.59.47 | 4zk3oUup7K7xjOS0Y2xrPTQuNyZ | M3 | United States | ping.exe |
| http://63.223.106.17 | rK61TBkp5a3mCgS4Y2xrPTQuNyZ | M4 | Australia | ping.exe |

The following shows two sample SPuNge detections via the correlation framework introduced in the Analysis Framework section. By looking for machines that exhibited similar behaviors and correlating their information (i.e., the industries and countries wherein they operate), we revealed victims of potential targeted attacks.



Cluster 7543 – H 146.185.246.116 >: R/p98a.exe >: R/dd.exe
http://146.185.246.111/p98a.exe NETWORK 1 notepad.exe 2012-11-13 09:50:35
http://146.185.246.116/p18a.exe NETWORK 1 notepad.exe 2012-11-13 09:50:37
[ . . . ]
http://146.185.246.121/mailsa.exe NETWORK 1 notepad.exe 2012-11-13 09:50:24
http://146.185.246.101/lmqa.exe NETWORK 1 notepad.exe 2012-11-13 09:50:26
http://146.185.246.63/dd.exe NETWORK 2 svchost.exe 2012-11-13 11:45:27
http://146.185.246.63/dd.exe NETWORK 3 svchost.exe 2012-11-13 20:58:55
http://146.185.246.104/dqs.ex NETWORK 1 notepad.exe 2012-11-13 09:47:36
NETWORK 1 Technology Mexico Windows 5.1
NETWORK 2 Technology Turkey Windows 5.1
NETWORK 3 Technology Morocco Windows 5.1

Listing 1: Russian Business Network (RBN) sample—technology industry

In Listing 1, we had three distinct class B networks that all belonged to companies that operated in the technology industry. These were located in three separate and remote locations—Mexico, Turkey, and Morocco. These events were clustered together because they shared similar hostnames (e.g., IP addresses in a contiguous space) and paths that consist of binary files with short names. The malware was used to take control of infected machines via a backdoor. It injects code into the memory space of legitimate Windows programs (i.e., notepad.exe and svchost.exe) to avoid easy detection. All of the IP addresses the machine accessed belonged to the same netblock in Russia, which was registered with information that belonged to threat actors. This netblock had a history of maliciousness and was associated with the RBN, which provides support and other customized services and malware to targeted attack operations.10

The next example (Listing 2) shows two networks, both identified as Malaysian companies in the oil and gas industry. On November 14, the machines located in these two networks reached out to two clusters of C&C servers using a process called “r18nwn.exe.” Our system grouped these clusters together because they were accessed exclusively by the same victims and were part of the same attack.

10 Jeffrey Carr. (January 15, 2013). Digital Dao: Evolving Hostilities in the Global Cyber Commons. “RBN Connection to Kaspersky’s Red October Espionage Network.” Last accessed June 26, 2013, http://jeffreycarr.blogspot.ca/2013/01/rbn-connection-to-kasperskys-red.html; Symantec. “Anatomy of a Data Breach: Why Breaches Happen and What to Do About It.” Last accessed June 26, 2013, http://eval.symantec.com/mktginfo/enterprise/white_papers/b-anatomy_of_a_data_breach_WP_20049424-1.en-us.pdf.


Group 1245, 2 Clusters, 2 Networks
Cluster 1725, Label: R/list.php?c=140C3[ . . . ] =: H w.nucleardiscover.com:888
E1: http://w.nucleardiscover.com:888/list.php?c=140C34E31DAB3B9746[ . . . ]&t=0.689831&v=2
E2: http://w.nucleardiscover.com:888/list.php?c=D8C08B5CD1670FA396[ . . . ]&v=1&t=0.9288141
Cluster 1932, Label: R/ggggr.jpg?t=0.1424164
E1: http://61.147.99.179:81/ggggr.jpg?t=0.1424164
E2: http://ru.letmedo.net:2011/myck.jpg?t=0.3245672
NETWORK1: Oil and Gas Malaysia Windows 5.1 r18nwn.exe (HASHHERE) 2012-11-14
NETWORK2: Oil and Gas Malaysia Windows 5.1 r18nwn.exe (HASHHERE) 2012-11-14

Listing 2: Sample cluster group—oil and gas industry

McAfee also saw this attack localized to the South Asia region and referred to it as “one that involved a malware that ‘spread by transmission to a removable medium such as a removable disk, a writable CD, or a USB drive.’”11 This is a common methodology followed by malware authors who target networks that may not be readily connected to the Internet, for instance, those owned by companies operating in industrial environments. The malware waits for commands from an attacker as opposed to carrying out automated activity. This is very uncommon in widespread attacks but is part of the standard modus operandi of targeted attack groups, especially those based in China. We used the historical data provided by the DomainTools service to verify that the domain was originally registered to a person located in China.12 His name is also linked to several other malicious domains employed in targeted attack operations.

11 McAfee, Inc. (February 21, 2012). “W32/Virut.gen!1ED0DD2F830C.” Last accessed June 26, 2013, http://www.mcafee.com/threat-intelligence/malware/default.aspx?id=854659.

12 DomainTools, LLC. DomainTools. Last accessed June 26, 2013, http://www.domaintools.com/.


Ethical Considerations

Experiments involving actual data collected from customers may be considered an ethically sensitive issue. One clear question that arises is whether it is ethically acceptable and justifiable to conduct experiments that involve actual data of users. Similar to the experiments conducted by Jakobsson, et al. and previous work in the field, we believe that such experiments are the more effective method to estimate the actual success rates of detection systems.13 In the experiments we described in this paper, we considered the users’ privacy and the sensitivity of the data that was collected. Identifiers (e.g., IP addresses of their machines) were anonymized and any information that could reveal the identities of users was removed. Finally, since the experiments were carried out in Europe, all such experiments were performed in compliance with privacy regulations.

Related Works

Thonnard, et al. provided an analysis of APT campaigns focusing on studying the characteristics of a known set of targeted attacks delivered via email attachments and evaluated the prevalence and sophistication levels of such attacks by analyzing the malicious attachments used.14 Another paper focusing on APT campaign analysis used graphs to give a broad view of the campaigns rather than individually analyzing them.15 By associating attacks with shared targets, it is possible to build a map of APT activities and identify clusters that could represent common activities from a single team. Liu, et al. showed how to determine the number of most likely victims of a known targeted attack by improving detection rates and reducing the number of false positives with regard to N-gram based approaches.16 Despite providing readers with insights on APTs, however, the above-mentioned works focus on analyzing already-identified campaigns while our work presents a new methodology that leverages clustering techniques to identify and single out possible attacks that have yet to be discovered.

13 M. Jakobsson, P. Finn, and N. Johnson. “Why and How to Perform Fraud Experiments.” Last accessed June 26, 2013, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4489852&sortType%3Dasc_p_Sequence%26filter%3DAND(p_IS_Number%3A4489835); Markus Jakobsson and Jacob Ratkiewicz. “Designing Ethical Phishing Experiments: A Study of (ROT13) rOnl Query Features.” Last accessed June 26, 2013, http://wwwconference.org/www2006/programme/files/pdf/3533.pdf.

14 Olivier Thonnard, Leyla Bilge, Gavin O’Gorman, Seán Kiernan, and Martin Lee. (2012). “Industrial Espionage and Targeted Attacks: Understanding the Characteristics of an Escalating Threat.” Last accessed June 26, 2013, http://link.springer.com/chapter/10.1007%2F978-3-642-33338-5_4.

15 M. Lee and D. Lewis. “Clustering Disparate Attacks: Mapping the Activities of the Advanced Persistent Threat.” Last accessed June 26, 2013, http://www.academia.edu/2352875/CLUSTERING_DISPARATE_ATTACKS_MAPPING_THE_ACTIVITIES_OF_THE_ADVANCED_PERSISTENT_THREAT.

16 Shun-Te Liu, Yi-Ming Chen, and Hui-Ching Hung. (2012). “N-Victims: An Approach to Determine N-Victims for APT Investigations.” Last accessed June 26, 2013, http://link.springer.com/chapter/10.1007%2F978-3-642-35416-8_16.


Clustering has been widely used in data mining as a technique to find common patterns and similarities among huge sets of data. Recently, however, security researchers have been applying it as a way to perform traffic aggregation and classification in order to identify botnets and other threats. Perdisci, et al. proposed a clustering-based approach to analyzing HTTP-based malware traffic in order to improve signature generation. They also introduced a solution to identify botnets by analyzing network traffic that requires no prior knowledge (i.e., no previous assumptions on the botnets’ network behavior) in another paper. They were able to successfully identify Internet Relay Chat (IRC)-, HTTP-, and peer-to-peer (P2P)-based botnets. In yet another paper, they introduced a novel approach to identify DGA-based malware domains. In lieu of reverse-engineering DGA algorithms to identify the domains botnets used, they proposed the use of clustering to analyze NXDomain traffic under the assumption that most DGA-generated Domain Name System (DNS) queries would result in NXDomain responses. Finally, Yen, et al. offered a solution to identify candidates of botnet-infected machines by analyzing aggregation patterns of network traffic. Our case, meanwhile, relied on network traffic that has already been classified as “malicious.” On a different note, Bayer, et al. exploited clustering techniques to perform behavior-based malware classification.17 Unlike network clustering, their work clustered malware samples based on host-based features like system calls and memory objects.

The works described above did not explicitly address the problem of identifying potential targeted attacks via malicious traffic aggregation and machine information correlation. We believe that the opportunity to use this methodology in APT detection is big and, to the best of our knowledge, still unexplored.

Conclusion

We introduced a novel system we call “SPuNge” to process threat information collected from actual users to detect potential targeted attacks for further investigation. We used a combination of clustering techniques to identify groups of machines and networks, which are possibly involved in the same attack. In addition, we showed how we correlated victim industry and country information to reduce the millions of normal malicious events down to a more manageable amount for further in-depth analysis. We evaluated our system against one week of Trend Micro data collected from over 20 million user installations worldwide. Our results show that our approach works well in practice and is helpful in assisting security analysts in cybercrime investigations.

17 Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. “Scalable, Behavior-Based Malware Clustering.” Last accessed June 26, 2013, http://www.cs.ucsb.edu/~chris/research/doc/ndss09_cluster.pdf.


References

• A.K. Jain, M.N. Murty, and P.J. Flynn. (September 1999). “Data Clustering: A Review.” Last accessed June 25, 2013, http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf.

• David Sancho, Jessa dela Torre, Matsukawa Bakuei, Nart Villeneuve, and Robert McArdle. (2012). “IXESHE: An APT Campaign.” Last accessed June 24, 2013, http://www.trendmicro.com/cloud-content/us/pdfs/security-intelligence/white-papers/wp_ixeshe.pdf.

• DomainTools, LLC. DomainTools. Last accessed June 26, 2013, http://www.domaintools.com/.

• Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. “BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection.” Last accessed June 25, 2013, https://www.damballa.com/downloads/a_pubs/Usenix08.pdf.

• Jeffrey Carr. (January 15, 2013). Digital Dao: Evolving Hostilities in the Global Cyber Commons. “RBN Connection to Kaspersky’s Red October Espionage Network.” Last accessed June 26, 2013, http://jeffreycarr.blogspot.ca/2013/01/rbn-connection-to-kasperskys-red.html.

• Joey Hernandez. “Typo Squatting: The Threat Network Defense Teams Overlook.” Last accessed June 25, 2013, http://www.academia.edu/818541/TypoSquatting_-_Malicious_Domains_Malware_Domains_Part_Of_The_Miasmatic_Threat_.

• Lyle Frink. (March 23, 2012). Avast! Blog. “Misspelling Goes Criminal with Typosquatting.” Last accessed June 25, 2013, https://blog.avast.com/2012/03/23/misspelling-goes-criminal-with-typosquatting/.

• Manos Antonakakis. (August 10, 2012). “From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware.” Last accessed June 25, 2013, http://www.usenix.org/sites/default/files/conference/protected-files/antonakakis_sec12_slides.pdf.

• Markus Jakobsson and Jacob Ratkiewicz. “Designing Ethical Phishing Experiments: A Study of (ROT13) rOnl Query Features.” Last accessed June 26, 2013, http://wwwconference.org/www2006/programme/files/pdf/3533.pdf.

• McAfee, Inc. (February 21, 2012). “W32/Virut.gen!1ED0DD2F830C.” Last accessed June 26, 2013, http://www.mcafee.com/threat-intelligence/malware/default.aspx?id=854659.
























• M. Jakobsson, P. Finn, and N. Johnson. “Why and How to Perform Fraud Experiments.” Last accessed June 26, 2013, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4489852&sortType%3Dasc_p_Sequence%26filter%3DAND(p_IS_Number%3A4489835).

• M. Lee and D. Lewis. “Clustering Disparate Attacks: Mapping the Activities of the Advanced Persistent Threat.” Last accessed June 26, 2013, http://www.academia.edu/2352875/CLUSTERING_DISPARATE_ATTACKS_MAPPING_THE_ACTIVITIES_OF_THE_ADVANCED_PERSISTENT_THREAT.

• Network Working Group. (January 2005). “Uniform Resource Identifier (URI): Generic Syntax.” Last accessed June 25, 2013, http://www.ietf.org/rfc/rfc3986.txt.

• Olivier Thonnard, Leyla Bilge, Gavin O’Gorman, Seán Kiernan, and Martin Lee. (2012). “Industrial Espionage and Targeted Attacks: Understanding the Characteristics of an Escalating Threat.” Last accessed June 26, 2013, http://link.springer.com/chapter/10.1007%2F978-3-642-33338-5_4.

• Python. “Parallel Processing and Multiprocessing in Python.” Last accessed June 26, 2013, http://wiki.python.org/moin/ParallelProcessing.

• Python Software Foundation. (June 25, 2013). “11.1. pickle—Python Object Serialization.” Last accessed June 26, 2013, http://docs.python.org/2/library/pickle.html#module-cPickle.

• Roberto Perdisci, Wenke Lee, and Nick Feamster. “Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces.” Last accessed June 25, 2013, https://215a1886-a-62cb3a1a-s-sites.googlegroups.com/site/robertoperdisci/publications/publication-files/nsdi10-final190.pdf?attachauth=ANoY7coAd9uKqooy2xG8HK_2DIFSSsu1-zZAqqSbogbSF6RYYhZNCGEm3x8mfshPEF_V4AsSQpTgrbdH_ROECsCUpWEUAg01-gFv_qXp0_TliEKOTwDRhho_zYHYHJP_aMF8KNqx2p8JwszBE2MU28iNe3O-b8iFsGXj9JtG1V73AY-YTjuZ47IiYDQYSebhlDmC7KwkLhsqeLq039KKegGEmX_ZTNPzvJtiE8lsBHaSJbDCV4IrQ41TJTbqcZQtAtMO1ut6H3Z-hS8ITt5qaX72nZWRCwNPZQ%3D%3D&attredirects=0.

• Shun-Te Liu, Yi-Ming Chen, and Hui-Ching Hung. (2012). “N-Victims: An Approach to Determine N-Victims for APT Investigations.” Last accessed June 26, 2013, http://link.springer.com/chapter/10.1007%2F978-3-642-35416-8_16.

























• Softpedia. (January 21, 2013). “AlienVault and Kaspersky Help Organizations Neutralize Red October Attack.” Last accessed June 24, 2013, http://news.softpedia.com/news/AlienVault-and-Kaspersky-Help-Organizations-Neutralize-Red-October-Attack-322919.shtml.

• Symantec. “Anatomy of a Data Breach: Why Breaches Happen and What to Do About It.” Last accessed June 26, 2013, http://eval.symantec.com/mktginfo/enterprise/white_papers/b-anatomy_of_a_data_breach_WP_20049424-1.en-us.pdf.

• Ting-Fang Yen and Michael K. Reiter. (2008). “Traffic Aggregation for Malware Detection.” Last accessed June 25, 2013, http://www.rsa.com/rsalabs/staff/bios/tfyen/publications/DIMVA08-TAMD.pdf.

• Trend Micro Incorporated. (2013). “Smart Protection Network—Data Mining Framework.” Last accessed June 26, 2013, http://cloudsecurity.trendmicro.com/us/technology-innovation/our-technology/smart-protection-network/index.html.

• Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. “Scalable, Behavior-Based Malware Clustering.” Last accessed June 26, 2013, http://www.cs.ucsb.edu/~chris/research/doc/ndss09_cluster.pdf.











Trend Micro Incorporated, a global leader in security software, strives to make the world safe for exchanging digital information. Our innovative solutions for consumers, businesses and governments provide layered content security to protect information on mobile devices, endpoints, gateways, servers and the cloud. All of our solutions are powered by cloud-based global threat intelligence, the Trend Micro™ Smart Protection Network™, and are supported by over 1,200 threat experts around the globe. For more information, visit www.trendmicro.com.

©2013 by Trend Micro, Incorporated. All rights reserved. Trend Micro and the Trend Micro t-ball logo are trademarks or registered trademarks of Trend Micro, Incorporated. All other product or company names may be trademarks or registered trademarks of their owners.

10101 N. De Anza Blvd.
Cupertino, CA 95014

U.S. toll free: 1 +800.228.5651
Phone: 1 +408.257.1500
Fax: 1 +408.257.2003

