ML Approaches to P2P Botnet Detection


Page 1: ML Approaches to P2P Botnet Detection

BITS Pilani, Hyderabad Campus

ML Approaches to P2P Botnet Detection

by Vansh Khurana

Mentors: Dr. Chittaranjan Hota, Pratik Narang and Team

Page 2: ML Approaches to P2P Botnet Detection

CS C441 / CS F441 Second Semester 2013-14 BITS Pilani, Hyderabad Campus

Today’s Agenda

• Introduction to P2P
• Analyzing available Dataset
• Deep Dive into P2P Botnets
• Classification Algorithms
• System Design
• Prelim Results with Extensive Feature Set
• Curse of Dimensionality
• Ensemble Classifier
• Future Work

Page 3: ML Approaches to P2P Botnet Detection

Introduction

• P2P Networks
• Botnets
• Malicious Activities

Page 4: ML Approaches to P2P Botnet Detection

P2P Networks

• Decentralized and distributed network architecture
• Peers act as both suppliers and consumers of resources
• Better resource utilization
• No central coordination
• Applications:
  • Instant messaging systems: Skype
  • Digital currency: Bitcoin
  • Wireless community networks: Netsukuku
  • Foreign currency exchange marketplace: CurrencyFair
  • Content delivery: torrent applications
  • File sharing: Gnutella, DC++
  • And many more…

Page 6: ML Approaches to P2P Botnet Detection

Botnets

• A network of compromised machines (bots) controlled by a bot master
• Typical formation (spam example):
  • A botnet operator sends out viruses or worms whose payload is a malicious application, the bot, infecting ordinary users' computers.
  • The bot on the infected PC logs into a particular C&C (command and control) server.
  • A spammer purchases the services of the botnet from the operator.
  • The spammer provides the spam messages to the operator, who instructs the compromised machines via the control panel on the web server, causing them to send out spam messages.
• Should we care? Absolutely!
  • Privacy invasion: hacked accounts, weak passwords, credential reuse, social engineering attacks
  • Financial theft: 10 days of Torpig data valued at $83K to $8.3M (2009)

Page 8: ML Approaches to P2P Botnet Detection

Malicious Activities

• Responsible for (non-exhaustive list):
  • Large-scale network probing (i.e., scanning activities)
  • Launching Distributed Denial of Service (DDoS) attacks
  • Sending large-scale unsolicited emails (spam)
  • Click-fraud campaigns
  • Information theft
  • Spyware
  • Adware
• Shift from a for-fun activity towards a profit-oriented business

Page 10: ML Approaches to P2P Botnet Detection

Analyzing available Dataset

• Obtained from University of Georgia: Babak Rahbarinia et al.
• Benign applications: eMule, FrostWire, uTorrent, Vuze
  • ~130 GB of raw log files
• Malicious botnet data:
  • Storm, ~9 GB
  • Waledac, ~1 GB
  • Zeus, ~100 MB
  • Nugache, ~100 MB
• Data for botnets contain only C&C messages
• We build training models to represent real-world behaviour: ~80:20

More on Botnets next!

Page 11: ML Approaches to P2P Botnet Detection

Deep Dive into P2P Botnets

• P2P Botnets: Lifecycle
• Botnets of Interest
  • Storm
  • Waledac
  • Zeus
  • Nugache

Page 12: ML Approaches to P2P Botnet Detection

P2P Botnets: Lifecycle

• Initial infection: exploit a known vulnerability to grant the attacker additional capabilities on the target system
• Secondary injection: leverage the newly acquired access to execute additional scripts or programs, which then fetch a malicious binary from a known location
• Connection: the bot attempts to establish a connection to the command and control server through a variety of methods
• Malicious command and control: doing the damage
• Update and maintenance: bots are commanded to update their binaries, typically to defend against new attacks or to improve their functionality

Page 14: ML Approaches to P2P Botnet Detection

Storm

• Email spam botnet (2007)
• Extent of damage: 1 million to 50 million computer systems
• Methodology:
  • Observed to defend itself by attacking computer systems that scanned for Storm-infected machines online
  • DDoS counter-attacks to maintain its own internal integrity
  • Fools the antivirus on the local system: actual processes do nothing
• Mostly uses UDP as the underlying transport layer protocol

Page 16: ML Approaches to P2P Botnet Detection

Waledac

• Email spam botnet (2010)
• Extent: ~70K to 90K
• Infection method:
  • Email
  • Fake websites
  • Bundled with other threats, such as Trojan.Peacomm, W32.Downadup, and Trojan.Bredolab
• Typically sends a log file every 30 minutes
• Mostly operates over TCP
• More here: http://www.symantec.com/security_response/writeup.jsp?docid=2008-122308-1429-99

Page 18: ML Approaches to P2P Botnet Detection

Zeus

• Often used to steal banking information by man-in-the-browser keystroke logging, form grabbing, and web page injection (2007)
• Spread mainly through drive-by downloads and phishing schemes
• Extent: 3.6 million in the US alone; operates very stealthily
• Uses both TCP and UDP, so the flow concept fails

Page 20: ML Approaches to P2P Botnet Detection

Nugache

• One of the first sophisticated P2P botnets (2006)
• Created by Jason Michael Milmont, when he was 16!
• TCP port 8 bot: listens on port 8
• Not used: the log files don’t represent the theoretical information
• More here: http://www.symantec.com/security_response/writeup.jsp?docid=2006-043016-0900-99&tabid=2

Page 21: ML Approaches to P2P Botnet Detection

DEMO 1: Botnet Data Analysis

Page 23: ML Approaches to P2P Botnet Detection

Classification Algorithms

• Decision Trees:
  • Based on information gain
  • Fast, ignores irrelevant features
  • But can overfit. We use REP Trees and set a maximum depth to avoid this
• K-Nearest Neighbour:
  • Inherently simple, doesn’t overfit
• Artificial Neural Networks:
  • A large number of features can be handled well
  • Hidden layers set heuristically: (No. of Features + No. of Class Labels) / 2
• SVM:
  • Performs complex kernel-based data transformations, then finds an optimal boundary between the possible outputs based on these transformations
  • Pairwise classification approach (one-versus-one)
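The information-gain criterion and the hidden-layer heuristic on this slide can be sketched in a few lines. This is only an illustration on made-up toy flows: the function names, the two categorical features and the labels are assumptions, and the heuristic is read here as giving the number of hidden units, which the slide does not state explicitly.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on one categorical feature."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


def hidden_units(n_features, n_classes):
    """Slide heuristic (features + class labels) / 2, rounded down (assumed reading)."""
    return (n_features + n_classes) // 2


# Toy, made-up flows: (transport protocol, payload size bucket)
rows = [("udp", "small"), ("udp", "small"), ("tcp", "large"), ("tcp", "small")]
labels = ["bot", "bot", "benign", "benign"]

gain_protocol = information_gain(rows, labels, 0)  # protocol separates the classes perfectly
gain_payload = information_gain(rows, labels, 1)   # payload bucket separates them only partly
```

A decision tree built on this criterion would split on the protocol feature first, because it yields the larger gain on this toy data.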

Page 25: ML Approaches to P2P Botnet Detection

System Design

• Salient Features
• Background
• Modules

Page 26: ML Approaches to P2P Botnet Detection

Salient Features

• No reliance on encryption and Deep Packet Inspection
• Intuitive and simplistic model to solve a complex problem
• Model bot behaviour accurately
• Explored feature set extensively (~75 features)
• Explored non-network-based features
  • Compression ratio
  • Signal processing approach to model network behaviour
  • More on this later…
• Most importantly: achieved good results

Page 28: ML Approaches to P2P Botnet Detection

Background

• Shannon’s Source Coding Theorem
  • The expected length L of an encoding of X with associated probability function p(x) is bounded below by the entropy of X:
    $$L \ge H(X) = -\sum_{x} p(x) \log_2 p(x)$$
• Bot data expected to be more uniform, hence should give more compression
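The compression intuition can be tried out with a general-purpose compressor. The sketch below is an assumption-laden illustration, not the project's feature extractor: it uses `zlib` from the Python standard library, defines the ratio as compressed size over original size, and the two byte strings are made up to stand in for regular and irregular traffic.

```python
import os
import zlib


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size; lower means more compressible."""
    if not payload:
        return 1.0
    return len(zlib.compress(payload, 9)) / len(payload)


# Made-up, highly regular message stream versus a high-entropy stand-in
repetitive = b"PING 10.0.0.1 KEEPALIVE\n" * 200
random_like = os.urandom(len(repetitive))

ratio_repetitive = compression_ratio(repetitive)
ratio_random = compression_ratio(random_like)
```

Under this definition, uniform traffic gives a ratio close to 0, while data with no exploitable regularity stays close to 1.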

Page 29: ML Approaches to P2P Botnet Detection

Background

• Discrete Fourier Transform:
  • Converts a finite list of equally spaced samples of a function into the list of coefficients of a finite combination of complex sinusoids, ordered by their frequencies
  • Time domain to frequency domain, to extract hidden patterns in botnet communication
• The network communication between a pair of nodes is treated as a 'signal'.
• Given a time sequence X = X(0), X(1), …, X(w), its Discrete Fourier Transform (DFT) is given, in the standard unnormalized form, as:
  $$F(k) = \sum_{n=0}^{w} X(n)\, e^{-2\pi i k n/(w+1)}, \quad k = 0, 1, \dots, w$$
• The first few DFT coefficients contain most of the energy
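To make the energy claim concrete, here is a naive DFT in plain Python applied to a made-up "signal" of payload sizes. The signal (a constant level plus one slow periodic component) and all names are assumptions for illustration; a real pipeline would more likely use an FFT library.

```python
import cmath
import math


def dft(samples):
    """Naive O(N^2) DFT: F(k) = sum over n of x(n) * exp(-2*pi*i*k*n / N)."""
    n_samples = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / n_samples) for n, x in enumerate(samples))
        for k in range(n_samples)
    ]


# Made-up payload sizes: a constant level plus one slow periodic component
signal = [100 + 20 * math.cos(2 * math.pi * n / 16) for n in range(16)]

magnitudes = [abs(c) for c in dft(signal)]
energy = [m * m for m in magnitudes]

# Fraction of the total spectral energy held by the first two coefficients
share_first_two = (energy[0] + energy[1]) / sum(energy)
```

For this regular toy signal almost all of the energy sits in the first two coefficients, which is the property the DFT-based features rely on.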

Page 31: ML Approaches to P2P Botnet Detection

Modules

Page 33: ML Approaches to P2P Botnet Detection

Prelim Results (ACM DEBS)

• Dataset Description
• Used Extensive Feature Set

Page 34: ML Approaches to P2P Botnet Detection

Prelim Results (ACM DEBS)

Overall Precision and Recall

Page 35: ML Approaches to P2P Botnet Detection

DEMO 2: System Design & Prelim Results

Page 37: ML Approaches to P2P Botnet Detection

Curse of Dimensionality

• Principal Component Analysis
• Feature Selection Algorithms

Page 38: ML Approaches to P2P Botnet Detection

Principal Component Analysis

• Statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
• Converted 76 features to 28 features, retaining 95% of the variance
• Classifier accuracy:
  • J-48: 97%
  • REP Tree: 94.17%
  • SVM: 80.61%
  • K-NN: 93.29%
  • Bayesian Networks: 94.65%
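The "retain 95% of the variance" rule comes down to keeping the smallest number of leading components whose eigenvalues add up to the target share. A minimal sketch of that selection step follows; the eigenvalues are made up and the function name is hypothetical, so it does not reproduce the 76-to-28 reduction reported above.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose variance share reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)


# Made-up eigenvalues of a covariance matrix (each one is a component's variance)
eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.3, 0.1]
k = components_for_variance(eigenvalues)
```

With these toy values the first three components cover 90% of the variance and the fourth takes the total to 96%, so four components are kept.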

Page 40: ML Approaches to P2P Botnet Detection

Feature Selection

• Best First Search based feature selection
• Also explored random-forest-based importance evaluation
• Selected features are:
  • Flow duration, MEDIAN_INTERARRIVAL_TIME, AVG_PAYLOAD_SIZE, AVG_PAYLOAD_SIZE_SENDING, AVG_PAYLOAD_SIZE_RECVING, PRIME_WAVE_MAGNITUDE_PAYLOAD, PRIME_WAVE_MAGNITUDE_IAT, BYTES_SENT_PER_SEC, BYTES_RECVD_PER_SEC, DFT_Payload (1st and 2nd coefficients), DFT_IAT (1st and 2nd coefficients), Compression
• Classifier accuracy:
  • J-48: 99.7%
  • REP Tree: 99.58%
  • SVM: 84%
  • KNN: 99.5%
  • ANN: 92%
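Several of the selected features are simple per-flow statistics. The sketch below shows one plausible way to compute a few of them from a list of packets. The packet tuple layout, the direction labels and the exact definitions are assumptions, since the slides do not define the features.

```python
from statistics import mean, median


def flow_features(packets):
    """packets: list of (timestamp_seconds, payload_bytes, direction) tuples,
    with direction "out" or "in". Returns a dict of per-flow statistics."""
    packets = sorted(packets)  # order by timestamp
    times = [t for t, _, _ in packets]
    sizes = [size for _, size, _ in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    duration = times[-1] - times[0]
    sent = sum(size for _, size, d in packets if d == "out")
    received = sum(size for _, size, d in packets if d == "in")
    return {
        "flow_duration": duration,
        "median_interarrival_time": median(gaps) if gaps else 0.0,
        "avg_payload_size": mean(sizes),
        "bytes_sent_per_sec": sent / duration if duration > 0 else 0.0,
        "bytes_recvd_per_sec": received / duration if duration > 0 else 0.0,
    }


# Made-up flow of five packets
example = [(0.0, 120, "out"), (1.0, 80, "in"), (2.0, 120, "out"),
           (4.0, 80, "in"), (8.0, 100, "out")]
features = flow_features(example)
```

None of these statistics needs payload contents, only sizes, timestamps and direction.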

Page 41: ML Approaches to P2P Botnet Detection

DEMO 3: PCA and Feature Selection

Page 43: ML Approaches to P2P Botnet Detection

Ensemble Classifier

• Construct a set of classifiers from the training data
• Predict class label of previously unseen records by aggregating predictions made by multiple classifiers

Page 44: ML Approaches to P2P Botnet Detection

Ensemble Classifier

• Ensemble methods work better with ‘unstable classifiers’
• Classifiers that are sensitive to minor perturbations in the training set
• Examples:
  – Decision trees
  – Rule-based
  – Artificial neural networks

Page 45: ML Approaches to P2P Botnet Detection

Ensemble Classifier

• One way to force a learning algorithm to construct multiple hypotheses is to run the algorithm several times and provide it with somewhat different data in each run. This idea is used in the following methods:
  • Majority Voting
  • Bagging
  • Randomness Injection
  • Feature-Selection Ensembles
  • Error-Correcting Output Coding

Page 46: ML Approaches to P2P Botnet Detection

Why does Majority Voting work?

• Suppose there are 25 base classifiers
  – Each classifier has error rate ε = 0.35
  – Assume errors made by classifiers are uncorrelated
  – Probability that the ensemble classifier makes a wrong prediction:
    $$P(X \ge 13) = \sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} = 0.06$$
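The 0.06 figure can be checked directly from the binomial sum. A short sketch, assuming independent errors as on the slide (the function name is hypothetical):

```python
from math import comb


def ensemble_error(n_classifiers, error_rate):
    """Probability that a strict majority of independent classifiers is wrong."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, i) * error_rate**i * (1 - error_rate) ** (n_classifiers - i)
        for i in range(majority, n_classifiers + 1)
    )


# 25 base classifiers, each wrong 35% of the time: the ensemble is wrong
# only when 13 or more of them are wrong at once
p_wrong = ensemble_error(25, 0.35)
```

The ensemble error comes out far below the 0.35 of any single classifier, but only under the independence assumption. Correlated errors shrink the benefit.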

Page 47: ML Approaches to P2P Botnet Detection

Bagging

• Employs the simplest way of combining predictions that belong to the same type
• Combining can be realized with voting or averaging
• Each model receives equal weight
• “Idealized” version of bagging:
  – Sample several training sets of size n (instead of just having one training set of size n)
  – Build a classifier for each training set
  – Combine the classifiers’ predictions
• This improves performance in almost all cases if the learning scheme is unstable (e.g. decision trees)

Page 48: ML Approaches to P2P Botnet Detection

Bagging classifiers

Classifier generation
  Let n be the size of the training set.
  For each of t iterations:
    Sample n instances with replacement from the training set.
    Apply the learning algorithm to the sample.
    Store the resulting classifier.

Classification
  For each of the t classifiers:
    Predict class of instance using classifier.
  Return class that was predicted most often.
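The two phases above translate almost line for line into code. The sketch below is an illustration only: the base learner is a hypothetical one-feature threshold rule (a decision stump) standing in for whatever classifier is being bagged, and the data set is made up.

```python
import random
from collections import Counter


def train_stump(sample):
    """Base learner: a one-feature threshold rule fitted to (x, label) pairs."""
    best = None
    for threshold, _ in sample:
        for low, high in (("benign", "bot"), ("bot", "benign")):
            errors = sum((low if x <= threshold else high) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x <= threshold else high


def bagging_fit(training_set, t, rng):
    """Train t classifiers, each on n instances drawn with replacement."""
    n = len(training_set)
    return [train_stump([rng.choice(training_set) for _ in range(n)]) for _ in range(t)]


def bagging_predict(classifiers, x):
    """Return the class predicted most often by the stored classifiers."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]


# Made-up one-dimensional data: small values benign, large values bot
data = [(v, "benign") for v in range(0, 10)] + [(v, "bot") for v in range(20, 30)]
ensemble = bagging_fit(data, t=11, rng=random.Random(0))
```

An odd t avoids ties in a two-class vote. Calling `bagging_predict(ensemble, x)` then classifies a new value by majority vote.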

Page 49: ML Approaches to P2P Botnet Detection

Why does bagging work?

• Bagging reduces variance by voting/averaging, thus reducing the overall expected error
  – In the case of classification there are pathological situations where the overall error might increase
  – Usually, the more classifiers the better

Page 50: ML Approaches to P2P Botnet Detection

Stacking

• Uses a meta learner instead of voting to combine predictions of base learners
  – Predictions of base learners (level-0 models) are used as input for the meta learner (level-1 model)
• Base learners are usually different learning schemes
• Hard to analyze theoretically: “black magic”

Page 52: ML Approaches to P2P Botnet Detection

Future Work

• Genetic Search and Greedy Classification (June)
• Suggested improvements and proposals:
  • Clustering to scale up
  • Binning to compute compression ratio
  • Smoothing the DFT curve
  • Explore parson’s coding theory
• Repo link (Stay Tuned!)
  • https://github.com/vansh21k/P2P-Botnet-Detection-Project

Page 54: ML Approaches to P2P Botnet Detection

References

[1] H. Hang, X. Wei, M. Faloutsos, and T. Eliassi-Rad. Entelecheia: Detecting p2p botnets in their waiting stage. In IFIP Networking Conference, 2013, pages 1-9, 2013.
[2] J. Kang and J.-Y. Zhang. Application entropy theory to detect new peer-to-peer botnet with multi-chart cusum. In Electronic Commerce and Security, 2009.
[3] C. Kanich, N. Weaver, D. McCoy, T. Halvorson, C. Kreibich, K. Levchenko, V. Paxson, G. M. Voelker, and S. Savage. Show me the money: Characterizing spam-advertised revenue. In USENIX Security Symposium, pages 15
[4] B. Rahbarinia, R. Perdisci, A. Lanzi, and K. Li. Peerrush: Mining for unwanted p2p traffic. In Detection of Intrusions and Malware, and Vulnerability Assessment, pages 62-82. Springer, 2013.
[5] C. Rossow, D. Andriesse, T. Werner, B. Stone-Gross, D. Plohmann, C. J. Dietrich, and H. Bos. Sok: P2pwned-modeling and evaluating the resilience of peer-to-peer botnets. In Security and Privacy (SP), 2013 IEEE Symposium on, pages 97-111. IEEE, 2013.
[6] S. Saad, I. Traore, A. Ghorbani, B. Sayed, D. Zhao, W. Lu, J. Felix, and P. Hakimian. Detecting p2p botnets through network behavior analysis and machine learning. In Privacy, Security and Trust (PST), 2011 Ninth Annual International Conference on, pages 174-180. IEEE, 2011.
[7] R. Schoof and R. Koning. Detecting peer-to-peer botnets. University of Amsterdam, 2007. Technical report.
[8] F. Tegeler, X. Fu, G. Vigna, and C. Kruegel. Botfinder: Finding bots in network traffic without deep packet inspection. In Proceedings of the 8th international conference on Emerging networking experiments and technologies, pages 349-360. ACM, 2012.
[9] T.-F. Yen and M. K. Reiter. Are your hosts trading or plotting? Telling p2p file-sharing and bots apart. In Distributed Computing Systems (ICDCS), 2010 IEEE 30th International Conference on, pages 241-252. IEEE, 2010.

Page 55: ML Approaches to P2P Botnet Detection

References

[10] X. Yu, X. Dong, G. Yu, Y. Qin, D. Yue, and Y. Zhao. Online botnet detection based on incremental discrete fourier transform. Journal of Networks, 5(5), 2010.
[11] J. Zhang, R. Perdisci, W. Lee, X. Luo, and U. Sarfraz. Building a scalable system for stealthy p2p-botnet detection. Information Forensics and Security, IEEE Transactions on, 9(1):27-38, 2014.
[12] J. Zhang, R. Perdisci, W. Lee, U. Sarfraz, and X. Luo. Detecting stealthy p2p botnets using statistical traffic fingerprints. In Dependable Systems & Networks (DSN), 2011 IEEE/IFIP 41st International Conference on, pages 121-132. IEEE, 2011.
[13] S. Zhang. Conversation-based p2p botnet detection with decision fusion. Master's thesis, Fredericton: University of New Brunswick, 2013.