Presented by: Rag Mayur Chevuri Dept of Computer & Information Sciences University of Delaware
CISC 879 - Machine Learning for Solving Systems Problems
Presented by: Rag Mayur Chevuri, Dept of Computer & Information Sciences
University of Delaware
Behavioural Classification
Tony Lee and Jigar J Mody
Automatic malware classification
• Human analysis is inefficient and inadequate.
• Large number of new virus/spyware families
• Our focus: the classification problem
• Effective classification leads to:
  - Better detection
  - Better cleaning
  - Better analysis solutions
Classification Process
Objectives of classification methodologies
• Operates efficiently and automatically.
• Incurs minimal information loss.
• Knowledge is structured so it can be stored, analyzed and referenced efficiently.
Objectives of classification methodologies (contd..)
• Applies learned knowledge to automatically identify familiar patterns and similarity relations in a given target
• Adaptable and has innate learning abilities.
Approach
• Automated classification method based on:
  - runtime behavioral data
  - machine learning
• Represent a file by its runtime behavior
• Structure the event information
• Store it in a database
• Construct classifiers
• Apply the classifiers to new objects
A “good” knowledge representation
• Effectively captures knowledge of the object it represents
• Can persist in permanent storage
• Enables classifiers to efficiently and effectively correlate data across a large number of objects
Representing behavior
• The meaning of a particular action: its resulting state
• Construct the representation in a consistent canonical format
Vector Approach
• Process data in vector format using statistical and probabilistic algorithms
• Problems: vector size, scalability, and factorability
The Opaque Object Approach
• Objects represent data in rich syntax
• Rich semantic representation of the actual object
• Precise distance between objects used for Clustering
Events Representation
• Sequence of events
• Ordered according to:
  - time of occurrence of program actions
  - environment state transitions
Figure: event timeline from 00:00 to 00:04 showing Registry Query, File Write, Open Process, Network Listen, Registry Write, Allocate VM, Write VM, Terminate Process, Open Mutant, Create Mutant
Event Properties
• Event ID
• Event object (e.g. registry, file, process, socket, etc.)
• Event subject if applicable (i.e. the process that takes the action)
• Kernel function called if applicable
• Action parameters (e.g. registry value, file path, IP address)
• Status of the action (e.g. file handle created, registry removed, etc.)
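One way to picture an event with the properties listed on this slide is a small record type. This is a minimal sketch, not the authors' schema: the class name `Event`, the field names, and the sample values (process name, registry key, file path) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Event:
    """One observed runtime event; field names are illustrative only."""
    event_id: int                              # Event ID
    obj: str                                   # Event object: registry, file, process, socket
    subject: Optional[str] = None              # Process that takes the action, if applicable
    kernel_function: Optional[str] = None      # Kernel function called, if applicable
    parameters: Dict[str, str] = field(default_factory=dict)  # Action parameters
    status: Optional[str] = None               # Status of the action


# A hypothetical two-event trace, ordered by time of occurrence.
trace = [
    Event(1, "registry", subject="sample.exe", kernel_function="NtQueryValueKey",
          parameters={"key": r"HKLM\Software\Example"}, status="value read"),
    Event(2, "file", subject="sample.exe", kernel_function="NtWriteFile",
          parameters={"path": r"C:\temp\example.tmp"}, status="file handle created"),
]
```

Optional fields default to `None` to mirror the "if applicable" wording above.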
An example (Register Event)
Generate Classifier for Classification
Which classifier?
• Case-based classifier: treat the existing malware collection as a database of solutions.
• Learn by case-based reasoning (CBR)
• Nearest Neighbor algorithms.
• To make the CBR approach scalable, apply clustering.
Clustering
• Unsupervised learning
• Organize objects into clusters
• A cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Distance Measure
• Levenshtein Distance – “minimum cost required to transform one sequence of objects to another sequence by applying a set of operations.”
• Operation = Op(Event)
• Cost(Transformation) = Σ_i Cost(Operation_i)
• Cost of an operation depends on the operator as well as the operand
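The cost-weighted edit distance described above can be sketched with the standard dynamic-programming recurrence. This is a sketch under stated assumptions, not the presented implementation: the function name `edit_distance`, the `WEIGHT` table, and the rule that a replacement costs the larger of the two event weights are all hypothetical stand-ins for the operation cost matrix.

```python
from typing import Callable, Sequence


def edit_distance(
    a: Sequence[str],
    b: Sequence[str],
    ins_cost: Callable[[str], float],
    del_cost: Callable[[str], float],
    sub_cost: Callable[[str, str], float],
) -> float:
    """Minimum total cost of transforming sequence a into sequence b."""
    # prev[j] = cost of turning the processed prefix of a into b[:j]
    prev = [0.0]
    for y in b:
        prev.append(prev[-1] + ins_cost(y))
    for x in a:
        cur = [prev[0] + del_cost(x)]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + del_cost(x),                               # delete x
                cur[j - 1] + ins_cost(y),                            # insert y
                prev[j - 1] + (0.0 if x == y else sub_cost(x, y)),   # keep or replace
            ))
        prev = cur
    return prev[-1]


# Hypothetical per-event weights (the operand part of the cost);
# events not listed cost 1.0.
WEIGHT = {"File Write": 2.0, "Network Listen": 3.0}


def weight(event: str) -> float:
    return WEIGHT.get(event, 1.0)


def replace_cost(x: str, y: str) -> float:
    # Assumed rule: replacing costs as much as the heavier of the two events.
    return max(weight(x), weight(y))


trace_a = ["Registry Query", "File Write", "Open Process"]
trace_b = ["Registry Query", "Registry Write", "Open Process"]
distance = edit_distance(trace_a, trace_b, weight, weight, replace_cost)
```

Passing the three cost functions separately keeps the operator (insert, delete, replace) and the operand (the event) as independent inputs to the cost, as the last bullet requires.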
Operation Cost Matrix for Similarity Measure
k-medoid partitioning clustering algorithm
1. Place K points into the space. These points represent the initial group medoids.
2. Assign each object to the group that has the closest medoid.
3. Recalculate the positions of the K medoids.
4. Repeat steps 2 and 3 until the medoids no longer move.
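The loop on this slide can be sketched as follows. This is a simple illustration, not the presented system: the names `assign` and `k_medoids`, the random initialization, the `seed` and `max_iter` parameters, and the toy numeric data are assumptions. Any distance function can be plugged in as `dist`.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def assign(objects: Sequence[T], medoids: List[int],
           dist: Callable[[T, T], float]) -> Dict[int, List[int]]:
    """Map each medoid index to the indices of the objects closest to it."""
    clusters: Dict[int, List[int]] = {m: [] for m in medoids}
    for i, obj in enumerate(objects):
        closest = min(medoids, key=lambda m: dist(obj, objects[m]))
        clusters[closest].append(i)
    return clusters


def k_medoids(objects: Sequence[T], k: int, dist: Callable[[T, T], float],
              seed: int = 0, max_iter: int = 100) -> Tuple[List[int], Dict[int, List[int]]]:
    rng = random.Random(seed)
    # Pick K objects at random as the initial medoids (stored as indices).
    medoids = rng.sample(range(len(objects)), k)
    for _ in range(max_iter):
        # Assign every object to its closest medoid.
        clusters = assign(objects, medoids, dist)
        # Recalculate: the new medoid of a cluster is the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:          # keep a medoid whose cluster came out empty
                new_medoids.append(m)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(dist(objects[c], objects[o]) for o in members),
            ))
        # Stop once the medoids no longer move.
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign(objects, medoids, dist)


# Toy data: two well-separated groups of numbers, absolute difference as distance.
points = [1, 2, 3, 100, 101, 102]
medoids, clusters = k_medoids(points, k=2, dist=lambda a, b: abs(a - b))
```

Unlike k-means, each medoid is always one of the input objects, so only pairwise distances are needed and no averaging of event sequences is required.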
Classifying a new object
Nearest Neighbor Classification
Compare the new object to all the medoids.
Assign the new object the family name of the closest medoid.
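The two steps above amount to a nearest-medoid lookup. A minimal sketch follows, assuming each medoid is stored together with its family name. The function `classify`, the family names, and the set-based distance are hypothetical; the set distance is only a stand-in for the edit distance used in the talk.

```python
from typing import Callable, List, Sequence, Tuple


def classify(new_obj: Sequence[str],
             medoids: List[Tuple[str, Sequence[str]]],
             dist: Callable[[Sequence[str], Sequence[str]], float]) -> str:
    """Return the family name of the medoid closest to new_obj."""
    family, _ = min(medoids, key=lambda fm: dist(new_obj, fm[1]))
    return family


def set_distance(a: Sequence[str], b: Sequence[str]) -> float:
    # Stand-in distance: number of event types seen in only one of the traces.
    return float(len(set(a) ^ set(b)))


# Hypothetical medoids, one per family.
family_medoids = [
    ("FamilyA", ["Registry Query", "File Write", "Open Process"]),
    ("FamilyB", ["Network Listen", "Allocate VM", "Write VM"]),
]

new_sample = ["Registry Query", "File Write", "Terminate Process"]
label = classify(new_sample, family_medoids, set_distance)
```

Only K comparisons are needed per new object, which is what makes the case-based approach scale beyond comparing against every stored sample.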
Experiment
• An automated distributed replication system
Data Analysis
• Test data:
  - Experiment 1: 461 samples of 3 families
  - Experiment 2: 760 samples of 11 families
• 10-fold cross validation
• We vary and contrast experiments by adjusting two parameters:
  - number of clusters (K)
  - maximum number of events (E)
• Measure error rate and accuracy gain
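For readers unfamiliar with 10-fold cross validation, the sketch below shows how the 461 samples of Experiment 1 could be split so that each sample is tested exactly once. The slides do not say how the folds were built, so the round-robin split without shuffling or stratification, and the name `k_fold_indices`, are assumptions.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; every index is in exactly one test fold."""
    # Round-robin assignment of sample indices to folds (assumed, not from the slides).
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# Experiment 1 has 461 samples: build the 10 train/test splits.
splits = list(k_fold_indices(461, k=10))
fold_sizes = [len(test) for _, test in splits]
```

In each round the classifier is built from the training indices and evaluated on the held-out fold, and the error rates are then combined across the 10 rounds.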
• Error rate: ER = number of incorrectly classified samples / total number of samples
• Accuracy: AC = 1 - ER
• Accuracy gain of x over y: G(x, y) = |(ER(y) - ER(x)) / ER(x)|
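The three measures translate directly into code. The sample counts below are invented purely to show the arithmetic and are not results from the experiments. The function names are likewise illustrative.

```python
def error_rate(incorrect: int, total: int) -> float:
    """ER = number of incorrectly classified samples / total number of samples."""
    return incorrect / total


def accuracy(er: float) -> float:
    """AC = 1 - ER."""
    return 1.0 - er


def accuracy_gain(er_x: float, er_y: float) -> float:
    """G(x, y) = |(ER(y) - ER(x)) / ER(x)|."""
    return abs((er_y - er_x) / er_x)


# Made-up example: configuration x misclassifies 10 of 100 samples, y 15 of 100.
er_x = error_rate(10, 100)
er_y = error_rate(15, 100)
gain = accuracy_gain(er_x, er_y)   # |(0.15 - 0.10) / 0.10|, about 0.5
```

Note that the gain is normalized by ER(x), so it is undefined when configuration x makes no errors.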
Experiment A
Observations
• Accuracy vs. number of clusters: the error rate decreases as the number of clusters increases.
• Accuracy vs. maximum number of events: the error rate decreases as the event cap increases. The more events we observe, the more accurately we capture a sample's behavior, and the more likely the clustering is to discover the semantic similarity among variants of a family.
• Accuracy gain vs. number of events: the gain in accuracy is more substantial at lower event caps (100 vs. 500) than at higher event caps (500 vs. 1000).
• Accuracy vs. number of families: the 11-family experiment is more accurate than the 3-family experiment in high event cap tests (1000), but the result is the opposite in lower event cap tests (100).
Conclusion
• Runtime behavior + machine learning allow us to focus on pattern/similarity recognition in behavior semantics
• Limitation: lack of code structural information
• Combine with static analysis to improve classification accuracy
• “Developing automated classification process that applies classifiers with innate learning ability on near lossless knowledge representation is the key to the future of malware classification and defense.”