Applied Machine Learning in Software Security

Applied Machine Learning in Software Security
August 10, 2017
© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Software Engineering Institute
Carnegie Mellon University
Pittsburgh, PA 15213

Applied Machine Learning in Software Security
Eliezer Kanal

Copyright 2017 Carnegie Mellon University

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.

NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at permission@sei.cmu.edu.

Carnegie Mellon® and CERT® are registered marks of Carnegie Mellon University.

DM-0004563

What is Machine Learning?

Tom Mitchell, former CMU Machine Learning department chair:

The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”

Machine Learning seeks to automate data analysis and inference.

What is Machine Learning?

If your problem can be stated as either of the following:

• I would like to use _____ data to predict _____.
• I would like to use ____ data to guess what ____ is.

…you would likely benefit from machine learning.

What is Machine Learning?

Sample Techniques:

• Regression

• K-Means Clustering

Clustering image: Weston.pace, https://commons.wikimedia.org/wiki/File:K_Means_Example_Step_4.svg
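As a rough illustration of the two techniques named above, the following sketch fits a least-squares line and runs a few rounds of k-means on one-dimensional data. It uses only the Python standard library. The function names (`fit_line`, `kmeans_1d`) and the toy data are illustrative assumptions, not part of the original talk.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Toy data, made up for illustration.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
centers, clusters = kmeans_1d([1.0, 1.5, 2.0, 10.0, 11.0, 12.0], centers=[0.0, 5.0])
```

Regression predicts a number from inputs, while k-means groups unlabeled points. They correspond to the "predict" and "guess what it is" framings from the earlier slide.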

What is Machine Learning?

Feature Engineering: Using existing data to create more informative data

Data Types

• Image: Static, Video
• Time series: Financial data, Event counts
• Structured text: Web forms, Structured data (JSON, XML), Source code
• Free text: News, Tweets, Email
• many more…
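To make "using existing data to create more informative data" concrete, here is a minimal sketch that turns raw event timestamps (a time-series data type) into per-hour event counts, plus a flag for unusually busy hours. The timestamps, the `hourly_counts` and `burst_features` helpers, and the burst threshold are hypothetical choices for illustration.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Map ISO-8601 timestamps to a count of events per (date, hour) bucket."""
    buckets = Counter()
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        buckets[(t.date().isoformat(), t.hour)] += 1
    return buckets


def burst_features(buckets, threshold=3):
    """Derive one feature row per bucket: the count and an is_burst flag."""
    return {
        key: {"count": n, "is_burst": n >= threshold}
        for key, n in sorted(buckets.items())
    }


# Made-up event log.
events = [
    "2017-08-10T09:01:00", "2017-08-10T09:15:00", "2017-08-10T09:59:00",
    "2017-08-10T10:05:00",
]
features = burst_features(hourly_counts(events))
```

The raw timestamps carry little signal individually. The derived counts and flags are the kind of engineered features a model can actually learn from.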

What is Machine Learning?

Examples:

o I would like to use incident ticket data to predict customer needs.

o I would like to use publicly available code to predict what code I will write.

o I would like to use bug report data to guess the location of undetected bugs in my code.

Applied ML: Vulnerability Detection

[Figure: Codebases → three Analyzers → Alerts]

Alerts today (bar chart, axis 0 to 60,000): TP 3,147; FP 11,772; Susp 48,690.

Many alerts left unaudited!

Applied ML: Vulnerability Detection

[Figure: Codebases → three Analyzers → Alerts, shown for "Today" and "Our Goal"]

Today (bar chart, axis 0 to 60,000): TP 3,147; FP 11,772; Susp 48,690.

Our Goal (bar chart, axis 0 to 50,000, labeled "66 effort days"): e-TP 12,076; e-FP 45,172; I 6,361.

Automated Statistical Classifier

• Expected True Positive (e-TP)
• Expected False Positive (e-FP)
• Indeterminate (I)
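One way to read the three output buckets is as thresholds on a classifier's estimated probability that an alert is a true positive. The sketch below assumes such a probability is already available for each alert. The cutoff values (0.9 and 0.1), the alert IDs, and the `triage` function are hypothetical; the talk does not state how the actual classifier draws these boundaries.

```python
def triage(alerts, tp_cutoff=0.9, fp_cutoff=0.1):
    """Split alerts into e-TP, e-FP, and Indeterminate by predicted probability.

    `alerts` maps an alert ID to the estimated probability that the alert
    is a true positive.
    """
    buckets = {"e-TP": [], "e-FP": [], "I": []}
    for alert_id, p_true in alerts.items():
        if p_true >= tp_cutoff:
            buckets["e-TP"].append(alert_id)
        elif p_true <= fp_cutoff:
            buckets["e-FP"].append(alert_id)
        else:
            buckets["I"].append(alert_id)  # still needs a human auditor
    return buckets


# Made-up scores for four alerts.
scores = {"alert-1": 0.97, "alert-2": 0.03, "alert-3": 0.55, "alert-4": 0.08}
result = triage(scores)
```

Under this reading, only the Indeterminate bucket would be routed to manual audit.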

Applied ML: Vulnerability Detection

Classifiers

• Lasso Logistic Regression
• CART
• Random Forest
• Extreme Gradient Boosting (XGBoost)

Some of the features used

• Analysis tools used
• Significant LOC
• Complexity
• Coupling
• Cohesion
• SEI coding rule
• Function/method length
• SLOC in func/method
• # parameters in func/method
• Tokens in func/method
• Alerts in func/method
• Alerts in file
• Methods in file
• SLOC in file
• Avg tokens
• Avg SLOC
• Depth in code repository
• Cyclomatic complexity (func/method)
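A few of the listed features can be approximated directly from source text. The sketch below computes SLOC, token count, parameter count, and a crude cyclomatic-complexity estimate for a single C-like function. The regular expressions and the `function_features` helper are simplifying assumptions for illustration; a real pipeline would rely on a proper parser or metrics tool.

```python
import re

# Decision points counted toward the complexity estimate.
DECISION = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\||\?")
# Identifiers/keywords, integer literals, or single punctuation characters.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def function_features(source):
    """Return rough per-function features from C-like source text."""
    lines = [ln.strip() for ln in source.splitlines()]
    # SLOC: non-blank lines that are not pure // comments.
    sloc = [ln for ln in lines if ln and not ln.startswith("//")]
    code = "\n".join(sloc)
    # Parameters: comma-separated items in the first parenthesized group.
    params = re.search(r"\(([^)]*)\)", code).group(1).strip()
    n_params = 0 if params in ("", "void") else params.count(",") + 1
    return {
        "sloc": len(sloc),
        "tokens": len(TOKEN.findall(code)),
        "parameters": n_params,
        # Approximation: one path plus one per decision point.
        "cyclomatic": 1 + len(DECISION.findall(code)),
    }


example = """
int clamp(int x, int lo, int hi) {
    // keep x within [lo, hi]
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}
"""
features = function_features(example)
```

Each function (or file) becomes one row of numbers like these, which is the form the classifiers listed above expect as input.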

Applied ML: Vulnerability Detection

Significant improvement!

• 91% classifier accuracy overall
• Per-rule accuracy shown in the table below
• 10x developer time saved!

Rule ID     Lasso LR   Random Forest   CART   XGBoost
INT31-C     98%        97%             98%    97%
EXP01-J     74%        74%             81%    74%
OBJ03-J     73%        86%             86%    83%
FIO04-J*    80%        80%             90%    80%
EXP33-C*    83%        87%             83%    83%
EXP34-C*    67%        72%             79%    72%
DCL36-C*    100%       100%            100%   100%
ERR08-J*    99%        100%            100%   100%
IDS00-J*    96%        96%             96%    96%
ERR01-J*    100%       100%            100%   100%
ERR09-J*    100%       88%             88%    88%

* Small quantity of data
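Per-rule figures of this kind are accuracies computed separately over the alerts that belong to each coding rule. A minimal sketch of that bookkeeping follows. The rule names and records are invented and the `accuracy_by_rule` helper is hypothetical, so the numbers it produces have no relation to the reported results.

```python
from collections import defaultdict


def accuracy_by_rule(records):
    """records: iterable of (rule_id, predicted_label, audited_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rule_id, predicted, actual in records:
        total[rule_id] += 1
        correct[rule_id] += predicted == actual
    return {rule: correct[rule] / total[rule] for rule in total}


# Invented audit records: (rule, classifier verdict, auditor verdict).
records = [
    ("RULE-A", "TP", "TP"),
    ("RULE-A", "FP", "FP"),
    ("RULE-A", "TP", "FP"),
    ("RULE-B", "FP", "FP"),
]
per_rule = accuracy_by_rule(records)
overall = sum(p == a for _, p, a in records) / len(records)
```

A rule with only a handful of audited alerts, like `RULE-B` here, can show 100% accuracy on very little evidence.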

Applied ML: Malware family classification

[Figure: Reverse Engineering cycle with stages Discovery, Refinement, and Reflection, and elements labeled File, New Family, and Artifact Catalog]

Signature 1: Files 1a 1b 1c 1d …
Signature 2: Files 2a 2b 2c 2d …
Signature 3: Files 3a 3b 3c 3d …

Which files hang together?

Applied ML: Malware family classification

Signal Flowgraph highlights behavior relating different malware families

Program instruction analysis shows similarity and diversion of behavior

Static Analysis identifies programs with similar source code
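Instruction-level similarity can be sketched by comparing the sets of short opcode sequences (n-grams) that two programs contain. The example below uses Jaccard similarity over opcode trigrams. The opcode listings are invented, and this measure is a common stand-in rather than the specific analysis behind these slides.

```python
def ngrams(opcodes, n=3):
    """Set of consecutive n-opcode windows in an instruction sequence."""
    return {tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented opcode listings for three samples.
sample_a = ["push", "mov", "call", "xor", "ret"]
sample_b = ["push", "mov", "call", "xor", "jmp"]
sample_c = ["nop", "nop", "add", "sub", "ret"]

sim_ab = jaccard(ngrams(sample_a), ngrams(sample_b))  # share 2 of 4 trigrams
sim_ac = jaccard(ngrams(sample_a), ngrams(sample_c))  # share none
```

Samples with high pairwise similarity are candidates for the same family, which is one way to approach the question of which files hang together.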

Applied ML: Malware family classification

Simplify visualization of extremely complex data through the use of dimensionality reduction and associated visualization techniques

Chernoff face experiment

t-SNE

Applied ML: Malware family classification

• Ground Truth: SVM trained with expert ground truth labels.

• Turkers Avg: Classifier trained with layperson labels.

Performance surprisingly similar!
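The slide does not say how the layperson labels were combined before training. One simple possibility, shown here purely as an assumption, is a per-sample majority vote, followed by a check of how often the aggregated label agrees with the expert label. All sample names and labels below are made up.

```python
from collections import Counter


def majority_vote(labels):
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def agreement(crowd, expert):
    """Fraction of samples whose majority crowd label matches the expert's."""
    matches = sum(majority_vote(crowd[s]) == expert[s] for s in expert)
    return matches / len(expert)


# Made-up labels: three layperson votes per sample, one expert label each.
crowd_labels = {
    "sample-1": ["family-A", "family-A", "family-B"],
    "sample-2": ["family-B", "family-B", "family-B"],
    "sample-3": ["family-A", "family-B", "family-B"],
}
expert_labels = {
    "sample-1": "family-A",
    "sample-2": "family-B",
    "sample-3": "family-A",
}

rate = agreement(crowd_labels, expert_labels)
```

A high agreement rate would be consistent with the slide's observation that classifiers trained on the two label sources perform similarly.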

Applied ML: Software cost estimation

Robert Ferguson, Dennis Goldenson, James McCurley, Robert W. Stoddard, David Zubrow, Debra Anderson. “Quantifying Uncertainty in Early Lifecycle Cost Estimation (QUELCE)”. Dec 2011. http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=10039

General Accounting Office. Defense Acquisitions: A Knowledge-Based Funding Approach Could Improve Major Weapon System Program Outcomes. Report to the Committee on Armed Services, U.S. Senate, July 2008, GAO-08-619.

Applied ML: Incident report mapping

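[Figure: a Landscape of entities and a set of Tickets, with the relationship between them labeled Observation and Inference]

Landscape elements:

• International partners
• Agency infrastructure
• Agency response teams
• US-CERT infrastructure
• Threat actors
• Phishing campaign
• Malware class
• Compromised website
• Common reference websites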

Indicators across tickets

Indicators occur with diverse patterns across tickets, reporters and time. Time on x axis, count on y axis, color coded by reporter.

[Figure panels: Malicious IP, Agency IP, US-CERT domain]

Similarity of indicators

Beginning with a reference indicator, we find indicators similar to it. Example: a malicious IP

• Colored circles are tickets
• Grey circles are indicators
• Large indicators near center of circle have similar occurrence patterns to the reference indicator.
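A minimal way to express "similar occurrence patterns" is to represent each indicator by the set of tickets it appears in and rank the other indicators by cosine similarity to the reference. The sketch below does this for binary occurrence. The indicator names, ticket IDs, and the choice of cosine similarity are assumptions, since the slide does not name the measure used.

```python
from math import sqrt


def cosine(tickets_a, tickets_b):
    """Cosine similarity of two binary occurrence vectors given as ticket sets."""
    if not tickets_a or not tickets_b:
        return 0.0
    return len(tickets_a & tickets_b) / sqrt(len(tickets_a) * len(tickets_b))


def rank_similar(reference, occurrences):
    """Other indicators sorted from most to least similar to the reference."""
    ref = occurrences[reference]
    scores = {
        name: cosine(ref, tickets)
        for name, tickets in occurrences.items()
        if name != reference
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented data: indicator -> tickets that mention it.
occurrences = {
    "malicious-ip": {"T1", "T2", "T3", "T4"},
    "dropper-domain": {"T1", "T2", "T3", "T4"},
    "phish-sender": {"T2", "T9"},
    "agency-ip": {"T7", "T8"},
}
ranking = rank_similar("malicious-ip", occurrences)
```

In the visualization described above, the top-ranked indicators would be the large ones drawn near the center.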

Indicator communities

But what if we aren’t starting with a reference indicator? We assume that indicators generated by a coherent real world process will be more likely to co-occur in tickets than arbitrary pairs of indicators. Find groups of highly similar indicators in complete indicator-ticket graph.
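The co-occurrence assumption suggests a simple baseline: link two indicators when they share at least some minimum number of tickets, then treat each connected group as a candidate community. The sketch below does exactly that. The ticket contents and the `min_shared` threshold are hypothetical, and connected components are a deliberately simple substitute for whatever community-detection method was actually used.

```python
from collections import defaultdict
from itertools import combinations


def communities(tickets, min_shared=2):
    """Group indicators that co-occur in at least `min_shared` tickets."""
    # Count shared tickets for every pair of indicators.
    shared = defaultdict(int)
    for indicators in tickets.values():
        for a, b in combinations(sorted(indicators), 2):
            shared[(a, b)] += 1

    # Adjacency sets over indicator pairs that pass the threshold.
    graph = defaultdict(set)
    for (a, b), count in shared.items():
        if count >= min_shared:
            graph[a].add(b)
            graph[b].add(a)

    # Connected components via depth-first search.
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Invented tickets: ticket ID -> indicators it contains.
tickets = {
    "T1": {"ip-1", "domain-1", "hash-1"},
    "T2": {"ip-1", "domain-1"},
    "T3": {"ip-1", "hash-1", "ip-2"},
    "T4": {"ip-2", "domain-2"},
    "T5": {"ip-2", "domain-2"},
}
groups = communities(tickets)
```

Here `ip-1` and `ip-2` appear together once in `T3`, but a single shared ticket falls below the threshold, so they land in separate groups.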

Indicator-ticket graph

A subset of the ticket-indicator graph (for a small set of selected indicators)

• Tickets are grey triangles
• Indicators are black circles
• Edges connect tickets to the indicators they contain

Contact Information

Eliezer Kanal
Technical Manager
Telephone: +1 412.268.5204
Email: ekanal@sei.cmu.edu