Applied Machine Learning for Data Exfiltration and Other Fun Topics


Applied Machine Learning | Wolff | Wallace | Zhao

APPLIED MACHINE LEARNING FOR DATA EXFIL AND OTHER FUN TOPICS

Matt Wolff, Chief Data Scientist
Brian Wallace, Security Researcher
Xuan Zhao, Data Scientist


GOALS OF THIS TALK: APPLIED MACHINE LEARNING
• Identify suitable problems for ML approaches
• Demonstrate by example how to apply ML
• Help jumpstart additional research in the Security + ML space


WHO WE ARE / CYLANCE
• Endpoint security company built around the capabilities of artificial intelligence
• Protecting millions of enterprise endpoints
• Founded in 2012, $177mm raised
• Booth #1124

Matt Wolff, Chief Data Scientist
Brian Wallace, Senior Security Researcher
Xuan Zhao, Data Scientist


TALK OVERVIEW
• Machine Learning Introduction
• NMAP Clustering
  • Feature Spaces
  • Distances
  • Clustering
• Botnet Panel Identification
  • Classification
  • Feature Reduction
• Obfuscating Data with Markov Chains


MACHINE LEARNING OVERVIEW
• Machine learning techniques are data driven
• Available data should be able to solve the problem in a meaningful way
• Approaches exist for dealing with raw data, as well as labeled or annotated data


MACHINE LEARNING OVERVIEW
• Given some data, different types of machine learning can be applied
• Clustering is useful for finding similarity across a dataset to uncover trends or other insights
• With labeled data, classification can be useful to build predictive models


MACHINE LEARNING OVERVIEW
• Often, raw data has to be transformed in some way to be used by machine learning algorithms
• The typical process is to extract features from data, and turn those features into vectors
• Vectors are then fed into ML algorithms for training or other purposes


MACHINE LEARNING OVERVIEW
• Recommended resources
  • Scikit-learn.org
  • Python
• Source code for all tools in this talk available on the Cylance public git repo
  • https://github.com/CylanceSPEAR
• Should be able to pull talk source and start modifying as needed to suit data driven problems in your own organization or research group.


TOOL – NMAP CLUSTERING
• NMAP is a popular port scanning tool
• Produces a large amount of data per IP address
• Scans over a large number of IPs can be difficult to make sense of
• NMAP Clustering is a tool which clusters (groups) IPs based on their open ports, services, etc.


FEATURES
• Features are informative, discriminative pieces of information that can describe a sample/observation/phenomenon/etc.
• Feature extraction is pivotal to the machine learning pipeline
• Our features are based on NMAP output
• Each port is a feature, each service on each port is a feature, each version of each service on each port is a feature, etc.
• Script output included in features (including website titles, public keys, etc.)


VECTORS
• Numerical representation of a sample (IP in NMAP case)
• Array of values which represent all features from one sample
• Vectors can be thought of as points in high dimensional space
• Each feature is a dimension, the value of the feature in the vector is the position in that dimension
• If we have only two features, it is really easy to visualize
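The step from features to vectors can be sketched in a few lines of Python. This is a minimal illustration, not the NMAP Clustering tool's own code: the scan results, the IP addresses, and the feature naming scheme ("port:22", "service:22:ssh") are all assumptions made up for the example. Each distinct feature gets its own dimension, and each IP becomes a binary vector.

```python
# Hypothetical scan results: IP -> set of features extracted from NMAP output.
# The feature names are an illustrative naming scheme, not the tool's format.
scans = {
    "10.0.0.1": {"port:22", "service:22:ssh", "port:80", "service:80:http"},
    "10.0.0.2": {"port:22", "service:22:ssh"},
    "10.0.0.3": {"port:80", "service:80:http", "port:443", "service:443:https"},
}

def build_vocabulary(scans):
    """Assign every distinct feature a fixed dimension (column index)."""
    features = sorted(set().union(*scans.values()))
    return {name: index for index, name in enumerate(features)}

def vectorize(feature_set, vocabulary):
    """Return a binary vector: 1.0 where the sample has the feature, else 0.0."""
    vector = [0.0] * len(vocabulary)
    for name in feature_set:
        vector[vocabulary[name]] = 1.0
    return vector

vocabulary = build_vocabulary(scans)
vectors = {ip: vectorize(features, vocabulary) for ip, features in scans.items()}
```

With this encoding every IP is a point in a space with one dimension per observed feature, which is the form the distance and clustering slides assume.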


VECTORS – 2D


VECTORS – 3D

File          Filename Length   File size (kB)   Size of headers (kB)
calc.exe      8                 918.528          63
notepad.exe   11                193.024          45
malware.exe   11                193.024          10


DISTANCES
• Distance: Describes the discrepancy between two points
• Physical distance between two points:

Pythagorean theorem: $a^2 + b^2 = c^2$

[Figure: right triangle with sides labeled a, b, and c]


File          Filename Length   File size (kB)   Size of headers (kB)
calc.exe      8                 918.528          63
notepad.exe   11                193.024          45
malware.exe   11                193.024          10

[Figure panels: 2D, 3D]


DISTANCES
Multiple Distance Metrics – As long as an operation satisfies certain mathematical criteria, it can be used as a distance metric

[Figure: Manhattan and Euclidean distance between Point 1 and Point 2]
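As a worked example, the two metrics named on this slide can be computed for the three files in the earlier table. This is a minimal sketch using only the standard library; the tuple layout (filename length, file size, header size) is an assumption about how the table rows would be turned into vectors.

```python
import math

# Rows from the table: (filename length, file size in kB, header size in kB).
samples = {
    "calc.exe": (8, 918.528, 63),
    "notepad.exe": (11, 193.024, 45),
    "malware.exe": (11, 193.024, 10),
}

def euclidean(p, q):
    """Straight-line distance: the Pythagorean theorem in any number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of the absolute differences along each dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

for name in ("notepad.exe", "malware.exe"):
    print(name,
          round(euclidean(samples["calc.exe"], samples[name]), 3),
          round(manhattan(samples["calc.exe"], samples[name]), 3))
```

Note that notepad.exe and malware.exe differ only in header size, so both metrics give a distance of 35 between them. Against calc.exe, the unscaled file size dominates both distances, which is one reason features are often normalized before clustering.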


CLUSTERING
• With a way to measure distance, we can group items by how close they are, aka clustering
• Clusters are distinct groups of samples (IP) which have been grouped together
• Clustering is generally unsupervised learning
• Different algorithms with different configurations group these samples in different ways


k-Means
• Clustering algorithm
• User supplies k (designated number of clusters)
• All samples are assigned to random clusters
• Center of each cluster is computed by taking the mean (average) of all samples in the cluster
• Samples are then assigned to the cluster whose center they are closest to
• Centers are recomputed, algorithm loops until no samples change clusters
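The steps listed above can be sketched directly in Python. This is a minimal standard-library illustration of the procedure as described, not the implementation used in the talk's tools (scikit-learn provides a production version). The `seed` and `max_iterations` parameters and the handling of empty clusters are assumptions added to keep the sketch deterministic and safe.

```python
import math
import random

def mean(points):
    """Component-wise average of a non-empty list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(samples, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    # Step 1: assign every sample to a random cluster.
    assignments = [rng.randrange(k) for _ in samples]
    for _ in range(max_iterations):
        # Step 2: compute each cluster's center as the mean of its members.
        centers = []
        for cluster in range(k):
            members = [s for s, a in zip(samples, assignments) if a == cluster]
            # An empty cluster has no mean; reseed it from a random sample
            # (one common workaround, not something the slide specifies).
            centers.append(mean(members) if members else list(rng.choice(samples)))
        # Step 3: reassign each sample to the cluster with the nearest center.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(sample, centers[c]))
            for sample in samples
        ]
        # Step 4: stop once no sample changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return assignments, centers
```

Calling `k_means(points, 2)` on two well-separated groups of points returns one cluster label per point plus the final centers. Which group receives label 0 and which receives label 1 depends on the random initial assignment.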


k-Means


NMAP CLUSTERING – MANUAL / AUTOMATIC
• Manual allows you to supply your own clustering parameters
• Automatic tries many different methods with theoretically-found optimal parameters and picks what it determines to be the best
• Demo with manual strategy
• Demo with automatic strategy


NMAP CLUSTERING - INTERACTIVE
• Incorporate the user's decision into the clustering process.
• The clustering result will be customized according to the customer's preference in this way
• Process (will show with a demo):
  1. User decides whether the cluster needs to be split or not
  2. If yes, then split using divisive clustering
  3. If no, finalize this cluster
  4. Recursively split until users are satisfied with all the clusters

[Figure: tree of split decisions; Y = Split, N = No-Split; each N branch ends in a "Finalized" cluster]
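The interactive process above can be sketched as a short recursive function. This is a hedged illustration only: the `should_split` callback stands in for the user's Y/N answer (in a real tool it might wrap a prompt), and `bisect` is a deliberately simple splitting rule chosen for the example, not necessarily the divisive method the tool uses.

```python
import math
from itertools import combinations

def bisect(cluster):
    """Split a cluster in two around its two most distant samples (illustrative rule)."""
    seed_a, seed_b = max(combinations(cluster, 2), key=lambda pair: math.dist(*pair))
    left = [s for s in cluster if math.dist(s, seed_a) <= math.dist(s, seed_b)]
    right = [s for s in cluster if math.dist(s, seed_a) > math.dist(s, seed_b)]
    return left, right

def interactive_divisive(cluster, should_split):
    """Recursively split until should_split (the user's Y/N answer) says no."""
    if len(cluster) < 2 or not should_split(cluster):
        return [cluster]  # N: finalize this cluster
    left, right = bisect(cluster)  # Y: split, then ask again for each half
    if not right:
        return [cluster]  # all samples identical, nothing left to split
    return interactive_divisive(left, should_split) + interactive_divisive(right, should_split)
```

For example, passing `should_split=lambda c: len(c) > 2` simulates a user who approves every split until clusters have two samples or fewer.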


TOOL – ID PANEL
• Botnet panels (command and control websites) can be difficult to identify
  • Need previous knowledge of the botnet panels
  • Often modified to avoid detection or for vanity
  • Many are based off others, so distinguishing can be difficult
• We can train a model to identify if we are looking at a bot panel and which one it is, with a small number of requests
• Minimizing the number of requests to classify improves stealth and rate of classification


CLASSIFICATION
• This is a classification problem
• Classification answers "Is it what we are looking for?"
• Classification is generally supervised learning
• Supervised learning requires training samples to have labels
• Classification methods range from simple to highly complex


ID PANEL FEATURES
• Botnet panels are similar to normal websites
• Contain various file types, often edited
• HTTP response codes + content comparison
• Encoding content as features is difficult
• ssDeep provides a continuous value by comparing content


COLLECTING DATA
• Collection of known botnet panels
• Request all known paths for all known botnet panel types
• Store HTTP status code and ssDeep of content
• Collection of sites that are not botnet panels needed as well
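The idea of storing a status code plus a content-similarity score per path can be sketched as a feature vector. This is an assumption-heavy illustration: ssDeep is not part of the Python standard library, so `difflib.SequenceMatcher` is swapped in here as a stand-in similarity measure, and the `reference` mapping of paths to known panel content is hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in for an ssDeep comparison: a 0-100 score of how alike two bodies are."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def featurize(responses, reference):
    """Turn {path: (status, body)} into a flat vector of status codes and similarity scores.

    `reference` maps each known path to the content seen on a known panel.
    Paths missing from `responses` are encoded as status 0 and an empty body.
    """
    vector = []
    for path in sorted(reference):
        status, body = responses.get(path, (0, ""))
        vector.append(status)
        vector.append(similarity(body, reference[path]))
    return vector
```

Each site then yields one vector with two values per known path, which can be labeled (panel type or "not a panel") and used to train a classifier.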


DECISION TREES
• Decision trees are simple classifiers
• Split the dataset one feature at a time until the decision is confident
• Result in a tree of queries where the answers produce a decision
• Simple to train


ENSEMBLE OF DECISION TREES
• A single decision tree may be overly focused on training data
• Can alleviate this problem by building multiple decision trees for each label
• Combining the results allows each decision tree to vote
• Partial answers may still be of interest to the user
• Ensembles can obtain better predictive performance

[Figure: three decision trees per label (Pony DT1-DT3, Dexter DT1-DT3, Zeus DT1-DT3) vote; result: "It's Pony"]
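The voting scheme can be sketched as follows. This is a simplified illustration: real decision trees are learned from labeled data, whereas here each "tree" is a hand-written one-question function, and the feature names (`admin_status`, `cp_similarity`, and so on) and thresholds are hypothetical.

```python
def classify(sample, ensembles, threshold=0.5):
    """Return (label, vote_fraction) pairs for labels whose trees mostly vote yes."""
    scores = {}
    for label, trees in ensembles.items():
        votes = sum(1 for tree in trees if tree(sample))
        scores[label] = votes / len(trees)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(label, score) for label, score in ranked if score >= threshold]

# Hypothetical stand-ins for trained trees: three per label, as in the figure.
ensembles = {
    "Pony": [
        lambda s: s["admin_status"] == 200,
        lambda s: s["admin_similarity"] > 60,
        lambda s: s["gate_status"] == 200,
    ],
    "Zeus": [
        lambda s: s["cp_status"] == 200,
        lambda s: s["cp_similarity"] > 60,
        lambda s: s["gate_status"] == 200,
    ],
}

sample = {
    "admin_status": 200, "admin_similarity": 85, "gate_status": 200,
    "cp_status": 404, "cp_similarity": 0,
}
result = classify(sample, ensembles)
```

Returning the vote fraction rather than a bare label keeps partial answers visible, so a label that wins two of three votes can still be reported to the user.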


ID PANEL DEMO – COMMAND LINE
• Quick way to check if a website directory contains a botnet panel
• Easy to batch searching of multiple websites/directories
• Easy to grep results


ID PANEL DEMO – CHROME EXTENSION
• Every website directory visited is tested
• Results are stored in browser
• ssDeep ported from C to JavaScript
  • https://github.com/kripken/emscripten
• Extension available in Chrome extension store (free, of course)


TOOL – MARKOV OBFUSCATE
• Data exfiltration from a network often requires avoiding an outbound firewall
• Deep packet inspection looks to block anything undesirable
• Easy to encrypt data, but it's also easy to drop information that can't be read
• We can make our data look like something else entirely


MARKOV CHAIN
• Simple machine learning method for characterizing sequence data
• Learns the transition pattern from one state to another based on how likely a state comes after another state in the training data
• We can create sequences with transition patterns that are learned from the data it was trained on

Training Data: AABBABABA

Transition Matrix (rows: current state, columns: next state)
      A          B
A     1 (25%)    3 (75%)
B     2 (100%)   0 (0%)

[Figure: tree of sequences starting from A, branching to A and B at each step]
P(AAA) = 0.0625
P(AAB) = 0.1875
P(ABA) = 0.75
P(ABB) = 0
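The counting behind a transition matrix can be sketched in a few lines. Note the assumption: the training string used here, "AABABAB", is chosen for the example because it reproduces the counts in the matrix above (A to A once, A to B three times, B to A twice); it is not taken from the slide.

```python
from collections import Counter, defaultdict

def train(sequence):
    """Count transitions and normalize each row into probabilities."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }

def sequence_probability(model, sequence):
    """Probability of the transitions in `sequence`, given its first state."""
    probability = 1.0
    for current, following in zip(sequence, sequence[1:]):
        probability *= model.get(current, {}).get(following, 0.0)
    return probability

# Illustrative training string (an assumption, see the note above).
model = train("AABABAB")
for candidate in ("AAA", "AAB", "ABA", "ABB"):
    print(candidate, sequence_probability(model, candidate))
```

With this model, each sequence probability is the product of its transition probabilities, for example P(AAB) = 0.25 × 0.75 = 0.1875, matching the values listed above.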


POPULAR USE CASES OF MARKOV CHAINS

Weather Prediction
        Sunny    Rainy
Sunny   0.9999   0.0001
Rainy   0.9      0.1

Recommendation
[Figure: state-transition diagram with edge probabilities 0.9, 0.95, 0.1, 0.05, 0.03, 0.13]


ENCODING DATA WITH A MARKOV CHAIN
• Given a transition matrix, we can sort items by how likely they are to follow our current item
• If we choose the 5th most likely item, anyone with a model trained on the same data can identify that it is the 5th most likely
• This encodes the number 5 in the transition from our first item to our second item
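The rank-based encoding described above can be sketched as a round trip. This is a minimal illustration under stated assumptions, not the Markov Obfuscate implementation: ranks are 0-based, ties are broken alphabetically so that both sides sort identically, and the tiny two-state training string is made up so each state has two possible successors (enough to carry one bit per transition).

```python
from collections import Counter, defaultdict

def train(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

def ranked(model, state):
    """Successors of `state`, most likely first; ties broken alphabetically."""
    return sorted(model[state], key=lambda item: (-model[state][item], item))

def encode(model, start, numbers):
    """Hide each number n by stepping to the n-th most likely successor (0-based)."""
    path = [start]
    for n in numbers:
        path.append(ranked(model, path[-1])[n])
    return path

def decode(model, path):
    """Recover the numbers by looking up each step's rank in the same model."""
    return [ranked(model, a).index(b) for a, b in zip(path, path[1:])]

model = train("AABBABBBA")          # illustrative training data
bits = [1, 0, 0, 1]                 # the data to hide
path = encode(model, "A", bits)     # looks like an ordinary A/B sequence
recovered = decode(model, path)
```

A model trained on a book has many more successors per word, so each transition can carry a larger number. The sketch raises `IndexError` if a number exceeds the available successors, which a real encoder would have to handle.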


MARKOV OBFUSCATE - ENCODING
• Train our model with a book
• Observing transitions from word to word
• Generate data based on transition probabilities
• Demo


MARKOV OBFUSCATE - WRAPPING
• Simple to transfer our data through a pipeline that looks like normal HTTP traffic
• Looks like a user posting to their blog
• Demo


MARKOV OBFUSCATE – HAVING FUN
• Train our models on Taylor Swift lyrics
• Train a Markov Model based on Taylor Swift songs
• Play the generated lyrics through festival with tones/beats learned from songs
• First live "Tylance Swift" concert, demo


WRAPPING UP
• Any problem where there is a significant amount of data generated could benefit from a machine learning approach
• Lots of great online resources to help anyone get started
• Having labeled or annotated data makes more ML approaches viable compared to unlabeled data


QUESTIONS?
• Email: [email protected]
• Stop by booth #1124
• Career opportunities: https://www.cylance.com/cylance-careers