D4.2 OSINT data fusion and analysis...

Project Deliverable

D4.2 OSINT data fusion and analysis architecture

Project Number: 700692
Project Title: DiSIEM – Diversity-enhancements for SIEMs
Programme: H2020-DS-04-2015
Deliverable type: Report
Dissemination level: PU
Submission date: March 2, 2018
Responsible partner: FCiências.ID
Editor: Pedro M. Ferreira
Revision: 1.0

The DiSIEM project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 700692.

D4.2


Editor
Pedro M. Ferreira, FCiências.ID

Contributors
Pedro M. Ferreira, FCiências.ID
Alysson Bessani, FCiências.ID
Fernando Alves, FCiências.ID
Nuno Dionísio, FCiências.ID
Ana Respício, FCiências.ID
João Alves, FCiências.ID
Pedro Dias Rodrigues, EDP
Mario Faiella, Atos
Gustavo Gonzalez, Atos
Michalis Michael, DigitalMR
Abdullahi Adamu, DigitalMR


Executive Summary

This deliverable reports the continuation of the work presented in the first deliverable of work package 4, Deliverable 4.1 (D4.1), “Techniques and tools for OSINT-based threat analysis”, which was the main result of Task 4.1. D4.1 focused on characterising the landscape of security-related OSINT sources and the existing tools to collect and process OSINT. Additionally, D4.1 described initial work on the models and techniques that can be used to process OSINT data for predicting threats against a given organisation's IT infrastructure, and on the techniques that can be used to express, share and integrate gathered OSINT in a standardized way.

Deliverable 4.2 reports the progress of the work on models and tools to process OSINT and predict security threats, and on the techniques and tools used to share OSINT and integrate it with events from within the organization's IT infrastructure. It includes the detailed architectures, models and algorithms to be implemented on OSINT-based threat predictors, as well as their application results, which are the main outcomes of Task 4.2.

First, it describes the machine learning techniques and tools developed to analyse the OSINT information, identify security-related trends and predict threats to the managed infrastructure. Secondly, it presents the tools developed to use knowledge from public IP blacklists, combined with information about the organization's cybersecurity state, to decrease the number of false positive incidents. These tools can be used to improve the organization's threat awareness by feeding the SIEMs of DiSIEM's industrial partners with the OSINT selected by the tools. Finally, the component employed to integrate and prioritize this information in the context of the organization's infrastructure is also discussed.

Besides the descriptions of the developed tools and their results, this report presents a review of the state of the art in OSINT processing that is useful for a direct comparison with DiSIEM's contributions. Overall, the main results of this deliverable are:

• A description of the state of the art in employing machine learning to process security-related OSINT;

• The tool developed to improve the reliability of information gathered from IP blacklists;

• The end-to-end architecture to process Twitter OSINT;

• The machine learning models developed for OSINT processing and their results;

• A threat scoring mechanism to optimize the prioritization and integration of discovered threats;

• The architecture of the Context-aware Intelligence Integrator component for threat sharing;

• Pragmatics for a Security Operations Centre deployment of the developed tools.


Table of Contents

1 Introduction
  1.1 Organization of the Document
2 Related Work
  2.1 Why use Twitter?
  2.2 Infrastructure Specific OSINT Approaches
  2.3 Approaches based on unstructured text
  2.4 Feeding Protection Systems with OSINT
  2.5 Deep Learning
3 OSINT Processing Tools and Techniques
  3.1 Blacklisted IPs OSINT Processing
    3.1.1 Trustworthy Blacklists in SIEM Systems
    3.1.2 IPs Collector
    3.1.3 Trust Assessment
    3.1.4 Trustworthy Assessment Blacklists Interface
    3.1.5 Results of the Proposed Framework
  3.2 Infrastructure-related OSINT Processing
    3.2.1 Experimental Machine Learning Approaches
      3.2.1.1 General methodology
      3.2.1.2 SVM and ANN approach
      3.2.1.3 Deep learning approach
      3.2.1.4 Clustering
      3.2.1.5 Ongoing and future work
      3.2.1.6 Pragmatics for a SOC deployment
    3.2.2 Listening247 Threat Predictor
      3.2.2.1 Architecture
      3.2.2.2 Novelties of the Cyber Threat Predictor
      3.2.2.3 Data Sources
      3.2.2.4 Cyber-Threat Modelling
      3.2.2.5 Experimental Results
      3.2.2.6 Conclusions
4 Context-Aware OSINT Integration
  4.1 Novelty of the Component
  4.2 Threat Intelligent Platforms comparison
  4.3 Context-Aware Intelligence Integrator Architecture
  4.4 Context-Aware Threat Score Analysis
5 Summary and Conclusions
6 References
List of Acronyms


List of Figures

Figure 1 - Workflow of the IP blacklist processing framework
Figure 2 - A screen of the TABI
Figure 3 - Configuration of a rule configuration in Arcsight SIEM
Figure 4 - The general architecture of the machine learning approaches
Figure 5 - Comparison between linear and non-linear separation
Figure 6 - The Pareto curves for SVM and MLP using D1 for datasets A and ABCD, respectively
Figure 7 - MLP classifier results for infrastructures A, B, C, D, ABCD, and the classifier ensemble, respectively
Figure 8 - SVM classifier results for infrastructures A, B, C, D, ABCD, and the classifier ensemble, respectively
Figure 9 - Architecture of the deep learning approach
Figure 10 - Example of a convolution operation
Figure 11 - Convolution Neural Network for Sentence Classification (based on [KIM14])
Figure 12 - Comparison of the five model variants
Figure 13 - Comparison of a model without and with an additional Fully-connected Layer
Figure 14 - The number of tweets presented using three different approaches
Figure 15 - Example of an IoC generated from a tweet exemplar in MIST format
Figure 16 - An overview of the entities and their relationships, and the role of the threat predictor
Figure 17 - A tweet describing a XSS vulnerability
Figure 18 - An example of a knowledge graph obtained from a recorded fact
Figure 19 - An overview of the two pipelines that will be used for threat prediction
Figure 20 - Most frequent tokens found in the dataset
Figure 21 - Noise filtering results: f1-score, precision and recall, respectively
Figure 22 - Confusion matrices for each of the three folds, respectively
Figure 23 - Most frequent terms found in the description of the exploits
Figure 24 - Number of exploits per platform
Figure 25 - Platform prediction by description results: f1-score, precision, and recall, respectively
Figure 26 - Confusion matrix of the 1st fold of the 3 fold cross-validation
Figure 27 - Confusion matrix of the 2nd fold of the 3 fold cross-validation
Figure 28 - Confusion matrix of the 3rd fold of the 3 fold cross-validation
Figure 29 - Exploit type prediction by description results: f1-score, precision, and recall, respectively
Figure 30 - Confusion matrix of the 1st fold for the 3 fold cross-validation for predicting exploit type
Figure 31 - Confusion matrix of the 2nd fold for the 3 fold cross-validation for predicting exploit type
Figure 32 - Confusion matrix of the 3rd fold for the 3 fold cross-validation for predicting exploit type
Figure 33 - Exploit type prediction results: f1-score, precision, and recall, respectively
Figure 34 - Platform prediction results: f1-score, precision, and recall, respectively
Figure 35 - Context-Aware Intelligence Integrator Architecture


List of Tables

Table 1 - Comparison between the publishing dates of two threats on Twitter and NVD
Table 2 - Representation of a tweet before and after pre-processing
Table 3 - The infrastructure designed for tweet collection and filtering, and its subdivision into four coherent parts
Table 4 - Datasets collected and labelling details
Table 5 - Sets of accounts used to create the datasets
Table 6 - The best configurations obtained for each classifier and dataset
Table 7 - Extension of the Pre-Processing stage
Table 8 - The structure of the best models obtained
Table 9 - An example of a cluster and its exemplar (in bold)
Table 10 - The words used in the naïve filter
Table 11 - Results obtained by applying the clustering stage of the tweet processing pipeline
Table 12 - Sample tweets and their relevance
Table 13 - Optimized parameters for TF-IDF
Table 14 - Optimized parameters for SVM
Table 15 - Sample of reports submitted to ExploitDB
Table 16 - Optimized parameters for TF-IDF
Table 17 - Optimized parameters for SVM
Table 18 - TIP comparison
Table 19 - Examples of Score Aggregation Functions
Table 20 - Evaluation of the Indicator Heuristic


1 Introduction

Cybersecurity is a matter of growing concern as cyber-attacks cause loss of income, leaks of sensitive information, and even failures of vital infrastructures. To properly protect an infrastructure, a security analyst must have timely information about security threats to the IT infrastructure and the latest news in terms of updates, patches, mitigation measures, vulnerabilities, attacks, and exploits. Ideally, this awareness should be raised within the Security Operations Centre (SOC) through Security Information and Event Management (SIEM) software, to allow a correlation between the latest information available and infrastructure events.

Collecting and processing OSINT is becoming a fundamental approach for obtaining cybersecurity threat awareness. Recently, the research community has demonstrated that many different types of useful information and Indicators of Compromise (IoC) can be obtained from OSINT [LIA16, SAB15, ZHU16]. Besides these research-oriented efforts, SOC analysts try to stay up to date on possible threats against the IT infrastructure of their organizations by following cybersecurity OSINT. Nevertheless, skimming through various news feeds is a time-consuming task for any security analyst. Furthermore, an analyst is not guaranteed to find news relevant to the IT infrastructure they oversee. Therefore, tools are required not only to collect OSINT, but also to process it and keep only the parts relevant to the SOC analysts, thus decreasing the amount of information and consequently the time required to analyse it and act upon it. When appropriate, the filtered information must be further processed to extract IoCs.

The tools reported in this document address the problem of keeping SOC analysts aware of the most relevant threats against the infrastructures under their responsibility. Maintaining such awareness requires searching, collecting, and processing a high volume of data to obtain relevant knowledge from the most interesting OSINT sources. This is a time-consuming task, for which security analysts have a limited time budget, even though the quality of their work depends on this knowledge.

Regarding the processing of non-specific OSINT information, one of the approaches followed in DiSIEM is based on exploring state-of-the-art machine learning algorithms and software packages to build binary classifiers that decide whether a specific piece of OSINT text mentions a threat to a given IT infrastructure. The proposed solution has two main objectives: to maximize the amount of relevant information obtained, and to minimize the time required to inspect it. To achieve these goals, we designed a processing pipeline consisting of an OSINT information gatherer, an automatic method for selecting the relevant information, and a summarizing function. More specifically, an automated tool gathers tweets from security-related accounts, a supervised machine learning technique selects those relevant to the infrastructure being monitored, and a clustering method is used to avoid presenting repeated or unnecessary information. By using this approach, a security analyst can review a summary containing only relevant data in a short period of time.


Another approach consisted in leveraging existing techniques for mining market trends from social network data to analyse security threats against the monitored infrastructure. This approach was designed using DigitalMR's listening247 platform. listening247 connects via APIs to multiple data aggregation engines to cover sources of on-line text such as Twitter, Facebook, blogs, boards, videos and news.

The ability to collect and process OSINT is often not enough. Threat intelligence must be expressed and then shared using specific standards, allowing the involved parties to speed up the processing and analysis of received information and to achieve interoperability among them. Additionally, the gathered OSINT should be integrated with events originating within the organisation's IT infrastructure and given a threat score that indicates its severity and allows threats to be prioritized. This document also discusses a component designed for these purposes.

1.1 Organization of the Document

Chapter 2 presents work related to the techniques and tools discussed in this deliverable for OSINT analysis. Chapter 3 presents a detailed description of the tools and techniques developed to identify security-related threats to a monitored infrastructure, as well as the results they achieved. Then, the integration of security-related OSINT with security events from the organisation's IT infrastructure is addressed in Chapter 4. Finally, Chapter 5 presents a summary of the work and draws some conclusions.


2 Related Work

In Deliverable 4.1 [DIS41] we provided a full review of the literature regarding OSINT-based techniques for cybersecurity awareness, and of existing commercial solutions devised to increase the security level of organizations. In this section we review research work that is similar or related to the approaches developed in DiSIEM and presented in Section 3.

2.1 Why use Twitter?

As already mentioned, one of the tools developed was designed specifically to deal with security-related information posted on Twitter. Twitter is useful as an OSINT source since it aggregates timely data from multiple sources in a form that is simple to analyse and process. Given that users regularly tweet about their activities and the unusual events they find, an interesting question arises about the possibility of obtaining valuable security-related information from Twitter before it becomes available on established databases as confirmed threats (e.g., National Vulnerability Database (NVD), ExploitDB).

Some previous works already provided good evidence that such early information can be obtained from Twitter [CAM13, SAB15]. To deepen these findings, we began a study comparing vulnerability publication dates on Twitter and NVD. One motivating example is provided by comparing the publication dates of NSA tools leaked in August 2016 (see footnote 1). As can be observed in Table 1, the tools called EGREGIOUSBLUNDER and ESCALATEPLOWMAN were discussed on Twitter nine and six days, respectively, ahead of their appearance on NVD.

Table 1 - Comparison between the publishing dates of two threats on Twitter and NVD.

NVD ID           | CVE-2016-6909    | CVE-2016-7089
Publication date | 24-08-2016       | 24-08-2016
Threat name      | EGREGIOUSBLUNDER | ESCALATEPLOWMAN
Tweet date       | 15-08-2016       | 18-08-2016
Days ahead       | 9                | 6
Tweet link       | https://twitter.com/clucianomartins/status/765288624044802048 | https://twitter.com/evanderburg/status/766309829476429824
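The "Days ahead" row of Table 1 is simply the difference between the NVD publication date and the tweet date. A minimal sketch of that arithmetic follows; the function name `days_ahead` and the dictionary layout are illustrative assumptions, while the dates are the ones listed in Table 1.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"  # day-month-year, as written in Table 1


def days_ahead(tweet_date: str, nvd_date: str) -> int:
    """Number of days a tweet preceded the NVD publication (hypothetical helper)."""
    tweeted = datetime.strptime(tweet_date, DATE_FORMAT).date()
    published = datetime.strptime(nvd_date, DATE_FORMAT).date()
    return (published - tweeted).days


# Dates copied from Table 1: (tweet date, NVD publication date).
table_1 = {
    "CVE-2016-6909": ("15-08-2016", "24-08-2016"),
    "CVE-2016-7089": ("18-08-2016", "24-08-2016"),
}

lead_times = {cve: days_ahead(*dates) for cve, dates in table_1.items()}
print(lead_times)
```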

1 https://blog.comae.io/shadow-brokers-nsa-exploits-of-the-week-3f7e17bdc216

2.2 Infrastructure Specific OSINT Approaches

A pragmatic approach for the collection of context-specific OSINT is to use a keyword set to guide the search. The works presented below share that approach to select possibly relevant tweets, and also use a machine learning technique to classify the collected tweets security-wise.

Mittal et al. [MIT16] use a named entity recognizer to extract key concepts from tweets, whose importance is asserted by a knowledge base created from security definitions. Finally, the tweets are classified using a Naive Bayes classifier [ZAK14].

Ritter et al. [RIT15] describe how to use a small number of tweets to train an Expectation-Maximization (EM) classifier [ZAK14]. Since EM can be trained using a small number of labelled samples, it is an attractive classifier: it requires little manual labour compared to other supervised approaches. For more complex problems, it is fairly simple to train various EM classifiers with different tweets, where each classifier is meant to tackle only a subset of the problem.

Sabottke et al. [SAB15] use Twitter to collect descriptions of exploits not yet available on NVD. This work shows that exploit data can be found on Twitter on average two days before it is published on NVD. These findings corroborate the theory that threat data can be discovered in OSINT sources before it is included in threat databases.

Trabelsi et al. [TRA15] collect tweets related to an IT infrastructure and cluster them by subject. Collected threats not referenced by NVD are considered novel and handled like zero-day vulnerabilities.

The keyword set used to select Twitter data is a sensitive element of these proposals, as an incomplete set may filter out important tweets. Moreover, it must be maintained over time: new important keywords should be added and irrelevant keywords removed.

2.3 Approaches based on unstructured text

This section presents a summary of related work that is relevant to understand the context and motivation for the design choices of the listening247-based threat predictor. Threat prediction approaches in the literature can be classified into two groups: (i) approaches that utilise ontologies or rules that leverage machine-readable knowledge [MIT16, MIT17, SAP17], and (ii) approaches that utilise machine learning [NUN16, HOV12, QUE17].

Techniques that utilise ontologies rely on a pipeline that consists of parsing unstructured text from OSINT sources to identify entities and their relationships, representing these relationships between the entities, and incorporating reliable facts in a machine-readable format as a form of stored knowledge. This pipeline not only handles the fusion of information into a single form, such as the knowledge graph used in [MIT17], but also comes with safeguards against unreliable information, such as only using curated facts with a Cohen's kappa above a certain threshold as in [MIT17], or a binary classification of relevance based on the tokens contained in the tweets as in [MIT16]. The major merit of approaches that store machine-readable facts in knowledge graphs is that the graphs can not only be visualised: machine learning algorithms can also learn from them and infer new relationships between entities that are not yet known, for example, the likelihood that a certain software will be affected by a certain vulnerability, given its properties and dependencies. However, neither [MIT17] nor [MIT16] takes advantage of state-of-the-art machine learning algorithms that are better able to learn the complex relationships entities can have.

On the other hand, techniques that apply machine learning to labelled unstructured text from OSINT sources typically utilise a pipeline that consists of text pre-processing, vectorisation and classification using a machine learning algorithm [NUN16, HOV12, QUE17]. In some works, the pipeline also includes a noise filtering step. For example, Nunes et al. [NUN16] included a noise filtering step which used a classifier for filtering out irrelevant data. These approaches pre-process the text, which can be tweets or malicious code, then tokenise and vectorise it, typically using bag-of-words approaches [HOV12, QUE17]. This is followed by a Support Vector Machine (SVM), which is known to be a robust classifier when the annotated data is limited and might not be sufficient for end-to-end deep learning approaches. These approaches showed promising results even without enriching the text with domain knowledge to the extent done in the ontology-based approaches. Not leveraging domain knowledge in the form of knowledge graphs is a missed opportunity in these approaches.

2.4 Feeding Protection Systems with OSINT

Another group of works collects OSINT and transforms it into a machine-readable format to feed it into Intrusion Detection Systems (IDS), anti-viruses, or other tools.

Mathews et al. [MAT12] aim to collect information from traditional (e.g., network data, logs) and non-traditional sources (chat rooms, forums, blogs) and feed it to an IDS. An ontology receives data from both the IDS and a Traffic Flow Classifier (a component that monitors packets' headers to infer the traffic's legitimacy), and uses a set of rules plus the collected OSINT to detect attacks.

Liao et al. [LIA16] developed a framework for extracting IoCs from technical literature, whose more predictable structure enables a high recall of the methodology. The terms are extracted and converted to the OpenIoC format, which can be automatically processed by several tools.

In a different work, Zhu et al. [ZHU16] present a system that processes the scientific literature describing Android malware and extracts features describing the attacks to create a malware recognizer. When compared to a manual approach, this fully automated work obtained similar results while using far fewer features and easily allowing updates.

2.5 Deep Learning

As agreed in the DiSIEM project agreement, besides well-established machine learning models, such as Support Vector Machines or Multi-Layer Perceptrons, more recent techniques based on deep learning architectures and algorithms should also be analysed. One of the most common applications of deep learning models is classification, namely sentiment analysis in which, for example, a model is developed to classify whether a product review is positive or negative. We intend to create models with a similar purpose: instead of determining whether a review is positive or negative, we want a model capable of detecting whether or not a tweet references a security event relevant to a given IT infrastructure.

Kim [KIM14] describes a Convolutional Neural Network (CNN) for sentence classification that uses multiple filters with varying widths to extract features from the sentence. According to the paper, this model improved on the results of previous models in 4 out of 7 tasks.

One model which is regularly used for sentence classification is the Recurrent Neural Network (RNN), specifically the Long Short-Term Memory (LSTM) network. The capacity of these networks to capture contextual information through their ability to retain previous information may offer a significant advantage. There is also the option of combining both of these networks, as proposed by Zhou et al. [ZHO15]. By combining the strengths of both architectures the authors present C-LSTM, which uses a CNN to acquire sequences of higher-level representations and an LSTM network to obtain the sentence representation. The authors report that the network outperforms both CNN and LSTM.

Another work that uses a CNN classifier to categorize an entity [CMY17] adds an input channel that contains the entity to be classified. The paper aims to tackle hypernym identification and reports successfully identifying 1.1 million entities with a precision of 99.36%. For our work, we intend to analyse whether this approach can be adapted to extract relevant information from the entities referenced in a tweet to fill an IoC.


3 OSINT Processing Tools and Techniques

In this chapter we present the architectures, the state of development and the results of the techniques proposed and discussed in Deliverable 4.1, “Techniques and tools for OSINT-based threat analysis” [DIS41].

3.1 Blacklisted IPs OSINT Processing

Blacklists are lists containing OSINT information about untrusted elements and are typically used as a cyber-defence mechanism [KÜH14]. Ongoing DiSIEM research focuses on using knowledge from IP blacklists, which are lists of IP addresses deemed malicious, and combining it with internal information about the organization's cybersecurity state.

3.1.1 Trustworthy Blacklists in SIEM Systems

The main objective of the ongoing research is to improve the SIEM's capacity for malware detection. Although the use of public IP blacklists reinforces cybersecurity by monitoring the organization's network communication, these blacklists produce a significant percentage of false positives [KÜH14, SIN08]. Therefore, the secondary objective of the work is the reduction of false positives when assessing the legitimacy of communications with IP addresses suspected of malicious activity. To accomplish the proposed objectives, the DiSIEM research focuses on assessing the reliability of a set of public blacklists and the corresponding IPs. FCiências.ID and EDP have been working on the development and validation of this solution.

To obtain a more reliable list of malicious IPs, leading to a reduction of false positives, it is necessary to classify the reputation of each IP address and each blacklist. This assessment is done using specific security metrics. Blacklists and their contents must be evaluated continuously (or whenever the lists change), and the evaluation must consider the cases of communications from the organization's networks to blacklisted IP addresses.

Figure 1 presents an architectural overview of the framework developed, which includes four modules that can be used independently [DIS41]. The first is the IP Collector, a program that gathers information from public blacklists. The second is the Trustworthiness assessment, which evaluates the reputation of the malicious IP addresses and of the blacklists that contain them. The third, the Trustworthy Assessment of Blacklists Interface (TABI) application, is a web interface for managing the IP addresses, the blacklists and the cases related to communications between the organization and IP addresses suspected of maliciousness. Finally, a reputable list of IPs (BADIP.csv) is introduced in the SIEM and the rules for monitoring and generating alarms are defined. These components are described in the next subsections.


Figure 1 - Workflow of the IP blacklist processing framework.

3.1.2 IPs Collector

The framework uses the OSINT concept to gather information from a pre-specified set of public IP blacklists. The IP Collector collects all the IP addresses from a set of public blacklists, normalizes the syntax of the IP addresses, and establishes the associations between the IP addresses and the blacklists from which they were gathered. The system runs continuously, and the collection of IPs is performed daily. The sources (28) and blacklists (121) used for the case study were selected after a three-month investigation period and are listed in DiSIEM Deliverable 4.1 [DIS41].
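The three steps performed by the IP Collector (collect, normalize, associate) can be illustrated with a minimal sketch. It is not the DiSIEM implementation: the feed names, the sample lines and the assumption that a feed is a list of text lines with optional `#` comments are all hypothetical. Normalization relies on Python's standard `ipaddress` module.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


def normalize_ip(raw: str) -> Optional[str]:
    """Return the canonical text form of an IP address, or None if invalid."""
    # Drop inline comments and surrounding whitespace (assumed feed format).
    token = raw.split("#", 1)[0].strip()
    if not token:
        return None
    try:
        return str(ipaddress.ip_address(token))
    except ValueError:
        return None


def collect(feeds: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each normalized IP to the set of blacklists that report it."""
    ip_to_blacklists: Dict[str, Set[str]] = defaultdict(set)
    for blacklist_name, lines in feeds.items():
        for line in lines:
            ip = normalize_ip(line)
            if ip is not None:
                ip_to_blacklists[ip].add(blacklist_name)
    return dict(ip_to_blacklists)


# Hypothetical feeds using documentation-only address ranges.
feeds = {
    "feed_a": ["203.0.113.7", "  198.51.100.20  # scanner", "not-an-ip"],
    "feed_b": ["203.0.113.7", "2001:0DB8::0001"],
}
associations = collect(feeds)
```

Keeping the association as a mapping from IP to blacklist names makes it straightforward to count how many blacklists report a given address.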

3.1.3 Trust Assessment

The trust assessment aims to classify the reputation of maliciousness of an IP address and the credibility (trustworthiness) of each blacklist. The trustworthiness of a blacklist is determined using Equation (1).

$$\mathit{Trustworthiness} = w_1 \, \mathit{Precision} + w_2 \, \mathit{History} \qquad (1)$$

Where:

• $\mathit{Precision}$ is computed according to the standard ratio between positive cases and total number of cases, considering the ground truth resulting from the investigation of cases related to the blacklist;

• $\mathit{History}$ is the weighted trustworthiness of the blacklist over the last three months; and

• $w_1$ and $w_2$ are the weights of the two components considered, with $w_1, w_2 \in [0,1]$ and $w_1 + w_2 = 1$.

The reputation of maliciousness of an IP address is computed considering four metrics, according to Equation (2).

$$\mathit{Reputation} = w_1 \cdot \mathit{tf} + w_2 \cdot \mathit{Precision} + w_3 \cdot \mathit{Blacklist}_{\mathit{Average}} + w_4 \cdot \mathit{Persistence} \qquad (2)$$

Where:

• $\mathit{tf}$ (Term Frequency) is the number of reports of the IP address (in the set of all blacklists) divided by the maximum number of reports of an IP address;

• $\mathit{Precision}$ is the ratio between the number of incidents and the total number of cases (incidents and false positives), considering the investigation of cases related to this IP;

• $\mathit{Blacklist}_{\mathit{Average}}$ is the average trustworthiness of all the blacklists containing this IP;

• $\mathit{Persistence}$ measures whether the IP was reported or related to incident cases in the period of the last three months; and

• $w_1$, $w_2$, $w_3$ and $w_4$ are the weights for the four components considered, with $w_1, w_2, w_3, w_4 \in [0,1]$ and $w_1 + w_2 + w_3 + w_4 = 1$.

As the solution should be adaptable to the environment of different organizations, when an IP is not reported by the blacklists, it is discarded only if it is not associated with positive cases.
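Equations (1) and (2) are plain weighted sums, so a small worked example may help in checking weight choices. The sketch below is illustrative only: the weight values and the input figures are hypothetical, since the report does not state the weights used at EDP.

```python
def check_weights(weights):
    """Weights must lie in [0, 1] and sum to 1, as required by Equations (1) and (2)."""
    if any(w < 0 or w > 1 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be in [0, 1] and sum to 1")

def trustworthiness(precision, history, weights=(0.5, 0.5)):
    """Equation (1): weighted sum of a blacklist's precision and history."""
    check_weights(weights)
    w1, w2 = weights
    return w1 * precision + w2 * history

def reputation(tf, precision, blacklist_average, persistence,
               weights=(0.25, 0.25, 0.25, 0.25)):
    """Equation (2): weighted sum of the four IP metrics."""
    check_weights(weights)
    w1, w2, w3, w4 = weights
    return w1 * tf + w2 * precision + w3 * blacklist_average + w4 * persistence

# Hypothetical blacklist: 8 incidents out of 10 investigated cases, history 0.6.
t = trustworthiness(precision=8 / 10, history=0.6, weights=(0.7, 0.3))

# Hypothetical IP: reported 3 times (the maximum over all IPs is 12), 1 incident
# out of 2 cases, listed only in the blacklist above, persistence 1.0.
r = reputation(tf=3 / 12, precision=1 / 2, blacklist_average=t, persistence=1.0)
```

With these assumed inputs the blacklist scores 0.7 × 0.8 + 0.3 × 0.6 = 0.74, and the IP scores 0.25 × (0.25 + 0.5 + 0.74 + 1.0) = 0.6225.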

3.1.4 Trustworthy Assessment Blacklists Interface

The Trustworthy Assessment Blacklist Interface (TABI) is a web interface that is being developed to manage and visualize information related to the blacklists, suspicious IPs, open incident cases under investigation, and the organization's public IPs. TABI allows centralized management of the entire framework, without the need for writing code or editing configuration files.

The tool's features include the removal and editing of blacklists and incident cases to be used in the assessment. The TABI application also makes it possible to discover whether a public IP of the organization is contained in any of the public blacklists. For this functionality to be operational, it is necessary to have access to a list of the public IP addresses of the organization.

An example of the TABI dashboard is displayed in Figure 2. The IPs and the blacklists are sorted according to their trustworthiness, and lists of the "top 10" malicious IPs and trustworthy blacklists are shown. The IPs related to the latest cases are listed. In addition, the statistics include the number of IPs used, the percentage of quiet feeds, the number of cases, and the number of organization IPs in the blacklists.


Figure 2 - A screen of the TABI.

3.1.5 Results of the Proposed Framework

The framework has been validated and iteratively improved. The experiments have been performed in the EDP environment. At EDP, the ArcSight SIEM has an average input of 12000 eps (events per second), and an average of 200 security incidents per month are handled by the SOC team.

We conducted a study over a period of five months. Over that time, the solution started to prioritize the IP addresses that were being reported by the blacklists with a better trustworthiness score, as calculated by the assessment module. The score helps the SOC team to know which public blacklists were providing more suitable information about possible threats for the organization. A rule was configured in the EDP SIEM to consider only IP addresses ranked above the 85th percentile (in file BADIP of the framework). Figure 3 displays an example of a rule configuration in ArcSight SIEM.

The implementation of the solution at EDP resulted in an increase of around 80 security cases per month (cases requiring investigation). This resulted from an enhanced overview of the organization's network provided by the new SIEM rules. By maintaining the IP addresses that were continuously reported by the blacklists and that were associated with positive cases (precision), the solution improved the number of true positive cases and led to an increase of 2.57% in precision when compared with a list used by the SOC, which includes public and private/paid blacklists.

However, considering all the cybersecurity tools used by EDP (which comprise many commercial tools), the precision of our solution was 1.2% lower than that of the remaining tools. Nevertheless, this should be interpreted as a positive result, because our solution uses only public information and is not paid.


Figure 3 - Configuration of a rule in ArcSight SIEM.

The persistence and precision components are the main factors behind the solution's good results. These components make it possible to take into account the internal information from the SOC team's incident handling operations, as well as to classify, maintain or discard an IP address.

3.2 Infrastructure-related OSINT Processing

The next subsections present the results achieved by the machine learning tools that were proposed in Deliverable 4.1 [DIS41] and have been developed since then. The tools are meant to provide SOC analysts with timely data concerning the IT infrastructure under their care. Since analysts have a limited time budget to inspect the latest news, these tools are designed to accurately distinguish what is relevant from what is not.

3.2.1 Experimental Machine Learning Approaches

As agreed in the DiSIEM project proposal, besides the DigitalMR commercial listening247 platform, two main types of machine learning approaches were evaluated: well-established methodologies such as SVMs and classical Artificial Neural Networks (ANN), and modern deep learning approaches.

In both cases Twitter was chosen as the OSINT source for two main reasons. First, it is recognized as an aggregator of information related to occurring events of all kinds², including cybersecurity-related events, as demonstrated by the highly active accounts of most security feeds and researchers, where they actively tweet security-related news [CAM13, SAB15]. Further, as a tweet is limited to 280 characters (around 40-60 words), these small messages are simpler to process automatically, enabling very high levels of accuracy and low false positive rates.

² https://www.americanpressinstitute.org/publications/reports/survey-research/how-people-use-twitter-in-general/

Taking into consideration the pragmatics of a SOC reality, these tools have three main objectives:

1. Maximize the amount of relevant information selected;
2. Minimize the amount of irrelevant information selected;
3. Aggregate similar items.

The first objective aims to avoid discarding relevant information, while the second aims to avoid presenting irrelevant information to the analyst. These two objectives are critical to the trustworthiness of a news feed, as the analyst must be sure that the presented threats are relevant and that no other threats are known. The final objective is important to avoid the presentation of duplicate information. Since Twitter is the selected data source, it is important to avoid presenting retweets and the stream of similar tweets about the same events and threats.

Although our approach produces machine-readable data for SIEMs or other security systems (as is done by other works [LIA16, ZHU16]), what the analyst will do with the relevant information is not our primary concern. Nonetheless, finding the relevant information and filtering possible duplicates is a first step in any automatic information processing pipeline. Therefore, this work can be integrated with any other that uses Twitter as an information source.

3.2.1.1 General methodology

In the following sections we present the elements common to both approaches. Figure 4 presents the architecture of both OSINT processing pipelines: gathering tweets, filtering, numerical representation, classification, and finally clustering to summarize the results. The following sections describe each of the processing stages.

Data collection

The collector module requires a set of accounts from which to collect tweets. These can be accounts of security analysts and companies, vendors, hackers, users, researchers, among others. Accounts should be chosen by considering the likelihood that these users tweet about the security of elements belonging to the IT infrastructure being protected.


Figure 4 - The general architecture of the machine learning approaches.

We opted for collecting tweets from selected accounts instead of a keyword-based approach, since the latter is likely to provide large amounts of irrelevant information. Tweets with the word "windows" include all Windows-related topics (the operating system) and all tweets referring to glass windows, besides other non-security-related topics. By collecting data only from selected security-related accounts, a larger fraction of the collected tweets is related to IT security, leaving the focus on filtering out tweets that do not mention threats to the relevant IT infrastructure.

The data collector uses Twitter's API tweet stream capability to continuously collect every tweet posted by the specified set of accounts.

Filtering

The collected dataset most likely includes tweets that are not relevant for the infrastructure under the analyst's care. Therefore, they have to be dropped by a filter.

The filtering mechanism is based on the assumption that a tweet referring to a threat to a certain IT infrastructure asset has to mention the asset's properties. Considering this assumption, a second input has to be provided: a set of keywords describing the monitored IT infrastructure. Only tweets that include at least one of the keywords will pass the filter. Keywords restrict the scope of the security events, hence decreasing the amount of irrelevant tweets beyond the filter.

The keywords should cover as many elements of the infrastructure as possible. For example, if the analyst is in charge of securing a Linux cluster running virtual machines to serve a web service with a database, the keyword set could be:

{linux, ssh, virtualbox, mysql, apache http, php}
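A keyword filter of this kind can be sketched in a few lines. The example below uses the keyword set above; matching on whole words, case-insensitively, is an assumption made for the illustration (the report only states that a tweet must include at least one keyword), and the sample tweets are hypothetical.

```python
import re

KEYWORDS = {"linux", "ssh", "virtualbox", "mysql", "apache http", "php"}

def passes_filter(tweet, keywords=KEYWORDS):
    """Return True if the tweet mentions at least one infrastructure keyword.

    Assumption: keywords match case-insensitively and only as whole words,
    so that "ssh" does not match inside "sshd".
    """
    text = tweet.lower()
    return any(
        re.search(r"(?<![a-z0-9])" + re.escape(k) + r"(?![a-z0-9])", text)
        for k in keywords
    )

tweets = [
    "New MySQL privilege escalation exploit published",
    "Phishing campaign targets bank customers",
    "Apache HTTP server flaw allows request smuggling",
]
kept = [t for t in tweets if passes_filter(t)]
```

Whether to match substrings or whole words is a design choice: substring matching keeps more tweets (higher recall) at the cost of passing more irrelevant ones to the classifier.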

Pre-Processing

To normalize the data representation before classification, each tweet goes through a pre-processing stage, where we remove any hyperlinks and special characters, except "." and "-", since these may give relevant information about product versions (e.g., Linux 4.15.3-1-ARCH). Then, every number is converted to its textual counterpart (e.g., "0" becomes "zero"), and all text is shifted to lower case. Table 2 presents a tweet before and after pre-processing.


Table 2 - Representation of a tweet before and after pre-processing.

Before pre-processing: #0daytoday #Oracle GlassFish Server 4.1 - Directory Traversal Vulnerability [webapps #exploits #Vulnerability... https://t.co/KBcnWUc355

After pre-processing: zero daytoday oracle glassfish server four point one hyphen directory traversal vulnerability webapps exploits vulnerability
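The pre-processing stage can be approximated with a handful of regular expressions. The sketch below is not the project's implementation: it assumes that digits are spelled out one at a time, that a "." between digits becomes "point", that "-" becomes "hyphen", and that any remaining dots are dropped. These rules were inferred from the example in Table 2.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def preprocess(tweet):
    """Normalize a tweet roughly as illustrated in Table 2 (assumed rules)."""
    text = re.sub(r"https?://\S+", " ", tweet)           # drop hyperlinks
    text = re.sub(r"[^A-Za-z0-9.\-\s]", " ", text)       # drop special characters
    text = re.sub(r"(?<=\d)\.(?=\d)", " point ", text)   # version dots, e.g. 4.1
    text = text.replace(".", " ")                        # remaining dots (ellipses)
    text = text.replace("-", " hyphen ")
    # Spell out every digit individually (a simplification of "every number").
    text = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", text)
    return " ".join(text.lower().split())                # lower case, tidy spaces

example = ("#0daytoday #Oracle GlassFish Server 4.1 - Directory Traversal "
           "Vulnerability [webapps #exploits #Vulnerability... "
           "https://t.co/KBcnWUc355")
cleaned = preprocess(example)
```

Spelling out digits one at a time means a version such as 4.15 becomes "four point one five"; converting whole numbers to words would need an additional number-to-text step.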

3.2.1.2 SVM and ANN approach

The next subsections present the specific aspects of both the SVM and classic ANN approaches, followed by the results obtained.

Feature extraction

Before entering the classification stage, tweets have to be converted to a numerical format suitable for supervised binary classification algorithms. This work uses the well-known Term Frequency - Inverse Document Frequency (TF-IDF) technique [LES14]. In summary, TF-IDF assigns weights to words (features) based on the number of times they occur in each specific document and in the group of documents considered. The TF-IDF value of a word increases with the frequency of occurrence of that word in a document, but is scaled down by its frequency of occurrence in the group of documents. If a word is very frequent and appears in the majority of documents, it has a low (or zero) weight. If a word is infrequent and appears in few documents, it will have a higher weight. As each word is mapped to a vector index, a tweet can be represented by a numerical vector with TF-IDF values in the indices corresponding to the words in the tweet, and zeros elsewhere.

As the number of words in a document set may be extremely large, and even variable in the case of stream processing, the hashing trick [WEI09] is used. This allows each tweet to be represented by a fixed-size numerical vector, where each element is zero or the TF-IDF value of the word mapped by the hashing function to that element index.
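To make the combination of TF-IDF and the hashing trick concrete, the following sketch computes fixed-size vectors in plain Python. It is an illustration under assumptions and not the Spark MLlib code used in this work: CRC32 stands in for the hash function, the IDF uses a smoothed form log((N + 1) / (df + 1)), and the three tokenized "tweets" are hypothetical.

```python
import math
import zlib
from collections import Counter

def bucket(word, size):
    """Hashing trick: map a word to a fixed vector index (CRC32 is an arbitrary choice)."""
    return zlib.crc32(word.encode("utf-8")) % size

def fit_idf(documents, size):
    """Smoothed inverse document frequency per index: log((N + 1) / (df + 1))."""
    n = len(documents)
    df = Counter()
    for tokens in documents:
        df.update({bucket(t, size) for t in tokens})   # count each index once per doc
    return [math.log((n + 1) / (df[i] + 1)) for i in range(size)]

def tfidf_vector(tokens, idf):
    """Fixed-size vector: term frequency times IDF at each hashed index, zero elsewhere."""
    size = len(idf)
    vec = [0.0] * size
    for i, tf in Counter(bucket(t, size) for t in tokens).items():
        vec[i] = tf * idf[i]
    return vec

SIZE = 1000  # feature vector size; the report tests values from 30 to 3000
docs = [
    ["linux", "kernel", "exploit"],
    ["linux", "patch"],
    ["linux", "mysql", "exploit"],
]
idf = fit_idf(docs, SIZE)
vec = tfidf_vector(docs[0], idf)
```

Because "linux" occurs in every document, its index receives an IDF of log(4/4) = 0, matching the statement that very frequent words get a low (or zero) weight. Smaller vector sizes raise the chance that two words collide on the same index.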

Classification

For the automatic classification of tweets according to their security relevance, two classifiers have been explored: Support Vector Machines (SVM) [COR95] and Multi-Layer Perceptron (MLP) Neural Networks (NN) [ROS58, RUM85]. The SVM is a broadly used classifier which demonstrates good results in a multitude of contexts. This work considers the SVM implementation available in the Apache Spark Machine Learning library (MLlib), which employs a linear kernel, thereby assuming that the input vectors are linearly separable in input space.

As MLlib does not provide a non-linear kernel SVM, MLlib's MLP NN implementation was also considered in order to obtain results based on the assumption that the input vectors may not be linearly separable. The MLP is a well-established and frequently used NN architecture, which has a long track record of good and consistent results over a vast number of classification tasks.

Figure 5 illustrates the difference between linear and non-linear separation of inputs in classification tasks. The left plot shows that a linear model fits a line dividing input data into two classes. Non-linear models transform input data non-linearly into an intermediate space where the data becomes linearly separable. This is equivalent to finding a curve in input space that successfully divides the data into two classes, as shown in the right plot of the figure.

Figure 5 - Comparison between linear and non-linear separation.

Experimental setup

This section describes the experimental work that was carried out to obtain a working tweet processing pipeline as described in the previous section. All code is written in Scala using the Apache Spark framework³ pre-built with Hadoop. Spark was chosen because its data structures are scalable and designed for large datasets. Also, Spark includes a scalable machine learning library called MLlib, used to implement all ML base algorithms employed in this report.

Infrastructure definition

In a large organization the IT infrastructure is composed of many hardware and software assets. By using risk analysis the analyst selects a subset for which OSINT should be collected, filtered and summarized. The diversity of assets that may be selected in a large and complex organization raises one question related to the classification stage of the pipeline: is it better to have one classifier covering the whole infrastructure being monitored, or is it preferable to have multiple classifiers focused on specific parts?

While within the DiSIEM project we are collecting tweets relevant for the IT infrastructures of the industrial partners (EDP, Amadeus, Atos), the research work that has been conducted could not wait for the collection and labelling of a sufficiently large dataset for each partner. Therefore, an existing tweet dataset that considers a hypothetical IT infrastructure has been employed.

³ http://spark.apache.org. Online, accessed 19/02/2018.


The hypothetical IT infrastructure designed previously for experimental work was divided into four parts, as presented in Table 3. The table shows the keywords that are used in the filtering stage. The last row considers the case where one single classifier will be fed by tweets related to any of the four infrastructure parts.

Table 3 - The infrastructure designed for tweet collection and filtering, and its subdivision into four coherent parts.

Label   Keywords
A       oracle, cisco
B       google chrome, chrome, internet explorer, firefox, microsoft edge, edge
C       wordpress, joomla, wp
D       microsoft windows, ms, linux, operating system, operating systems
ABCD    A, B, C, D

Part A is a simple representation of Cisco and Oracle products, part B considers the browsers used in the organization, part C relates to the content management systems deployed, and part D considers the operating systems used.

Tweet collection and labeling

Three datasets were collected during three periods of time, as shown in Table 4, where the collection period, the sets of accounts used, the number of tweets, and the distribution over the infrastructure parts may all be observed. After being collected, each tweet was visually inspected and manually labelled as positive (the tweet mentions a threat to a given part of the IT infrastructure) or as negative, thus creating labelled datasets suitable for supervised learning approaches. Four rows in Table 4 identify the numbers of tweets related to each of the considered parts. Notice that the number of ABCD tweets is less than the sum of the parts. This happens because tweets mentioning more than one infrastructure part were not duplicated when the tweets were merged.

Table 4 - Datasets collected and labelling details.

Dataset                       D1                        D2                        D3
Time period (from / to)       01/11/2015 - 01/04/2016   01/04/2016 - 15/05/2016   15/05/2016 - 10/07/2016
Account sets                  S1                        S1, S2                    S1, S2
Num. tweets                   71024                     57579                     66608
                              Pos.    Neg.              Pos.    Neg.              Pos.    Neg.
Related to A                  556     514               177     249               502     256
Related to B                  217     497               86      446               420     362
Related to C                  486     606               138     900               425     303
Related to D                  441     691               138     2697              336     1232
Related to all (A, B, C, D)   1697    2008              536     4292              1680    2153


As shown in the third row of Table 4, two sets of accounts, S1 and S2, were used for tweet collection. The accounts per set are identified in Table 5.

Table 5 - Sets of accounts used to create the datasets.

S1 accounts: inj3ct0r, TrustedSec, Anomali, briankrebs, Secunia, exploitdb, alienvault, slashdot, dstrom, Info Sec Buzz, vuln lab, threatintel, dangoodin001, ivspiridonov, ThreatFeed, pikisec, SANSInstitute, johullrich, drericcole, F1r3h4nd, MaldicoreAlerts, USCERTgov, gcluley, halpomeran, SecurityWeek, SecurityNewsbot, sansisc, ekaspersky

S2 accounts: TenableSecurity, securitywatch, securityaffairs, zer0element, notsosecure, CyberExaminer, SCMagazine, DMBisson, lennyzeltser, IT securitynews, teamcymru, WordPress, MicrosoftEdge, JoomlaTips, sjzaib, SecurityMagnate, Cisco, Dell, linuxtoday, securityninja, cyberopsy, OWASPJava, WPScan, dplusk, threatpost, Rootsector, Microsoft, linuxfoundation, ChidoDike, Sec Cyber, ptracesecurity, msftsecurity, LinuxSec, hack3rsca, CiscoSecurity, NytroRST, joomla, Windows, crackerhacker00, fstenv, HPE Security, googlechrome, wordpressdotcom, packet storm, RokaSecurity, Oracle, firefox, wpbeginner, YoKoAcc, SecurityCrap, jasonlamsec, threatmeter

Feature extraction

We used Spark's implementation of TF-IDF with default parameters, except for the feature vector size. In order to find a suitable vector size to describe the tweets, eleven values were tested: {30, 50, 80, 100, 200, 300, 500, 750, 1000, 1500, 3000}. This range covers low- to high-dimensional vectors, and with it we should be able to find an appropriate vector size for the datasets.

Classification

In the design of the classifiers, each relevant parameter was varied in order to find which is the best approach for this application and how it should be designed. For the SVM we varied C (the regularization parameter) within {0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5}, and the step size (a parameter of the Stochastic Gradient Descent method used for training) within {0.1, 0.5, 1, 1.5, 2, 5}. For the MLP, the number of layers varied from 2 to 8, and the number of neurons per layer within {5, 7, 10, 12, 14, 16, 18, 20}.

The models were trained using dataset D1 and evaluated by performing 10-fold cross-validation. The limit on the maximum number of training iterations was set to 100 for the SVM and 200 for the MLP, values that were tested and found to achieve good parameter convergence.

Figure 6 shows Pareto curves for some of the tested configurations. In that figure, each point shows the average value obtained by a specific configuration over the 10-fold cross-validation procedure. The Pareto front is shown with lines connecting the dominant configurations in terms of True Positive Rate (TPR, x-axis) and True Negative Rate (TNR, y-axis), for both types of classifiers. For infrastructure part A (Oracle, Cisco), it is possible to see that the SVM solutions dominate the MLP ones. A possible explanation is that this infrastructure generated a simpler dataset, whose patterns were easily captured by a linear classifier, but which was not complex enough to properly train the MLP NN configurations, which may have over-fitted the data. For the other infrastructures, the results are not so clear in terms of dominance.

The highlighted points in the top-right of the figures are the selected Pareto-optimal configurations: the ones with the best balance between TPR and TNR (smallest distance to the optimum). Table 6 presents these configurations, revealing that there is a clear advantage in using high-dimensional feature vectors. The models presented were used for further evaluation of the approach.
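The selection procedure described above, keeping the non-dominated configurations and then choosing the one nearest the optimum, can be sketched as follows. The configuration names and rates are hypothetical (they are not results from this report), and measuring "distance to the optimum" as the Euclidean distance to the point (TPR, TNR) = (1, 1) is an assumption.

```python
import math

def pareto_front(points):
    """Keep configurations not dominated in (TPR, TNR); higher is better for both."""
    front = []
    for name, tpr, tnr in points:
        dominated = any(
            (t >= tpr and n >= tnr) and (t > tpr or n > tnr)
            for _, t, n in points
        )
        if not dominated:
            front.append((name, tpr, tnr))
    return front

def best_balance(front):
    """Pick the front member closest to the ideal point (TPR, TNR) = (1, 1)."""
    return min(front, key=lambda p: math.hypot(1 - p[1], 1 - p[2]))

configs = [  # hypothetical cross-validation averages
    ("svm-1", 0.90, 0.85),
    ("svm-2", 0.80, 0.95),
    ("mlp-1", 0.85, 0.84),  # dominated by svm-1
    ("mlp-2", 0.70, 0.96),
]
front = pareto_front(configs)
best = best_balance(front)
```

In this toy data three configurations are on the front, and "svm-1" is selected because its distance to (1, 1) is about 0.18, against about 0.21 and 0.30 for the other two.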

Figure 6 - The Pareto curves for SVM and MLP using D1 for datasets A and ABCD, respectively.

Table 6 - The best configurations obtained for each classifier and dataset.

Configurations          A      B      C      D      ABCD
SVM  Feature size       3000   3000   3000   3000   1000
     C                  0.5    0.05   0.2    0.01   0.05
     Step size          0.5    1.5    5.0    1.5    5.0
MLP  Feature size       1500   3000   3000   3000   3000
     Num. layers        4      7      3      7      5
     Neurons/layer      5      20     10     20     10

Results

The tweet processing pipeline was evaluated by using the selected models, employing datasets D2 and D3. These sets were generated with tweets posted after those in the training set (D1), and include information posted by an additional set of accounts (S2) not considered in the training stage. This evaluation methodology embodies the idea that in a real situation, after being trained, models will classify data from future events, and that over time new Twitter accounts will be added to (or removed from) the system.


Considering that cross-validation was employed during the model selection phase, it should be noted that the selected model configurations were retrained using the whole D1 dataset. The feature vectors corresponding to tweets in D2 and D3 were generated using the TF-IDF model determined using dataset D1. This guarantees that the TF-IDF weights attributed to words in D2 and D3 will be coherent with those employed to train the classifiers.

Figure 7 and Figure 8 show the performance of the best MLP and SVM classifiers (see Table 6) in terms of TPR, TNR, FPR (False Positive Rate), and FNR (False Negative Rate), presenting also the average result obtained by 10-fold cross-validation with D1. As the ensemble of classifiers is composed of one classifier for each infrastructure part, its output is obtained by computing the performance metrics after merging the outputs of the classifiers. The D1 column in the ensemble graph was obtained by testing the models with the training data, hence showing near-perfect rates.
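For reference, the four rates used in Figures 7 and 8 follow directly from the confusion counts. The sketch below shows the standard definitions; the label and prediction lists are hypothetical, and the function assumes that both classes are present so that no denominator is zero.

```python
def rates(labels, predictions):
    """Compute TPR, TNR, FPR and FNR from binary labels (1 = threat, 0 = not)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return {
        "TPR": tp / (tp + fn),  # relevant tweets correctly selected
        "FNR": fn / (tp + fn),  # relevant tweets wrongly discarded
        "TNR": tn / (tn + fp),  # irrelevant tweets correctly discarded
        "FPR": fp / (tn + fp),  # irrelevant tweets wrongly selected
    }

# Hypothetical evaluation: 4 positive and 6 negative tweets.
labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
result = rates(labels, predictions)
```

Since TPR + FNR = 1 and TNR + FPR = 1, reporting TPR and TNR is sufficient; the other two rates are shown in the figures for readability.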

Figure 7 - MLP classifier results for infrastructures A, B, C, D, ABCD, and the classifier ensemble, respectively.

Figure 8 - SVM classifier results for infrastructures A, B, C, D, ABCD, and the classifier ensemble, respectively.

In general, the results are slightly worse in D2 and D3 when compared to D1 (as expected), since new data presents unmodelled patterns to the classifiers. As time passes and as new accounts are added, this effect should become more pronounced.

Focusing on the results obtained in datasets D2 and D3, in general the classifiers maintain very high TPR and TNR, except for the models specific to infrastructure part B, which exhibit a significant drop in TPR. This might be explained by the fact that part B has the smallest number of training examples (see Table 4) and also the highest class imbalance (the negative examples are more than twice the number of positive ones), hence being more sensitive to the novel data of D2 and D3.

In most cases, the TNR is higher than the TPR. The exceptions are the SVM model for infrastructure part A, where the TPR is higher than the TNR, and the SVM model for infrastructure ABCD and the MLP model for infrastructure D, where the results are comparable. The imbalance between positively and negatively labelled data in the training datasets (almost always more negative samples) explains a higher TNR. Unlike the other infrastructure parts, A and ABCD are the only ones with balanced samples, which explains why their models show a better TPR/TNR balance.

In terms of FPR and FNR, the models for infrastructure A show the highest FNR, while the models for infrastructures B and C show a considerably high FPR (with the exception of the SVM model for infrastructure C). For the remaining infrastructures, the FPR is usually between 10% and 20%, and the FNR below 10%. The high FPR is most likely connected to the type of tweets posted regarding these software elements (browsers and content management systems), which typically prompt discussion regarding their characteristics and problems, but not always related to security.

When comparing SVMs and MLPs considering the specific models for the infrastructure parts, the results are comparable in terms of TNR, but the SVM shows a consistent advantage regarding the TPR. Regarding the classification models trained for the complete IT infrastructure, the SVM achieved the best results, also showing the best balance between TPR and TNR, with FPR and FNR below 10%. The ensembles of classifiers show very similar rates.

One of the questions raised in this work concerned the use of classifiers. Two alternatives were considered: one model per infrastructure part (all together represented by the classifier ensemble) vs. one model for the whole infrastructure. With the exception of part D (for both SVM and MLP), the models for a single infrastructure part suffered from an imbalance between TPR and TNR. This is also the case for the ensembles of classifiers and for the MLP complete infrastructure model. When comparing the two possibilities, for our datasets and the defined infrastructure parts, the models that achieved the highest TPR and TNR were those that encompass the whole infrastructure; further, the SVM model for ABCD achieved results superior to those attained by the MLP one.

Overall, the SVM model for the whole infrastructure stands out, achieving high TPR and TNR and a good balance between these metrics on D2 and D3.

3.2.1.3 Deep learning approach

The sections below describe the classification approach using deep learning architectures.

Methodology

In terms of the general architecture, our Deep Learning (DL) approach draws heavily on the one described in Section 3.2.1.1. As displayed in Figure 9, the initial stages of data collection, filtering and pre-processing are nearly identical; however, we do not require a set of input features. These features are automatically learned during the training process.


Figure 9 - Architecture of the deep learning approach.

Through the implementation of DL algorithms we expect an improvement not only in the results but also in the scalability and automation capabilities of the classifier, requiring only properly labelled datasets for its training and thus creating a better abstraction of the problem.

Pre-Processing

The pre-processing is identical to the process described in Section 3.2.1.1. However, we prepend to each tweet the infrastructure element that resulted in the tweet being collected, as presented in Table 7.

Table 7 - Extension of the Pre-Processing stage.

Before pre-processing: #0daytoday #Oracle GlassFish Server 4.1 - Directory Traversal Vulnerability [webapps #exploits #Vulnerability... https://t.co/KBcnWUc355

After pre-processing: zero daytoday oracle glassfish server four point one hyphen directory traversal vulnerability webapps exploits vulnerability

After prepending relevant infrastructure: oracle zero daytoday oracle glassfish server four point one hyphen directory traversal vulnerability webapps exploits vulnerability

Neural networks employed

In our Deep Learning approach to the classification task we used a Convolutional Neural Network (CNN), which is mostly known for its capabilities and outstanding results in computer vision. Before defining what exactly a CNN architecture is, we should elaborate on why it may be favoured over an MLP neural network. MLP networks are fully connected: each neuron from a previous layer is connected to every neuron from the following layer, thus resulting in a large number of parameters as the input increases in size. CNNs offer a more sophisticated and local approach through their convolution and pooling layers.

A basic CNN consists of a Convolutional Layer, a Max-Pooling Layer and an Output Layer, plus additional non-linear activation functions that are present between these layers.

In our case, the Convolutional Layer of the network expects N input tensors, N being the number of tweets. Each tensor has three dimensions [width, height, depth]. Considering tweets as inputs, these dimensions correspond, respectively, to the size of the feature vectors representing each word (width), to the number of words in a tweet (height), and to the number of vector representations being used (depth). This layer also outputs a number of tensors with three dimensions, called feature maps. Each feature map is created by cubic [width, height, depth] convolutional filters, whose parameters are learned automatically during training. These filters are slid (height, width, depth) across each input tensor to produce a feature map.

In Figure 10 we have an example of this operation considering one tensor (one tweet) with only one vector representation for each word (depth 1) and also one 3-by-3-by-1 filter. The large square matrix represents one tweet with 5 words (height) where each word is represented by one (depth) feature vector of dimension 5 (width). The coloured 3-by-3 squares represent the filter at successive positions as it is slid over the tweet to produce one feature value at each step (different colours). Once the sliding operation is over, a complete feature map is produced for each tweet (lower-right small matrix).

Figure 10 - Example of a convolution operation.

On the Max-Pooling Layer, we take the resulting feature maps from the Convolutional Layer and apply a max operation with a selected window size, sliding it across the feature maps and retaining only the most relevant features. If we were to apply a max operation to the feature map computed in Figure 10, then the resulting neuron would have a value of 4. Finally, in the Output Layer the results from the Max-Pooling Layer are concatenated and fed to a fully connected neural network whose output connects to a softmax function that finally outputs a prediction. The whole architecture is also shown in Figure 11.
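The three operations just described (sliding a filter, max pooling, softmax) can be traced by hand on a small example. The sketch below uses a hypothetical 5-by-5 input and 3-by-3 filter with depth 1; the values are illustrative and are not taken from Figure 10, although they were chosen so that the pooled value is also 4. The two output scores fed to the softmax are likewise made up.

```python
import math

def convolve2d(matrix, kernel):
    """Slide the kernel over the matrix (no padding, stride 1), summing products."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(matrix) - kh + 1
    cols = len(matrix[0]) - kw + 1
    return [
        [
            sum(matrix[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def max_pool(feature_map):
    """Global max pooling: keep only the largest value of the feature map."""
    return max(max(row) for row in feature_map)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tweet = [            # 5 words (rows), each a 5-dimensional vector (hypothetical)
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [           # one 3-by-3 filter with depth 1 (hypothetical weights)
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
feature_map = convolve2d(tweet, kernel)
pooled = max_pool(feature_map)
probabilities = softmax([0.5 * pooled, -0.5 * pooled])  # made-up output weights
```

Here the 3-by-3 feature map is [[4, 3, 4], [2, 4, 3], [2, 3, 4]], so global max pooling yields 4. In the sentence-classification setting described later, filters span the full width of the word vectors, so each feature map is a single column.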


Although CNNs are mostly known to be the state of the art in many computer vision tasks, there is also recent work which efficiently uses modified CNN architectures to solve NLP tasks with great results.

Experimental setup

The next subsections detail the specificities of the experimental setup used to validate the deep learning approach described above.

Dataset

The data used is the same as in Table 4, divided into three datasets (D1, D2, D3). Training and evaluation were done using D1 through a 10-fold cross-validation methodology. Unlike what was done for the SVM and MLP approaches, only the full infrastructure was considered, i.e., we consider the IT infrastructure in the last row of Table 3 and the datasets described in the last row of Table 4.

Neural network design

For our task, we intend to convert each word in a tweet into a vector of numbers and stack them to form a matrix which is fed to the CNN. We based our implementation on the work of Yoon Kim [KIM14].

Figure 11 presents the architecture. Our network begins with an Embedding Layer, followed by the Convolutional Layer containing several different filters with the same depth, a max-over-time pooling layer, and finally the Output Layer, which applies dropout and a softmax function for classification.

Figure 11 - Convolutional Neural Network for Sentence Classification (based on [KIM14]).

The Embedding Layer computes one or more numerical vectors for each word of the input tweets. For each tweet the embedding function produces one matrix which constitutes an n-by-k-by-l representation of the tweet, n being related to the number of words in tweets, and therefore defined by the longest tweet found in the training set, k representing the number of columns, corresponding to the dimension of the embedded vectors, and l corresponding to the number of embedding vectors computed for each word.

In the Convolutional Layer we have several filters with different heights, although every filter has the same width and depth, such that all filters will produce the same number of feature maps. As an example, in Figure 11 we have a matrix which has a height of 11 vectors (11 would have been the longest tweet found in the training dataset) and a width of 10 (each word is represented by a 10-dimensional vector). The example tweet has only 10 words, therefore the remaining vector is padded with a pre-specified value. By considering that the blue filter has a height of 2 and a depth of 3, as it slides down the matrix it produces 3 nodes for every sequential combination of 2 words, resulting in 3 feature maps. Then, through a max-over-time pooling operation, each of these feature maps gets reduced to its maximum value, and the maximum values of all feature maps are concatenated.

Finally, these nodes are sent to the Output Layer, where we first apply a dropout function, which allows the network to generalise better and prevents over-fitting by randomly eliminating a fraction of nodes (according to a previously defined percentage). Then, these values enter a fully connected neural network which is connected to a softmax function that outputs the prediction, classifying whether or not a tweet mentions a threat to the infrastructure.

The initial embedding layer has five alternatives:

1. CNN-rand: Uses randomly initialized vectors of dimension 128 which are tuned during training.

2. CNN-static: Uses pre-trained vectors, but these remain unchanged during training.

3. CNN-non-static: Uses pre-trained vectors which are further tuned during training.

4. CNN-rand-300: As CNN-rand but with the dimension extended to 300.

5. CNN-multichannel: A combination of previous models (e.g., static and non-static, or non-static with different pre-trained vectors). This implies a depth larger than one in the input tensors.

For the pre-trained vectors, we used a word2vec [MIK13] language model developed by Google. This model contains vectors for 3 million words and phrases, trained on roughly 100 billion words collected from a Google News dataset. Each of these vectors has a dimensionality of 300. Given that this model was not trained on security-related text, there are several common words from the security field that are not present in it, and so their vectors are initialized randomly.


Results

Variations

For a first analysis we sought to analyse the five word embedding alternatives. We chose the initial set of hyper-parameters from Branco's work [Branco17]. As shown in Figure 12, the alternative that achieved the best balance of results considering datasets D2 and D3 was CNN-non-static. As such, we used it for the remaining work.

Figure 12 - Comparison of the five model variants.

Design variables and hyper-parameters

Besides the model parameters that are estimated by the training algorithm, the CNN model has several design variables and hyper-parameters. We conducted an experiment to select appropriate numbers of filters, the height and depth of filters, and the dropout rate. This exploration resulted in testing a total of 1945 models, whose design variables and hyper-parameters were varied as described:

• Number of filters: varied from 2 to 34;

• Filter height: to understand the advantages of using smaller or larger filter heights, 3 cases were considered, small, medium and large, where the heights depended on the number of filters:

    ◦ For the small case, if the number of filters was 2 the heights would be [2,3], if the number of filters was 3 the heights would be [2,3,4], and so on: [2,3,4,5], [2,3,4,5,6], …, [2,3,…,35];

    ◦ For the medium case, if the number of filters was 2 the heights would be [18,19], if the number of filters was 3 the heights would be [17,18,19], if the number of filters was 4 the heights would be [17,18,19,20], and so on: [16,17,18,19,20], [16,17,18,19,20,21], …, [6,7,8,…,28,29,30];

    ◦ For the large case, if the number of filters was 2 the heights would be [34,35], if the number of filters was 3 the heights would be [33,34,35], and so on: [32,33,34,35], …, [11,12,…,34,35].

• Filter depth: varied within the following set: {8, 16, 32, 64, 96, 128, 192, 256};

• Dropout: varied within the following set: {0.33, 0.5, 0.66}.
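The three filter-height cases listed above follow a regular pattern, which the sketch below reproduces. The function name and the closed-form expressions are our reading of the listed examples (for instance, the medium case is assumed to grow alternately downwards and upwards around [18,19]); they are not taken from the experiment code.

```python
from itertools import product

DEPTHS = [8, 16, 32, 64, 96, 128, 192, 256]
DROPOUTS = [0.33, 0.5, 0.66]


def filter_heights(n_filters: int, case: str) -> list:
    """Return the consecutive filter heights for a given number of filters."""
    if case == "small":
        low = 2                              # [2,3], [2,3,4], ...
    elif case == "medium":
        low = 18 - (n_filters - 1) // 2      # [18,19], [17,18,19], [17,...,20], ...
    elif case == "large":
        low = 36 - n_filters                 # [34,35], [33,34,35], ...
    else:
        raise ValueError(f"unknown case: {case}")
    return list(range(low, low + n_filters))


# Every height configuration is combined with each (depth, dropout) pair.
depth_dropout_grid = list(product(DEPTHS, DROPOUTS))
```

For example, `filter_heights(25, "medium")` spans 6 to 30 and `filter_heights(25, "large")` spans 11 to 35, matching the last configurations listed for those two cases.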

To summarize the results, we found that the models in the small case of filter heights using higher filter depths performed better simultaneously in TPR and TNR. The three best models were the following:

Table 8 - The structure of the best models obtained.

Model                                                                         Dataset 2          Dataset 3
                                                                              TPR      TNR       TPR      TNR
Few (7) filters of small height with depth 192, and dropout rate of 0.33      0.9278   0.9762    0.9142   0.9442
Many (26) filters of small height with depth 32, and dropout rate of 0.66     0.9319   0.9735    0.9099   0.9423
Few (11) filters of small height with depth 96, and dropout rate of 0.33      0.9341   0.9721    0.9143   0.9436

Architecture expansion

Besides the experimentation with hyper-parameters, we also modified the network architecture by adding a fully connected layer with rectified linear unit (ReLU) activation functions between the nodes resulting from the max-over-time pooling layer and the output layer. We took the last model presented in the table above and compared the results with and without the additional layer. The results are plotted in Figure 13. The results are similar, although an improvement in TPR was noticed.

Figure 13 - Comparison of a model without and with an additional fully connected layer.


3.2.1.4 Clustering

Since Twitter users can retweet or tweet about the same subject, it is expected that the system will collect many similar tweets. This implies that, to cover information about the complete IT infrastructure and to verify the validity of tweets, the analyst may be required to manually inspect a large amount of redundant information for each threat.

To alleviate this burden, clustering is used to group similar tweets that have been classified during a certain time span as relevant for the protection of the IT infrastructure. Ideally, the information collected over the specified time span about a specific threat is summarized by one cluster, from which a single representative tweet (the exemplar) is selected for presentation to the analyst. The exemplar is selected by finding the tweet closest (in the Euclidean sense) to the cluster centroid.

For the purpose of grouping the results, the k-means clustering [MAC67] algorithm was applied to the tweets' feature vectors obtained using TF-IDF. k-means is a widely used clustering algorithm that has provided good efficiency and empirical success over the last 50 years [JAI10]. However, k-means is commonly employed for exploratory data analysis, not automatic text data summarization. Therefore, we derived a method for doing such summarization.

Methodology

k-means application strategy

The k-means clustering algorithm requires the specification of the number of clusters K, which is unknown in this case, i.e., at a given time we do not know for sure how many potential threats to our infrastructure are being discussed by our sources. We implemented a strategy to find the so-called elbow point [TIB01], i.e., the point beyond which increasing K brings no significant improvement in a metric for evaluating the clustering. It assumes that at least two clusters are required and that the initial cluster positions are chosen randomly. The procedure automatically determines K, thus avoiding the specification of a threshold to find the elbow point or the visual inspection of the within-class-variance versus K graph.

The method exploits the knowledge that the Sum of Squared Errors (SSE) decreases with increasing K (down to 0 when K = N, N being the number of points), and the variability that is introduced by the randomness of the initial cluster positions. Let $K = K_1, \cdots, K_{N-1}$, and $i = i + 1$. For each successive value of $K_i$ a k-means model $M_i$ is trained, and a corresponding SSE value is obtained, denoted by $\varepsilon_i$. As k-means may converge to a different local minimum each time it is executed, different random initializations would result in different values of $\varepsilon_i$, therefore a variance $\sigma_i$ is associated to $\varepsilon_i$. Additionally, let $\delta_i = \varepsilon_{i+1} - \varepsilon_i$ represent the SSE difference obtained by consecutive values of K.


For small values of K (small number of clusters), the consecutive values of $\varepsilon_i$ are expected to decrease by a relatively large amount, as a significant improvement in the clustering is expected. So it may be assumed that up to a certain value of $i$, $\varepsilon_{i+1} < \varepsilon_i$, independently of the random initialization, because $\delta_i$ compensates for the variances $\sigma_i$ and $\sigma_{i+1}$. When $i$ reaches a sufficiently large value, the absolute value of $\delta_i$ becomes small enough so that its sign may change due to the variances $\sigma_i$ and $\sigma_{i+1}$. More precisely, the decrease in $\varepsilon$ becomes of the same order as the variance $\sigma_i$. At this iteration, $K_i$ is selected as the number of clusters beyond which no significant decrease in $\varepsilon_i$ will be obtained, meaning that no significant improvement is achieved by adding another cluster.

In practice, the value of K has to be increased until the SSE stops decreasing, i.e., until $\delta_i = \varepsilon_{i+1} - \varepsilon_i$ becomes greater than or equal to zero. The technique for finding an adequate K is shown in Algorithm 1.

Algorithm 1 - k-means application strategy.
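The stopping rule described above can be illustrated with a small, dependency-free sketch. It is not a reproduction of Algorithm 1: the plain Lloyd-style k-means, the function names and the fixed random seed are assumptions made only to keep the example self-contained and repeatable.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sse(points, k, rng, iterations=50):
    """Run k-means from random initial centroids and return the final SSE."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def choose_k(points, seed=0, k_min=2):
    """Increase K until the SSE no longer decreases (delta_i >= 0)."""
    rng = random.Random(seed)
    k = k_min
    previous = kmeans_sse(points, k, rng)
    while k < len(points):
        current = kmeans_sse(points, k + 1, rng)
        if current - previous >= 0:   # delta_i = eps_{i+1} - eps_i
            return k
        k, previous = k + 1, current
    return k


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0), (10.1, 0.0), (10.0, 0.1)]
selected_k = choose_k(points, seed=1)
```

Because the rule depends on the random initializations, the selected K can change between runs when the seed is not fixed, which is consistent with the remark in the text that the strategy is not guaranteed to find the best number of clusters.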

The k-means approach was tested and, although some small clusters had only very similar tweets, some large clusters were obtained containing unrelated tweets. The cause might be two-fold. On the one hand, k-means assumes spherical clusters and tends to produce clusters roughly of the same size, which might not be adequate. On the other hand, the strategy to find the number of clusters is not guaranteed to find the best number of clusters.

In order to counteract the effects of wrong assumptions on data and of a wrong number of clusters, a cluster cohesion measure was used to quantify how closely related the elements of a cluster are. Such a measure enables the validation of clusters as final when it indicates a high level of cohesion.

By using the cohesion measure, after the execution of the k-means algorithm, a certain number of final and non-final clusters are produced. Non-final clusters still have to be further separated into smaller meaningful clusters, ideally related to a single threat. This is accomplished by re-clustering the feature vectors associated with tweets contained in non-final clusters.

Cluster cohesion measure

Cluster cohesion and cluster separation are two concepts used to assess the validity of a partition generated by a clustering algorithm. Most frequently they are combined to produce a cluster validation index [ARB13]. The majority of cluster cohesion algorithms are based on the distance to the cluster centroid, hence having a purely geometric interpretation.

The clusters being analyzed are made of TF-IDF feature vectors whose elements are related to specific words. In this case, cohesion could also be analyzed by looking at the words that the vectors in a cluster have in common. This idea is clearly more related to the similarity of tweets within a cluster than the geometric distance to the cluster centroid, which is insensitive to the location of elements in the same centroid-centered hypersphere. It also reflects a more end-use oriented or context-based cluster validation approach, which has been argued to be more effective [GUY09].

To reinforce the one-to-one relation between clusters and security threats, plus the end-use orientation of the approach, the cohesion measure should be able to detect clusters whose feature vectors relate to tweets having the same theme. Assuming that the theme is expressed by a minimum number of words that appear in all tweets, the proposed context-based cohesion measure, named Within-cluster Theme Similarity ($WTS$), is given by the ratio of the number of words shared by all the cluster's tweets (as given by TF-IDF), $\omega$, to the size of the smallest tweet in the cluster (in number of words), $w_s$:

$$WTS = \frac{\omega}{w_s}$$

$WTS$ is 0 if no words are shared by all the tweets in a cluster, and is 1 if all tweets share the words of the smallest tweet in the cluster. This measure assumes that if all tweets in a cluster share a sufficiently high number of words, then the tweets have a similar theme.

For the cluster separation index we use the Jaccard index [ZAK14], computed on the basis of the sets of words shared by all the tweets of each cluster. Denoting such sets by A and B, for any two clusters the index is determined as:

$$J = \frac{|A \cap B|}{|A \cup B|}$$

This corresponds to the number of words common to A and B divided by the number of distinct words appearing in A or B. The lower its value, the more distinct the two sets, hence the more separated the two clusters are.
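The two measures defined above can be sketched directly on tokenized tweets. The helper names are hypothetical, and the sketch assumes that a tweet is already reduced to its list of TF-IDF terms and that the size of the smallest tweet is counted in distinct words.

```python
def shared_words(cluster):
    """Words that appear in every tweet of the cluster."""
    word_sets = [set(tweet) for tweet in cluster]
    return set.intersection(*word_sets) if word_sets else set()


def wts(cluster):
    """Within-cluster Theme Similarity: shared words / size of smallest tweet."""
    smallest = min(len(set(tweet)) for tweet in cluster)
    return len(shared_words(cluster)) / smallest if smallest else 0.0


def jaccard(words_a, words_b):
    """Jaccard index between the shared-word sets of two clusters."""
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


cluster_1 = [["linux", "kernel", "cve"], ["linux", "kernel", "cve", "ubuntu"]]
cluster_2 = [["windows", "kernel", "rce"], ["windows", "kernel", "rce", "patch"]]

cohesion = wts(cluster_1)
separation = jaccard(shared_words(cluster_1), shared_words(cluster_2))
```

In this toy case every word of the smallest tweet of `cluster_1` is shared, so its cohesion is 1, while the two clusters only have "kernel" in common out of five distinct shared words.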

Re-clustering

If $WTS \geq \lambda$ ($\lambda$ being a specified threshold), the cluster is considered final and is saved. If, on the contrary, $WTS < \lambda$, then that cluster is marked as non-final. All the feature vectors of non-final clusters are gathered into a single collection that is clustered again using Algorithm 1. Non-final clusters are likely to include unrelated data, so they are all merged to allow tweets of the same theme to be grouped after the clustering operation. This re-clustering technique is represented in Algorithm 2.


Algorithm 2 - Re-clustering algorithm.
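The re-clustering loop can be sketched as follows. This is an illustration of the described procedure rather than a transcription of Algorithm 2: the clustering step and the cohesion measure are passed in as functions, the `max_rounds` guard and the fallback to singleton clusters are assumptions added so that the sketch always terminates, and `group_by_first_word` is a toy stand-in for the k-means step.

```python
def wts(cluster):
    """Shared words divided by the size (distinct words) of the smallest tweet."""
    sets = [set(t) for t in cluster]
    return len(set.intersection(*sets)) / min(len(s) for s in sets)


def group_by_first_word(tweets):
    """Toy clustering function: groups tweets by their first word."""
    groups = {}
    for tweet in tweets:
        groups.setdefault(tweet[0], []).append(tweet)
    return list(groups.values())


def recluster(tweets, cluster_fn, cohesion_fn, threshold=2 / 3, max_rounds=20):
    final, pending = [], list(tweets)
    for _ in range(max_rounds):
        if not pending:
            break
        next_pending = []
        for cluster in cluster_fn(pending):
            if cohesion_fn(cluster) >= threshold:
                final.append(cluster)          # cohesive enough: keep as final
            else:
                next_pending.extend(cluster)   # merge back for another round
        pending = next_pending
    final.extend([tweet] for tweet in pending)  # assumed fallback: singletons
    return final


tweets = [["linux", "kernel", "cve"], ["linux", "kernel", "bug"], ["windows", "rce"]]
clusters = recluster(tweets, group_by_first_word, wts)
```

With a randomized clustering step, such as k-means with random initial positions, each round can partition the pending tweets differently, which is what allows non-final clusters to be split into more cohesive ones.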

When there are no more feature vectors to re-cluster, a final step is taken to obtain each cluster's exemplar tweet, i.e., the tweets that will be shown to the analyst representing all the clusters obtained. This is done by choosing the tweet whose Euclidean distance to the centroid of each cluster is the shortest. An example of a generated cluster (and its exemplar) is presented in Table 9.
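Exemplar selection reduces to a nearest-to-centroid search. A minimal sketch, assuming each tweet in a cluster is already represented by a dense feature vector of equal length (the function name is hypothetical):

```python
import math


def exemplar_index(vectors):
    """Index of the vector closest (Euclidean distance) to the cluster centroid."""
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return min(range(len(vectors)), key=lambda i: math.dist(vectors[i], centroid))


cluster_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
chosen = exemplar_index(cluster_vectors)
```

Here the centroid is (2, 0), so the second vector is the closest one and would be shown to the analyst as the exemplar.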

Table 9 - An example of a cluster and its exemplar (in bold).

Cluster:
Vuln: Linux kernel CVE-2013-7446 Use After Free Denial of Service Vulnerability https://t.co/h3a8JjokqH Vulnerable: Ubuntu Ubuntu Linux…
Vuln: Linux kernel CVE-2013-7446 Use After Free Denial of Service Vulnerability https://t.co/0MdIsxpOm5
#Vuln: #Linux kernel CVE-2013-7446 Use After Free Denial of Service #Vulnerability https://t.co/NuEJ2bdV70 #bugtraq
#Vuln: #Linux kernel CVE-2013-7446 Use After Free Denial of Service #Vulnerability https://t.co/NuEJ2bdV70 #bugtraq
#cybersecurity Vuln: Linux kernel CVE-2013-7446 Use After Free Denial of Service Vulnerability https://t.co/zKPhIIizqk #infosec
Vuln: Linux kernel CVE-2013-7446 Use After Free Denial of Service Vulnerability: Linux kernel CVE-2013… https://t.co/veUzn1BPy6 #infosec
#vulnerability #security : Vuln: Linux kernel CVE-2013-7446 Use After Free Denial of Service Vulnerability https://t.co/dA32Wq97Rz

Experimental setup

Spark's MLlib implementation of k-means was used in our system. This algorithm was executed with fifty iterations, a minimum of two clusters, and the remaining parameters with their default values. The best classification models obtained in the experimental work using SVMs and MLP neural networks were evaluated for classification on datasets D2 and D3. In this case, clustering was performed on the set of tweets classified as positive by the selected models.


After the initial clustering of the tweets, the clusters were marked as final or non-final by using the WTS context-based cohesion measure. This was accomplished by applying the threshold $\lambda = 2/3$ to WTS. This value was selected after some preliminary experiments, reflecting the rationale that two tweets can be in the same cluster if and only if they share at least two thirds of their words. The re-clustering procedure was then applied recurrently to the data of non-final clusters until all clusters were deemed final.

We compare our data presentation strategy with a naive OSINT information discovery strategy, similar to the ones used by freely available tools and SIEMs capable of collecting OSINT. Instead of using a classifier, the naive approach considers that a tweet is relevant if it contains at least one infrastructure-related keyword and at least one word related to security concepts.

To automatically select the security concept keywords for the naive method, we required a methodology capable of capturing words generally used in threat-related tweets and of avoiding words related to specific threats. In a large security tweet corpus these words will have low TF-IDF values; therefore, we used the following methodology to obtain the naive security concept keyword set: a list of documents is obtained by selecting all tweets labeled as positive from all datasets; after that, we removed stop words, applied the TF-IDF method, and selected the words with a TF-IDF value lower than a given threshold $\tau$; finally, the list was manually filtered for security-irrelevant content (such as numbers). The values 0.1, 0.2 and 0.3 were considered for the threshold $\tau$, and after a visual inspection, $\tau = 0.2$ was chosen. This value provided the largest amount of generic words without showing words related to a specific context. The naive keyword set corresponding to $\tau = 0.2$ is presented in Table 10.

Table 10 - The words used in the naive filter.

access, acl, admin, advisory, allow, arbitrary, aslr, assurance, attack, auth, buffer, bug, bypass, certificate, code, command, corruption, csrf, cve, cyber, denial, deployment, dereference, disclosure, execute, exploit, hack, heap, identity, injection, interception, leak, overflow, privilege, remote, root, scripting, security, stack, threat, unauthenticated, vuln, xss

Since the keyword set was obtained from the positively labeled tweets of the datasets employed in this work, it is naturally biased by those tweets. Hence, this should be a close to optimal naive approach for the datasets employed. If the keywords were chosen from a larger set of security-related tweets, the naive method would surely capture more tweets concerning a wider set of themes, thus increasing the amount of data to be presented.
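The naive filter used as a baseline can be sketched in a few lines. The security keywords are the ones listed in Table 10; the tokenization rule, the function name and the example infrastructure keyword set are assumptions made for illustration, and only single-word infrastructure keywords are handled.

```python
import re

SECURITY_KEYWORDS = {
    "access", "acl", "admin", "advisory", "allow", "arbitrary", "aslr",
    "assurance", "attack", "auth", "buffer", "bug", "bypass", "certificate",
    "code", "command", "corruption", "csrf", "cve", "cyber", "denial",
    "deployment", "dereference", "disclosure", "execute", "exploit", "hack",
    "heap", "identity", "injection", "interception", "leak", "overflow",
    "privilege", "remote", "root", "scripting", "security", "stack", "threat",
    "unauthenticated", "vuln", "xss",
}


def is_relevant_naive(tweet, infrastructure_keywords,
                      security_keywords=SECURITY_KEYWORDS):
    """Relevant if the tweet has at least one keyword from each set."""
    tokens = set(re.findall(r"[a-z0-9]+", tweet.lower()))
    return bool(tokens & infrastructure_keywords) and bool(tokens & security_keywords)


infrastructure = {"linux", "ubuntu"}  # hypothetical infrastructure keyword set
flagged = is_relevant_naive(
    "Vuln: Linux kernel CVE-2013-7446 Use After Free Denial of Service",
    infrastructure,
)
```

A filter of this kind accepts any tweet that mentions a monitored asset together with a generic security word, which is why it tends to mark many more tweets as relevant than a trained classifier.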

Results

The clustering stage in the proposed tweet processing pipeline focuses on reducing the amount of information shown, while ensuring that it only hides repeated data. The frequency with which the clustering procedure must be executed is defined by the amount of tweets appearing at the output of the classification models, which in turn is influenced by the size of the infrastructure and by the accounts used to collect data. Finding a good policy to decide when to cluster data and prepare it for presentation to the analyst is beyond the scope of this work.

Nevertheless, it is important to demonstrate the virtues of adopting a clustering-based presentation of data to the SOC analysts. To do that, datasets D2 and D3 were split in consecutive weeks and the clustering procedure was executed on the tweets classified as positive in each of those weeks. The classification was performed by the best classifiers for each infrastructure: SVM for A, C, and D; and an MLP for B.

Table 11 presents the results obtained per week. Besides the numbers of positive tweets (N), we present results related to the initial clustering and to the final clustering obtained after our re-clustering method. The objective is to compare clustering evaluation metrics and highlight the advantages of the re-clustering procedure.

Table 11 - Results obtained by applying the clustering stage of the tweet processing pipeline.

Week:                      1     2     3     4     5     6     7     8     9     10    11    12    13    14
N:                         14140334842203172165294351184187229
Initial clustering, Kr:    6369579853711114
Initial clustering, WTS:   0.75  0.45  0.59  0.83  0.63  0.81  0.56  0.6   0.62  0.64  0.57  0.62  0.75  0.35
Final clustering, Kr:      22    30    28    27    25    22    54    47    55    80    70    68    61    59
Final clustering, WTS:     0.95  0.91  0.93  0.98  0.85  0.89  0.83  0.86  0.86  0.90  0.88  0.86  0.92  0.86
Final clustering, ReC:     345445131091112151012
Final clustering, %:       68.8  69.8  68.3  84.4  52.1  52.2  24.8  27.3  32.7  25.2  19.2  35.0  31.8  24.9
Final clustering, J:       0.43  0.46  0.43  0.36  0.43  0.41  0.46  0.50  0.40  0.50  0.48  0.50  0.44  0.45

For both cases, we present the number of clusters generated (Kr) and the average WTS cluster cohesion measure. For the re-clustering, we also show the number of times the re-clustering procedure was executed (ReC), the percentage of tweets shown to the analyst (%), and the maximum value of the Jaccard index (J) computed for all pairs of clusters. The latter measures how well the clusters are separated, which translates to the difference of themes among them.

The WTS results show that the similarity of elements of each cluster improves significantly (between 9% and 59%, 30% on average) when our re-clustering algorithm is used. Naturally, this improvement is accompanied by a huge increase in the number of clusters (e.g., up to 27× more for week 10). Nevertheless, our approach provides high theme similarity among the tweets of each cluster, and always higher than applying a single clustering instance.
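The reported improvement range can be checked against the average WTS values of Table 11. The sketch below assumes that the improvement is measured relative to the final (re-clustered) value, i.e. (final − initial) / final; this definition is our assumption, chosen because it reproduces the quoted figures.

```python
# Average WTS per week (weeks 1 to 14), as listed in Table 11.
wts_initial = [0.75, 0.45, 0.59, 0.83, 0.63, 0.81, 0.56,
               0.60, 0.62, 0.64, 0.57, 0.62, 0.75, 0.35]
wts_final = [0.95, 0.91, 0.93, 0.98, 0.85, 0.89, 0.83,
             0.86, 0.86, 0.90, 0.88, 0.86, 0.92, 0.86]

# Relative gain of the final clustering over the initial one, week by week.
gains = [(f - i) / f for i, f in zip(wts_initial, wts_final)]

smallest_gain = min(gains)              # week 6
largest_gain = max(gains)               # week 14
average_gain = sum(gains) / len(gains)
```

Under this definition the smallest gain is about 9% (week 6), the largest about 59% (week 14), and the mean about 30%.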


The maximum Jaccard index value highlights the difference in the themes of each cluster. The two results indicate that different clusters (large separation) were found, whose tweets were similar (high cohesion) in the theme. The column with the percentage of tweets shown reveals that in general there is a significant reduction in the number of tweets presented to the analyst by applying the clustering procedure.

Until the 6th week the value is rather high, yet justifiable because there were many small clusters mentioning different threats. The results presented in Figure 14 highlight the importance of using the pipeline herein proposed and reinforce the importance of its clustering stage. The figure shows the reduction in the number of tweets that would have to be analyzed, when compared to the naive approach described in the experimental setup above.

Figure 14 - The number of tweets presented using three different approaches.

The results clearly show the need for efficient OSINT retrieval tools. With a naive keyword-based approach the number of tweets marked as relevant would be extremely high, thus rendering the approach useless to the SOC analysts. The introduction of a trained classifier decreases the amount of information by 35%. By attaching a clustering stage, only 20.3% of the information has to be shown to the analysts, which is a significant improvement.

3.2.1.5 Ongoing and future work

Classification

Regarding the selection of the models' design variables and hyper-parameters, further experiments should be considered in order to optimize the model topology and hyper-parameters. For this purpose, we envisage using a multi-objective evolutionary algorithm.

One of the limitations of our current approach regarding pre-trained vectors is that these vectors were not trained on information security text, thus they miss a lot of valuable relations present in words specific to the information security field (e.g. "escalation", "remote", "exploit"). The development of such a model would require a large collection of data but could offer generous improvements in any cybersecurity data related task.


Future work also includes the implementation of a recurrent neural network, most likely some variation of the LSTM network architecture, a combination of both of these architectures (LSTM and CNN), and a comparison of results between them.

Information extraction

Following the classification problem, another objective of our research is to use Deep Learning approaches to perform Named Entity Recognition (NER) tasks to extract IoCs from a cluster of tweets.

Named entity recognition seeks to locate and classify named entities in text into pre-defined categories. In our case these categories could be cyberattacks, vulnerabilities, IP addresses, institution names, threat actors, or product names, to mention a few. We are currently looking at the state of the art regarding tools and NLP libraries that have NER models (e.g. Natural Language Tool Kit (NLTK), Stanford's NER, spaCy); however, the most likely scenario will be implementing a custom model based on recent work on deep learning NER models. For instance, from a cluster including the tweet "Microsoft Edge CVE-2016-0161 Remote Privilege Escalation Vulnerability" we intend the NER model to output:

• Product: "Microsoft Edge"
• CVE: CVE-2016-0161
• Vulnerability: Remote Privilege Escalation

This information would be formatted into an IoC and presented to a SOC team.
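To make the intended output concrete, the sketch below produces the same three fields for the example tweet using a regular expression and a purely positional heuristic. It is not a NER model and is not the approach planned in this work: it only assumes that the product name precedes the CVE identifier and that the vulnerability description follows it, which holds for this example but not in general.

```python
import re

CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def split_around_cve(tweet):
    """Toy extraction: text before the CVE id is the product, text after it
    (minus a trailing 'Vulnerability') is the vulnerability description."""
    match = CVE_PATTERN.search(tweet)
    if match is None:
        return None
    description = tweet[match.end():].strip()
    suffix = " Vulnerability"
    if description.endswith(suffix):
        description = description[:-len(suffix)]
    return {
        "product": tweet[:match.start()].strip(),
        "cve": match.group(0),
        "vulnerability": description,
    }


fields = split_around_cve(
    "Microsoft Edge CVE-2016-0161 Remote Privilege Escalation Vulnerability"
)
```

A learned NER model would be needed for tweets where the entities appear in other positions or where no CVE identifier is present.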

3.2.1.6 Pragmatics for a SOC deployment

Although the results presented in the previous sections demonstrate the feasibility and usefulness of our approach in building a threat awareness system for SOCs, there is still a set of features that needs to be addressed to enable its integration in a production environment.

Twitter as OSINT

When using Twitter as a cybersecurity data source, it is important to consider what would happen if some of the monitored accounts fell under the control of the adversary. In a nutshell, two things can happen [SAB15]: (1) the adversary may not tweet about the threats he is interested in exploiting using the accounts he controls; or (2) the adversary may create tweets with false threats, to make SOC analysts waste their time in solving potential non-existent problems. Neither attack should be a significant problem as long as the number of accounts controlled by the adversary is relatively small and the analysts take into account the reputation of the accounts monitored by the system.

Training the system

Our approach requires the creation of labeled datasets for training the classifiers. To do that, the SOC analysts need first to configure the keywords defining the infrastructure, possibly separated in different parts. Ideally, such parts should represent the asset types to be monitored, thus matching the asset model deployed in the SIEM system operated by the SOC. A second configuration step is to define the Twitter accounts that will be monitored (e.g., the ones in Table 5). After those two steps, the system should present all filtered tweets as if they are relevant, and a button for the analyst to mark a tweet as irrelevant.⁴ Notice that, to avoid bias, it is important to inform the analysts that the system is under training. When a sufficient number of positively-labeled tweets are collected, the classifiers can be trained in background and then placed in operation.

Re-training the system

As mentioned before, it is expected that the classifier's quality decreases with time as the operational data gets progressively different from the training data. To maintain the utility of the classifiers in use, it is important to minimize this effect. Incremental training is a technique that can be used for this purpose, where the classifier's model is continuously trained with new labeled examples [GEN15]. By constantly training the model with the latest events, it is constantly adapted to changes in input format (in this case, changes in tweet format or language).

Another possibility is to replace the model with a new model trained with only the latest data, e.g., the last three months of tweets. This way the model contains only the latest data formats, meaning that it is always adapted to the current events, and old data formats will not impact the classifier's quality.
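The second option, retraining on a sliding window of recent data, mainly requires selecting the labeled tweets that fall inside the window. A minimal sketch, assuming labeled examples are kept as (timestamp, text, label) tuples and approximating three months by 90 days; both choices, as well as the function name, are illustrative assumptions.

```python
from datetime import datetime, timedelta


def recent_examples(labeled, now, window_days=90):
    """Keep only the labeled tweets collected within the last window_days."""
    cutoff = now - timedelta(days=window_days)
    return [(text, label) for timestamp, text, label in labeled if timestamp >= cutoff]


labeled = [
    (datetime(2018, 2, 1), "Vuln: Linux kernel CVE-2013-7446 ...", 1),
    (datetime(2017, 6, 1), "Linux conference tickets on sale", 0),
]
training_set = recent_examples(labeled, now=datetime(2018, 3, 1))
```

The resulting `training_set` would then be passed to whatever training routine builds the replacement classifier.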

Creating new infrastructure parts

If one wants to add a new keyword set representing new assets in the infrastructure, the procedure would be similar to the training described before. The difference here is that the manual labeling phase to generate the training dataset for the new infrastructure happens during normal system operation.

Changing keywords

Adding or removing keywords from certain datasets requires retraining the classifier. Removing a keyword requires removing the tweets that were filtered by this keyword and retraining the model without them. To add a keyword, one needs first to complement the existing labeled dataset (in the same way as described before) with tweets related to the new keyword, and then retrain the model with the reformulated dataset.

Changing the monitored accounts

Changing the set of monitored Twitter accounts is not a burden for the system, since the structure of threat descriptions is expected to be similar across all security accounts (our experiments considered that for D2 and D3).

⁴ The "irrelevant" button must always be made available, even when the system is operating (and not training), in order to collect wrongly classified tweets for future retraining.


IoC generation for SIEM integration

SOC analysts usually keep an integrated view of the security-related events of their IT infrastructure through a SIEM system. SIEMs usually provide mechanisms to receive events from external sources not directly managed by them. For example, integration with HP ArcSight is achieved through the creation of a special component called a Connector, while in Splunk the same functionality is accomplished with a Modular input. In order to integrate our system with a SIEM, it is necessary to generate SIEM events from the clustering stage output.

To generate these events, the first step consists in extracting the information in each cluster of tweets. For that, the information in the exemplar tweet of the cluster, plus the information contained in hyperlinks that may exist in the tweet, has to be extracted. This could be done automatically by employing natural language processing tools such as named-entity recognizers. Once this is accomplished, the information has to be structured according to a format that can be fed to the SIEM. There are a few open-source formats for publishing IoCs, for instance OpenIoC, STIX and MISP. Figure 15 shows an example IoC in MISP format (the platform for IoC sharing in DiSIEM, see Section 4), generated from a tweet cluster.

Figure 15 - Example of an IoC generated from a tweet exemplar in MISP format.


The exemplar tweet is included in the value field, while a link for inspecting all tweets in the cluster appears in the comment field. The generated IoCs can then be fed to the SIEM as external events, providing plenty of fields that could be correlated with other events collected by the SIEM.
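The mapping from a cluster to an IoC can be sketched as the construction of a small JSON document. The structure below is modelled on the general shape of a MISP event (an event with a list of attributes, each carrying type, category, value and comment); it does not reproduce the exact content of Figure 15, and the attribute type, category, event title and cluster URL are illustrative assumptions.

```python
import json


def cluster_to_misp_event(exemplar_tweet, cluster_url):
    """Build a MISP-like event: exemplar in 'value', cluster link in 'comment'."""
    return {
        "Event": {
            "info": "OSINT: threat-related tweet cluster",  # assumed title
            "Attribute": [
                {
                    "type": "text",                      # assumed attribute type
                    "category": "External analysis",     # assumed category
                    "to_ids": False,
                    "value": exemplar_tweet,
                    "comment": cluster_url,
                }
            ],
        }
    }


event = cluster_to_misp_event(
    "Vuln: Linux kernel CVE-2013-7446 Use After Free Denial of Service Vulnerability",
    "https://example.org/clusters/42",  # hypothetical link to the full cluster
)
payload = json.dumps(event, indent=2)
```

The serialized `payload` is the kind of document that could be pushed to the IoC sharing platform or translated into an external event for the SIEM.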

3.2.2 Listening247 Threat Predictor

listening247 is a solution which offers analysis of unstructured data from various sources, including blogs, social networks, news, boards/forums and other openly available data on the web, for market research purposes. It uses a Software as a Service (SaaS) model that enables users to monitor the web for specific subjects/topics while extracting insights.

The platform has been designed for organizations to discover insights not only from social media but also from other online locations. Such systems are increasingly in demand by senior marketing executives who are looking for ways to sift through fast-changing data across geographies, languages and time zones. This functionality makes it particularly useful as it simplifies the process to accurately analyse unstructured data.

Social listening helps organisations: accurately evaluate marketing campaigns; analyse hot conversation topics; discover white space/market gaps; respond to negative and leverage positive posts; and benchmark their share of voice with competitors.

In terms of infrastructure, the platform makes use of Amazon's cloud services such as the storage service (S3) and elastic compute (EC2). This allows complex and computationally intensive tasks to be offloaded, allowing them to utilise the surplus processing and storage capability in the cloud. The data is analysed to extract sentiment, emotions and topics for brands, products, organisations or even key individuals. This information analysis step, combined with metadata coming from the online posts, is essential for the creation of insightful reports that describe what is said on the web about the subject of interest.

In terms of processing, listening247 is a distributed platform. It uses Hadoop MapReduce for data processing and implements state-of-the-art machine learning algorithms for data analysis. Raw and analysed data are stored in scalable distributed databases that offer a flexible query API used for our reporting needs. The back-end architecture is developed using Python, and recently the front end has been rebuilt using Django and JavaScript.

The listening247 platform is ideal for DiSIEM because it has a pipeline that has been perfected by years of research and development specifically for the purpose of deriving actionable insights from not only unstructured text, but also images. This pipeline can be extended to enhance the effectiveness of SIEMs in preventing cyber-attacks by providing them with actionable insights from various OSINT sources. Additionally, the platform has been used with success in several major languages including English, Spanish, German, Russian, Chinese, Japanese, and Vietnamese. DigitalMR's network of over 250 experienced and tested curators (or coders) worldwide, in addition to its industrial strength processes for "noise" removal and disambiguating posts, makes the training of machine learning models stand out in terms of performance.

3.2.2.1 Architecture

This section presents an overview of the listening247 threat predictor. Specifically, it highlights the data sources used, the architecture of the threat predictor, and the details of the machine learning models trained to model threats using openly available information from open source vulnerability databases, such as ExploitDB, VulnDB, and CVE, as well as annotated data from online data sources such as Twitter, boards (forums) and blogs.

Figure 16 illustrates how vulnerabilities of organisations' infrastructures lead to exploits, some of which are unknown to the organisation (i.e., zero-day exploits) and are used by threat actors to attack an organisation's infrastructure. It also illustrates how the listening247 threat predictor utilizes existing annotated data sources (such as VulnDB and ExploitDB) to learn to predict threats and apply that knowledge towards identifying threats found on various OSINT and Dark Web sources.

Figure 16 - An overview of the entities and their relationships, and the role of the threat predictor.


An organisation's network is exposed to the world when connected to the internet, also exposing the vulnerabilities of its infrastructure. This infrastructure consists of not only hardware devices such as routers, but also software and tools installed on its machines. These vulnerabilities could be known by the security teams of the organisation, in which case they are not a problem, as an update or patch can be developed to resolve them. However, threats to an organisation's infrastructure come, to a large extent, from vulnerabilities that are unknown. These are the type of vulnerabilities that the listening247 threat predictor detects in order to give an early warning to the organisation before anything lands in the wrong hands. The value of this threat intelligence is dependent on time; it can be valuable to hackers before the vulnerability is known to people trying to prevent it (i.e., zero-day vulnerability), but it is useless after a patch or update that resolves the vulnerability has been released. Thus, providing this intelligence in a timely manner is crucial. To achieve this, the threat predictor is trained to a large extent on open source vulnerability databases, as well as some annotated data.

Open source vulnerability data sources make excellent training data, as these are annotated by experts in the field with descriptors such as the targeted infrastructure, affected versions, the means, and the severity of the vulnerabilities. In addition, these descriptors are also summarised and published on the Twitter accounts of these vulnerability data sources, for example @CVEnew and @ExploitDB. Training on this data and being able to predict these descriptors on data not yet discovered by the volunteers contributing to these databases is not only valuable to the security team of an organisation, because it will report the details of these vulnerabilities in a familiar format, but it is also a more practical way of scaling the abilities of scarce cybersecurity experts on assessing potential cyber threats. There is a shortage of cybersecurity professionals and it is the general belief that leveraging Artificial Intelligence to augment the abilities of such professionals is the solution.

The listening247 threat predictor will integrate with the existing SIEMs through API calls. This is already the method of integration used by vulnerability data sources like VulnDB and is generally a flexible way to integrate services in software. The API calls allow for querying previously generated STIX alerts stored in AlertsDB, and also posting unstructured text to be analysed by the threat prediction model.

3.2.2.2 Novelties of the Cyber Threat Predictor

The completed listening247 threat predictor will leverage the strengths of the ontology-based techniques and machine learning techniques, thus combining the best of both worlds. Specifically, the pipeline will combine some of the strengths of the ontology-based approaches, such as the extraction of entities and their relationships and storing this in the form of a knowledge graph, with the strengths of machine learning approaches, namely their ability to learn complex patterns found in data.

The listening247 threat predictor will consist of two pipelines. The first will consist of entity extraction and linking, which will then be enriched in an aggregation step with additional information from the second pipeline, the machine learning pipeline, that will predict descriptors about the data such as the targeted platform, exploit type/means and the CVSS severity. Enriched facts about an entity (such as JBoss) and its relationship with other entities (such as a remote vulnerability) will then be stored in the form of a knowledge graph. This knowledge graph will build various facts over time, including their discovery time and CVE (if any is assigned), as time is just as vital to know if these vulnerabilities still pose a threat, or if an update has been released that resolves them. However, the major merit of this approach is not only that it will have more enriched alerts delivered to SIEMs, but that machine learning algorithms for relational prediction can be applied to learn the entities and relationships represented by the knowledge graph, and be able to predict the likelihood of a certain software having a certain vulnerability. In summary, these will be the features of the threat predictor:

• Entity extraction and linking to form facts.

• Enriching these facts with additional information about the targeted infrastructure, exploit type/means and the CVSS severity, if not already present in the OSINT data point.

• Storing these facts in a knowledge graph that inherently fuses them into a machine-readable form. This knowledge graph will grow with time and allow for learning and inference of new complex relationships that are not already known, such as inferring that a certain version of a piece of software that shares dependencies with its later version will likely be affected by an already known vulnerability. This is what we refer to as threat prediction: a relationship between a piece of software and a vulnerability that was not already known and that can be taken advantage of by hackers.

• The knowledge graph will be a valuable collection of searchable facts accumulated from various OSINT sources.
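The features listed above can be illustrated with a minimal in-memory sketch. It is not the listening247 implementation: the class and method names, the `dependsOn` predicate and the product names are hypothetical, and the "prediction" is a naive shared-dependency rule standing in for the learned relational models the text describes.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Fact:
    """One subject-predicate-object triple plus enrichment metadata."""
    subject: str
    predicate: str
    obj: str
    meta: dict = field(default_factory=dict)  # e.g. platform, exploit type, CVSS


class KnowledgeGraph:
    """Hypothetical in-memory triple store (illustration only)."""

    def __init__(self):
        self._facts = {}                     # (s, p, o) -> Fact
        self._by_subject = defaultdict(set)  # s -> {(s, p, o), ...}

    def add(self, subject, predicate, obj, **meta):
        key = (subject, predicate, obj)
        fact = self._facts.get(key)
        if fact is None:
            fact = Fact(subject, predicate, obj)
            self._facts[key] = fact
            self._by_subject[subject].add(key)
        fact.meta.update(meta)  # fuse new metadata into the existing fact
        return fact

    def objects(self, subject, predicate):
        return {o for (_, p, o) in self._by_subject[subject] if p == predicate}

    def candidate_vulnerabilities(self, subject):
        """Vulnerabilities of other entities that share a dependency with subject."""
        deps = self.objects(subject, "dependsOn")
        known = self.objects(subject, "hasVulnerability")
        candidates = set()
        for other in list(self._by_subject):
            if other != subject and deps & self.objects(other, "dependsOn"):
                candidates |= self.objects(other, "hasVulnerability")
        return candidates - known


# Usage with made-up product names
kg = KnowledgeGraph()
kg.add("app-1.1", "dependsOn", "libfoo")
kg.add("app-1.1", "hasVulnerability", "XSS", platform="php")
kg.add("app-1.0", "dependsOn", "libfoo")
print(kg.candidate_vulnerabilities("app-1.0"))
```

Keying facts by the full triple means that a repeated report from another OSINT source only updates the metadata of the existing fact instead of duplicating it.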

3.2.2.3 Data Sources

This section provides a brief description of the data sources that will be used for the listening247 threat predictor.

Social media sources

DigitalMR has the ability to collect information from a variety of online sources for market research, including Twitter, Facebook, blogs, forums, news and other openly available data. The key input for data gathering is a search query made up of keywords that define the scope of the data to be gathered from the various OSINT sources. This is the first step of noise filtering, which involves disambiguating keywords that might be homonyms of other words. For example, a search for “Windows” the operating system (an infrastructure) will also yield information about the “windows” used in buildings, among others. Forming a specialised query with the relevant infrastructures as keywords narrows down the collection from the vast amounts of OSINT data available on the web, thereby making the amount of data to be processed more manageable. So far, a query has been formed based on the inputs of the questionnaire filled in by the industrial partners, and estimates of the number of relevant posts for the infrastructures listed in the query suggest that there could be at least 24 million in the past year. These queries will be updated over time to cover more relevant terms that refer to commonly used infrastructures, and also to disambiguate terms to eliminate noise.
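The query-level disambiguation described above can be sketched as a keyword-plus-context filter. This is only an illustration: the context terms and the `build_matcher` helper are hypothetical, and the actual query language used by DigitalMR is not described in this deliverable.

```python
import re


def build_matcher(keyword, context_terms):
    """Return a predicate that accepts a post only if it mentions the
    keyword AND at least one disambiguating context term."""
    kw = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    ctx = [re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE)
           for t in context_terms]

    def matches(post):
        return bool(kw.search(post)) and any(c.search(post) for c in ctx)

    return matches


# Hypothetical context terms for "Windows" the operating system
is_windows_os = build_matcher(
    "windows", ["vulnerability", "exploit", "patch", "microsoft"]
)

print(is_windows_os("New exploit targets Windows SMB service"))
print(is_windows_os("We replaced the windows in our kitchen"))
```

A filter of this kind trades recall for precision: posts about the operating system that use none of the context terms are dropped, which is why the text foresees updating the query terms over time.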

Open source vulnerability databases

Openly available vulnerability databases, such as the NIST National Vulnerability Database (NVD), ExploitDB and VulnDB, are another source of data that was found to be useful and that has also been used in related studies. These databases typically consist of collaboratively accumulated reports about vulnerabilities, together with descriptors such as the targeted platform, exploit type, CVE and CVSS severity score. In addition, a report usually comes with code or console I/O that reproduces the vulnerability. This is precisely why open source vulnerability databases are a valuable source of training data. Fig. 2 shows the more than 30,000 types of exploits found in ExploitDB.

Darknet

Data from the darknet will also be gathered through the APIs of data providers offering these services, such as Webhose (see footnote 5). This will be integrated as an additional data source. The use of these APIs, some of which also claim to have tools and techniques that allow the safeguards of some darknet forums to be bypassed, not only eliminates the risks involved in getting data from the darknet but also offers more coverage of potentially relevant content. However, accessing this source of information comes with an additional cost charged by the data providers, which will be billed to the client interested in using it as an additional data source.

3.2.2.4 Cyber-Threat Modelling

The listening247 threat predictor consists of two pipelines: an entity extraction and linking pipeline that produces SPO (Subject-Predicate-Object) triples, as in ontology-based approaches to cyber threat prediction from OSINT, and a second pipeline that uses machine learning models to predict meta-information such as the targeted platform and exploit type. During inference on unstructured text data, the SPO triples are enriched with meta-information from the machine learning pipeline, which is also referred to as threat identification in this section.

Architecture

The threat predictor takes advantage of the strengths of the two leading approaches commonly found in the literature [NUN16, MIT16, MIT17, SAP17, HOV12] by utilising two pipelines. The first is a pipeline that extracts triples, which consist of named entities (e.g. applications and vulnerabilities) and the relationship established between them. For example, for the tweet in Figure 17 the extracted entities are {XSS, SVG, Tiki} and one of the triples is (‘Tiki’, ‘hasVulnerability’, ‘XSS’). These triples are stored in a knowledge graph which not only aggregates data from the openly available OSINT vulnerability databases and other OSINT data sources, but can also be enriched with information from the Maven repository, which maps application dependencies. Fig. 4 shows an example of a fact in the knowledge graph that leverages both the Maven repository and OSINT data sources.

5 https://webhose.io/data-feeds/dark-web/

Figure 17 - A tweet describing an XSS vulnerability.

Figure 18 - An example of a knowledge graph obtained from a recorded fact.

As the knowledge graph grows, it becomes a valuable source of relational information between infrastructures and vulnerabilities. This allows the capability of the threat predictor to be extended by modelling the relationships between entities to estimate the likelihood of new relationships [12]. These new relationships are in fact predictions, or new discoveries, of vulnerabilities based on the existing knowledge base. Examples include predicting whether Apache Kafka has certain dependencies in common with Elasticsearch, or how likely it is that a certain vulnerability already known for Kafka will also affect Elasticsearch.

The second pipeline consists of machine learning models which are meant to filter out irrelevant data and to identify threats by predicting their type, CVSS severity and the platform they are likely to affect (see Threat Identification in Figure 19). These SVM classifiers are trained on the openly available vulnerability databases and on annotated OSINT data. The noise filtering model is trained to pass only relevant data on to the threat identification model and to the Entity Extraction and Linking pipeline. This step not only protects the privacy of the public by avoiding personal content, but also saves the time and cost that would otherwise have been spent processing irrelevant, and therefore useless, information. Additionally, it protects the knowledge graph from being overloaded with irrelevant facts.


Figure 19 - An overview of the two pipelines that will be used for threat prediction.

The threat identification model will consist of multiple SVM models trained to predict the targeted platform, exploit type and CVSS severity. The training data for these models are the descriptions of vulnerabilities, excerpts of code, and/or console inputs/outputs that reproduce vulnerabilities, taken from the reports found in open source vulnerability data sources. The predictions of the models from this pipeline are aggregated with the triples produced by the entity recognition and linking parser to form an informative fact (such as in Fig. 4). Depending on a number of factors, such as whether the fact is already present in the knowledge graph and its date of discovery, an alert is then sent to the SIEM in the form of STIX data. These vulnerability facts are then added to the knowledge graph through the knowledge graph API and accumulate over time to constitute a resource that fuses knowledge from various OSINT and darknet sources.

The knowledge graph uses a schema-based approach which assumes that relationships not already in the graph are possible (i.e. an open world assumption, OWA). Being schema-based means that the properties of entities, as well as their relationships, will come from a predefined set that evolves with the complexity of the domain. This is valuable as it allows additional extensions to be introduced to this knowledge graph, such as adding embeddings for properties of entities and their relationships, thus enhancing search and clustering capabilities.


In terms of how the growing knowledge graph will be used, DigitalMR intends to continually enhance the threat prediction capability by training ensembles of deep learning models inspired by ER-MLP (Entity Relationship Multilayer Perceptron) and NTN (Neural Tensor Network), which will learn from the latent features of these entities and their relationships. The resulting deep learning models will learn a score function $f(x_{ijk})$ that predicts the existence of a relationship $k$ between two entities $i$ and $j$; for example, between a new version of an Apache HTTP Server (the entity $i$) and a known vulnerability (the entity $j$), a relationship which might not yet be known to SOC teams but presents a real threat to their infrastructure as it can be discovered and exploited in the future. These deep learning models will be updated as the knowledge graph grows to account for newly accumulated facts, and are expected to gain prediction accuracy over time.
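The deliverable does not spell out the functional form of the score function. As an illustration only, one common ER-MLP-style form from the relational learning literature scores a candidate triple (entity $i$, relationship $k$, entity $j$) by feeding the concatenated embeddings through a single hidden layer. The symbols $\mathbf{e}_i$, $\mathbf{e}_j$, $\mathbf{r}_k$, $\mathbf{C}$ and $\mathbf{w}$ are assumptions introduced here, not notation taken from the project:

```latex
f(x_{ijk}) = \mathbf{w}^{\top}\, g\!\left(\mathbf{C}^{\top}\,[\mathbf{e}_i;\ \mathbf{e}_j;\ \mathbf{r}_k]\right),
\qquad
P(y_{ijk} = 1) = \sigma\!\left(f(x_{ijk})\right)
```

Here $\mathbf{e}_i$ and $\mathbf{e}_j$ are learned latent feature vectors of the two entities, $\mathbf{r}_k$ is a learned vector for the relationship, $g$ is an element-wise non-linearity such as $\tanh$, and $\sigma$ is the logistic function that turns the score into a probability that the relationship exists.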

3.2.2.5 Experimental Results

In this section, the experimental results of the machine learning pipeline for noise filtering and meta-information prediction are presented. Specifically, they show the effectiveness of optimised Support Vector Machines (SVMs) for these tasks.

Noise filtering

Analysis of Dataset

The dataset used for this experiment consisted of a set of tweets annotated, by a PhD student of one of our academic partners, according to their relevance to a certain group of infrastructures. Each tweet was pre-processed to generate additional meta-information such as the ISO week, which is useful for grouping tweets that appear within the same week (see Table 12).

To get a better understanding of the dataset, the tweets were combined into one body of text and normalised using a pre-processing step to enable visualising the most frequent tokens found in the tweets (see Figure 20). This pre-processing step consisted of a lemmatiser based on a lexicon of English words (i.e., WordNet) and a specialised tweet tokeniser that normalises informal spellings of common words used on Twitter (e.g., “Hellooo” is normalised to “Hello”), thus increasing the effectiveness of bag-of-words approaches such as TF-IDF (Term Frequency - Inverse Document Frequency). Finally, other normalisations and stop word removal were also performed on this body of tweets.

The analysis shows that the top tokens in the tweets were relevant, and gives a glimpse of the terms that will be contained in the bag-of-words for the TF-IDF after fitting it to the dataset. The terms with the highest frequency were vulnerability, linux, infosec and wordpress.
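Two of the pre-processing steps mentioned above, ISO-week tagging and the normalisation of elongated spellings, can be sketched with the Python standard library. This is a rough approximation under stated assumptions: the helper names are hypothetical, the actual tokeniser and lemmatiser used in the experiment are not reproduced, and collapsing every run of three or more identical characters is a crude rule that would also alter tokens such as "www".

```python
import re
from datetime import date

# A run of three or more identical characters, e.g. the "ooo" in "Hellooo"
_REPEAT = re.compile(r"(.)\1{2,}")


def squeeze_repeats(token):
    """Collapse elongated spellings to a single character."""
    return _REPEAT.sub(r"\1", token)


def iso_week(d):
    """Return (ISO year, ISO week number) for grouping tweets by week."""
    iso = d.isocalendar()
    return iso[0], iso[1]


def normalise(text):
    """Lower-case, keep hashtags/mentions as tokens, squeeze elongations."""
    tokens = re.findall(r"[#@]?\w+", text.lower())
    return [squeeze_repeats(t) for t in tokens]


print(squeeze_repeats("Hellooo"))          # Hello
print(iso_week(date(2018, 3, 2)))          # (2018, 9)
print(normalise("Hellooo #infosec"))       # ['hello', '#infosec']
```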


Table 12 - Sample tweets and their relevance.

Figure 20 - Most frequent tokens found in the dataset.

Experimental setup

The noise elimination model consists of an ensemble of SVMs. The hyper-parameters of these SVM models were optimised using a pipeline based on evolutionary algorithms. Given a training dataset, the pipeline uses k-fold cross-validation not only to evaluate which machine learning algorithm from a given list of candidates (i.e., [SVM, Random Forest]) is more suitable for the dataset, but also to optimise the parameters of the model, such as the penalty, the number of iterations and the tolerance for early stopping, given constraints. Other hyper-parameters that are optimised include feature extraction parameters (i.e. TF-IDF parameters) such as the range of n-grams to use when building the bag-of-words, and the minimum and maximum document frequency for the terms. In addition, it optimises the percentile of features to keep after scoring the bag-of-words features from the TF-IDF step with a chi-squared saliency score. The output of this pipeline is a set of ideal parameters for the given dataset. In other words, the pipeline finds the machine learning algorithm $f(x)$ and the parameters $\theta$ that maximise the median f1-score of $f(x; \theta)$ on the cross-validation folds. This not only helps reduce bias (i.e., the bias component of the bias-variance decomposition) in the pipeline, but also brings a number of useful properties of evolutionary algorithms, such as being able to escape local minima and being able to optimise not only continuous spaces but discrete ones as well.

After running this optimisation pipeline on the dataset, it converged on the parameters described in Table 13 and Table 14. These parameters were used to initialise a voting ensemble of Linear Support Vector Machines, whose members differed in their tolerance parameter for early stopping and in their maximum iteration parameter to ensure diversity among the members of the ensemble.

Table 13 - Optimized parameters for TF-IDF.

TF-IDF parameters
Parameter                       Value
n-gram range                    (1, 2)
Max document freq.              0.487%
Stop word removal               ON (English)
Sublinear term freq. scaling    ON

Table 14 - Optimized parameters for SVM.

SVM parameter
Parameter             Value
C (penalty)           1.28
Multiclass fitting    “Crammer Singer”
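The kind of evolutionary search described in this experimental setup can be sketched as follows. This is not the optimisation pipeline used in the experiments: the search space, the mutation rule and, above all, `toy_fitness` (a synthetic stand-in for "median f1-score over the cross-validation folds", with an arbitrarily placed optimum) are assumptions made purely for illustration.

```python
import random
import statistics


def evolve(fitness, init, mutate, generations=20, population=8, seed=0):
    """(mu + lambda) evolution: parents compete with their children,
    so the best individual never gets worse between generations."""
    rng = random.Random(seed)
    pop = [init(rng) for _ in range(population)]
    for _ in range(generations):
        children = [mutate(rng.choice(pop), rng) for _ in range(population)]
        pop = sorted(pop + children, key=fitness, reverse=True)[:population]
    return pop[0]


def init(rng):
    # One continuous and one discrete hyper-parameter (hypothetical ranges)
    return {"C": rng.uniform(0.01, 10.0), "ngram_max": rng.choice([1, 2, 3])}


def mutate(params, rng):
    child = dict(params)
    if rng.random() < 0.5:
        child["C"] = min(10.0, max(0.01, child["C"] * rng.uniform(0.5, 1.5)))
    else:
        child["ngram_max"] = rng.choice([1, 2, 3])
    return child


def toy_fitness(params):
    # Synthetic objective; a real pipeline would train and score a model
    # on each cross-validation fold and return the median f1-score.
    base = 1.0 - abs(params["C"] - 1.0) / 10.0 - 0.05 * abs(params["ngram_max"] - 2)
    fold_scores = [base - 0.01, base, base + 0.01]
    return statistics.median(fold_scores)


best = evolve(toy_fitness, init, mutate)
print(best, toy_fitness(best))
```

Because mutation handles the continuous parameter by rescaling and the discrete one by resampling, the same loop covers both kinds of search space, which is the property the text highlights.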

Results

The results of the noise model were promising, even though the dataset is made up of posts from the same domain (i.e. cybersecurity) rather than from totally different domains where terms differ significantly. Figure 21 shows box plots of the f1-score, precision and recall, respectively, which summarise the 3-fold cross-validation results on the data. The results show that the ensemble of optimised SVMs was effective in identifying noise. A plot of the confusion matrices (see Figure 22) for each fold also shows that the voting ensembles differentiated well between relevant posts (represented by ‘yes’) and irrelevant posts (represented by ‘no’).


Figure 21 - Noise filtering results: f1-score, precision and recall, respectively.

Figure 22 - Confusion matrices for each of the three folds, respectively.

In general, the optimised machine learning pipeline proved effective at eliminating noise, as shown by the performance metrics for noise classification. As mentioned previously, noise elimination is not limited to filtering noise out of OSINT data; it also takes place in the search query, which allows us to disambiguate homonyms. In conjunction with this noise model, it makes for an effective noise filtering pipeline that filters out as much of the irrelevant data as possible.

Threat identification

In this section, the results of using the models optimised by the pipeline to predict meta-information contained in the descriptors of exploits reported to ExploitDB are presented. This meta-information includes the platform, the type or means of the exploit, and the port exploited. This is all valuable information when it comes to taking action to secure the infrastructure of an organisation against an exploit.

Analysis of ExploitDB

Reports are published on ExploitDB with a number of files that contain console input/output and code that reproduce the exploits. Table 15 shows sample reports from ExploitDB along with the descriptors normally filled in with the report of the exploit.

Table 15 - Sample of reports submitted to ExploitDB.

In order to get a better understanding of the dataset, the most frequent terms were analysed using the same approach described in the previous section on noise filtering. Some of the most frequent terms were Injection, SQL, Remote, File, Scripting and Overflow, which made up a large portion of the database (Figure 23). Figure 24 shows how these exploits were distributed among the different platforms. The platforms with the most exploits were php and windows. Both are popular platforms, and their popularity comes at the price of attracting threat actors who probe them to find exploits.


Figure 23 - Most frequent terms found in the description of the exploits.

Figure 24 - Number of exploits per platform.

Experimental setup

For this experiment we used an ensemble of SVMs, optimised as described in the experimental setup for noise filtering. Prior to training the models, terms that give away the class were removed from the descriptions to avoid having the models overfit on those terms, and also to force them to learn latent features for better generalisation.

After running this optimisation pipeline on the dataset, it converged on the parameters described in Table 16 and Table 17. These parameters were used to initialise a bagging ensemble of Support Vector Machines, whose members were trained on different samples of the training data to ensure diversity among the members of the ensemble.

Table 16 - Optimized parameters for TF-IDF.

TF-IDF parameters
Parameter                       Value
n-gram range                    (1, 2)
Max document freq.              0.487%
Stop word removal               ON (English)
Sublinear term freq. scaling    ON

Table 17 - Optimized parameters for SVM.

SVM parameter
Parameter             Value
C (penalty)           1.28
Multiclass fitting    “Crammer Singer”
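The bagging scheme used for threat identification (each member trained on a different sample of the training data, predictions combined by vote) can be sketched as below. To stay self-contained, the base learner is a deliberately trivial token-count voter and NOT an SVM; all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Sample len(data) items with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    return Counter(predictions).most_common(1)[0][0]


def train_token_voter(samples):
    """Stand-in base learner (not an SVM): remembers how often each token
    co-occurred with each label and lets the tokens of a new text vote."""
    counts = {}
    for text, label in samples:
        for tok in text.lower().split():
            counts.setdefault(tok, Counter())[label] += 1
    fallback = Counter(label for _, label in samples).most_common(1)[0][0]

    def predict(text):
        votes = Counter()
        for tok in text.lower().split():
            votes.update(counts.get(tok, {}))
        return votes.most_common(1)[0][0] if votes else fallback

    return predict


def train_bagging(samples, train, n_members=5, seed=0):
    """Train each member on its own bootstrap sample, then vote."""
    rng = random.Random(seed)
    members = [train(bootstrap_sample(samples, rng)) for _ in range(n_members)]
    return lambda text: majority_vote([m(text) for m in members])


# Made-up training examples
data = [("sql injection in login form", "webapps"),
        ("blind sql injection", "webapps"),
        ("cross site scripting", "webapps")]
ensemble = train_bagging(data, train_token_voter)
print(ensemble("sql injection"))
```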

Results

The results of predicting meta-information such as the targeted platform and the means of the attack were very promising. This section highlights the results of predicting the exploit type and the targeted platform from unstructured text, in the form of the description of the vulnerability and of the console input/output or code that reproduces it. The descriptions were pre-processed to remove any tokens that mention the targeted platforms, to avoid a situation where the model overfits on those terms.
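Removing class-revealing tokens before training, as described above, might look like the following sketch. The helper name and the example descriptions are made up; the label list is the set of platform classes named in this section.

```python
import re

PLATFORMS = ["linux", "windows", "osx", "asp", "multiple",
             "php", "jsp", "java", "hardware"]


def strip_label_terms(text, labels):
    """Remove whole-word, case-insensitive occurrences of the class labels
    so that a classifier cannot simply memorise them."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, labels)) + r")\b", re.IGNORECASE
    )
    # Replace each match with a space, then collapse repeated whitespace
    return re.sub(r"\s+", " ", pattern.sub(" ", text)).strip()


print(strip_label_terms("WordPress plugin PHP SQL Injection", PLATFORMS))
print(strip_label_terms("JavaScript engine overflow", PLATFORMS))
```

Word boundaries matter here: "JavaScript" is left intact even though "java" is a label, because only whole-word matches are removed.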

Predict platform affected based on exploit description

Training on the descriptions published on the Twitter account of ExploitDB and predicting the targeted platform from them showed promising results, as shown in Figure 25. Specifically, the precision, recall and f1-score for the 3-fold cross-validation show that the model was able to identify the targeted platform reasonably well. The median f1-score lies at around 73.15%, which is significant for multiclass classification over the classes {‘linux’, ‘windows’, ‘osx’, ‘asp’, ‘multiple’, ‘php’, ‘jsp’, ‘java’, ‘hardware’}, especially considering that the class ‘multiple’ overlaps with other classes.

The confusion matrices shown in Figure 26, Figure 27 and Figure 28 represent the results of the model on the 3 folds of cross-validation. The diagonal of a confusion matrix is the most important part to look at, as it shows how many test examples, never seen by the model during training, were correctly classified. The other cells show the examples for which the model confused the classes. In this case, the model had a majority of the test examples on the diagonal of the confusion matrix. However, it also confused some classes with others. Some of the most common confusions were between ‘windows’ and ‘multiple’, ‘windows’ and ‘linux’, and ‘asp’ and ‘jsp’. Confusing ‘windows’ and ‘multiple’ was the most likely, which makes sense because ‘multiple’ means the exploit targets multiple platforms, Windows included.
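The reading of the confusion matrices given above (diagonal cells are correct classifications, off-diagonal cells are confusions) can be made concrete with a short sketch. The labels in the usage example are invented for illustration and are not results from the experiment.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Sparse confusion matrix: (true label, predicted label) -> count."""
    return Counter(zip(y_true, y_pred))


def accuracy(cm):
    """Share of examples that fall on the diagonal."""
    total = sum(cm.values())
    correct = sum(n for (t, p), n in cm.items() if t == p)
    return correct / total if total else 0.0


def most_confused(cm, top=3):
    """Unordered label pairs with the most off-diagonal mass."""
    pairs = Counter()
    for (t, p), n in cm.items():
        if t != p:
            pairs[frozenset((t, p))] += n
    return [(tuple(sorted(pair)), n) for pair, n in pairs.most_common(top)]


# Invented labels, for illustration only
y_true = ["windows", "windows", "multiple", "linux", "asp", "windows"]
y_pred = ["windows", "multiple", "windows", "linux", "jsp", "linux"]
cm = confusion_counts(y_true, y_pred)
print(accuracy(cm))
print(most_confused(cm, top=1))
```

Summing both directions of a pair (true ‘windows’ predicted ‘multiple’ and the reverse) matches the way the text talks about confusions "between" two classes.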

Figure 25 - Platform prediction by description results: f1-score, precision, and recall, respectively.

Figure 26 - Confusion matrix of the 1st fold of the 3-fold cross-validation.


Figure 27 - Confusion matrix of the 2nd fold of the 3-fold cross-validation.

Figure 28 - Confusion matrix of the 3rd fold of the 3-fold cross-validation.


Predicting exploit type based on exploit description

In this section we trained the model to predict the exploit type based on the short descriptions published on Twitter. The distribution of the scores was more even, with a tight variance, as shown by Figure 29.

Figure 29 - Exploit type prediction by description results: f1-score, precision, and recall, respectively.

The effectiveness of the model is also demonstrated by the confusion matrices (see Figure 30, Figure 31 and Figure 32). The diagonals of these matrices, which count the correctly classified test examples from the cross-validation folds, hold the majority of the examples. The matrices also reveal that the classifier tends to confuse exploits of ‘remote’ means with ‘webapps’ (and vice versa) a bit more than other classes. This probably results from some overlap in the content of the two. In general, the model was shown to be effective at predicting the means of the exploits from the short description, even though terms that give away the approach were stripped from the text.


Figure 30 - Confusion matrix of the 1st fold for the 3-fold cross-validation for predicting exploit type.

Figure 31 - Confusion matrix of the 2nd fold for the 3-fold cross-validation for predicting exploit type.


Figure 32 - Confusion matrix of the 3rd fold for the 3-fold cross-validation for predicting exploit type.

Predict exploit type through console input/output and code

Using the console I/O and code to predict the exploit type gave significant results that showed the effectiveness of the ensemble at telling the exploit type. The precision, recall and f1-score were more promising than those obtained with short descriptions (see Figure 33).

Figure 33 - Exploit type prediction results: f1-score, precision, and recall, respectively.


Compared to the short descriptions published on Twitter, the console I/O was far more effective as a source of input for the model. In particular, the median precision was at the 89.2% mark, compared to 73% when using short descriptions. The confusion matrices shown in Fig. 30, Fig. 31 and Fig. 32 (see Appendix) demonstrate the effectiveness of this model at identifying the targeted platform. Specifically, it can be seen that it had fewer problems differentiating between ‘remote’ and ‘webapps’ in this case.

Predicting exploits’ target through console input/output and code

The results of training on the console I/O and code used to reproduce these exploits and predicting the exploits’ targeted platform were also very promising. Interestingly, the results were comparable to those of the first approach, which used the exploit descriptions as the input for training. The precision, recall and f1-scores in Figure 34 illustrate this. In general, the medians of the f1-score, precision and recall from the 3 folds of cross-validation all lie around the 85% mark.

Figure 34 - Platform prediction results: f1-score, precision, and recall, respectively.

3.2.2.6 Conclusions

In summary, it has been shown that the optimised voting ensemble for noise filtering is effective. Furthermore, it has been demonstrated that the optimised bagging ensemble of SVMs is effective for predicting meta-information that is typically filled in by cybersecurity professionals, specifically the targeted platform and the exploit type. This held both for unstructured text in the form of the descriptions normally published on the Twitter account of ExploitDB and for the console I/O or code used to reproduce the exploits. The predictions were more effective when the input was the console I/O or code used to reproduce the exploits. This also demonstrates the value of open source vulnerability databases, which make excellent training data.


4 Context-Aware OSINT Integration

One of the weakest points of current SIEMs is the retrieval of data from Open Source Intelligence (OSINT) [DIS21], as well as how this kind of information should be processed and normalised, considering its unstructured nature. [DIS41] gives a detailed description of OSINT data fusion and analysis techniques; however, in order to inject this cybersecurity-related information (e.g., IoCs) directly into SIEMs, it is necessary to correlate it with real-time data coming from the monitored infrastructures. In this way, incoming data can be prioritised, allowing faster incident detection and response once it is injected into SIEMs. As stated in [DIS41], this task will be performed by a DiSIEM component named Context-aware Intelligence Integrator. It will receive cyber threat information from OSINT-based components (e.g., the Threat Predictor, developed by FCiências.ID, and the listening247 platform, developed by DigitalMR) and from other external sources of structured information, in order to correlate it with both static and dynamic information coming from the monitored infrastructure. The correlated information will then be enriched with a threat score computed by a heuristic-based analysis (as detailed in Sec. 5.4), and the final results will be expressed with the STIX 2.0 standard [OAS18].

In the next section, the novelty and the importance of the component in the DiSIEM context will be highlighted. Moreover, we provide a comparison among the most widely used open-source Threat Intelligence Platforms (TIPs) before describing the final architecture of the component itself. The last sections cover the computation details of the heuristic analysis performed by our component.

4.1 Novelty of the Component

SIEMs are essential tools for every organisation nowadays: they allow real-time monitoring of internal and relevant assets, collecting various kinds of information from multiple sources, both internal and external, and raising alarms if something anomalous is detected. However, they come with many limitations, as stated in [THR17], [THR15] and [DIS21]. These concern especially the ad-hoc importing of unstructured data coming from external cyber threat intelligence sources, such as Open-Source Intelligence (OSINT), which could lead to a high number of false positives, as well as the detection of unknown events and the advanced analysis capabilities needed to infer detailed information about the Tactics, Techniques and Procedures of attackers, which could be used for speeding up both decision making and incident response.

In order to understand why this component is necessary for the DiSIEM project, it is important to understand the difference between Threat Intelligence (TI) and Threat Data (TD). The former is a concept widely used nowadays, in both the academic and the industrial world, yet there is still no precise and unique definition of it. Rob McMillan (McMillan, 2013) defines TI as evidence-based knowledge related to cyber threats, which aims at improving decision making and threat detection and at speeding up the incident response phase. Another definition, given by Henry Dalziel (Dalziel, 2014), states that TI is information which must meet three specific criteria: it must be (i) relevant for the entity that receives it, (ii) actionable and (iii) valuable from a business perspective. Moreover, ThreatConnect [THR18], an American cyber-security firm focusing on more technical aspects, affirms that “TI is the knowledge of a threat’s capabilities, infrastructure, motives, goal and resources. The application of this information assists in the operational and strategic defense of network-based assets”.

In [ENI14] the concept of “actionable information” is explained by the European Union Agency for Network and Information Security (ENISA) from an organisation’s point of view: it refers to information that can be used immediately for specific and strategic decision making. According to [THR15] and [ENI14], information must meet the following criteria in order to be “actionable”:

• Relevance: it must have some impact on specific assets of the receiver, such as networks, software and hardware. That is, indicators of compromise will usually be considered relevant when a threat could affect the monitored infrastructure. In order to determine relevance, it is crucial to determine the types of threats targeting your assets/systems, considering real-time information (e.g., IoCs) from many internal sources, because they are able to provide dynamic and continuous information about current internal monitoring operations, together with a global view of the infrastructure status.

• Timeliness: TI is more reliable when it allows an attacker’s activity to be detected even after changes or evolutions in terms of capabilities and infrastructure. Moreover, information about events older than a few hours is, most of the time, irrelevant and non-actionable due to the high level of dynamicity of some threat characteristics, considering that some threats are discovered and analysed months after the initial compromise.

• Accuracy: the receiver should be able to process the received data as soon as possible. This depends mainly on three factors: the confidence of the source from which data is retrieved, the trust level placed in those sources (which, in turn, could depend on factors such as false positive/false negative rates) and the local dynamic context of the receiver. The latter is crucial in order to avoid inaccurate results and wasted effort when dealing with incident response.

• Completeness: TI should provide valuable and complete information to the final receiver (SIEMs, in the DiSIEM case). Considering the high dynamicity of its context, it is important to say that, like accuracy, completeness must always be associated with the context of the final destination of the information. Sometimes sources are incomplete when considered alone, but the data they provide becomes actionable once combined or processed with other internal data available to the destination or received from other external sources.

• Ingestibility: received information must be easy to ingest into internal data management systems for further processing and analysis phases. This is achievable by using specific standards for representing the data, which allows the receiver to process data as fast as possible and also helps security analysts, as well as by using specific transfer protocols for sharing the related intelligence.

• Variety: detection and prevention should not rely on a single technique or tool. It is crucial to use a combination of systems, tools (e.g., IDS, IPS and firewalls) and sources (e.g., OSINT), especially when they are able to detect the threat at different levels of intrusion (kill chain phases).
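Three of the criteria above (relevance, timeliness and variety) lend themselves to simple mechanical checks, sketched below. This is not the heuristic used by the Context-aware Intelligence Integrator: the `ThreatDatum` fields, the 24-hour freshness window and the two-source threshold are all hypothetical choices made for the illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ThreatDatum:
    """Hypothetical record for one piece of incoming threat data."""
    indicator: str
    affected_product: str
    observed_at: datetime
    sources: frozenset


def is_relevant(datum, asset_products):
    """Relevant if it concerns a product present in the monitored assets."""
    return datum.affected_product.lower() in {p.lower() for p in asset_products}


def is_timely(datum, now, max_age=timedelta(hours=24)):
    """Timely if observed within an (assumed) freshness window."""
    return timedelta(0) <= now - datum.observed_at <= max_age


def has_variety(datum, min_sources=2):
    """Variety if the same datum was reported by several sources."""
    return len(datum.sources) >= min_sources


def met_criteria(datum, asset_products, now):
    checks = {
        "relevance": is_relevant(datum, asset_products),
        "timeliness": is_timely(datum, now),
        "variety": has_variety(datum),
    }
    return [name for name, ok in checks.items() if ok]


datum = ThreatDatum("203.0.113.7", "Apache Kafka",
                    datetime(2018, 3, 1, 12, 0),
                    frozenset({"twitter", "ip-blacklist"}))
print(met_criteria(datum, ["apache kafka", "JBoss"], datetime(2018, 3, 2)))
```

Accuracy and completeness are left out on purpose: as the text explains, they depend on source trust and on the receiver's dynamic context, which cannot be reduced to a single lookup.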

After these considerations, it is clear that TI is a subjective concept: information that is considered TI by one entity might not be considered TI from another entity’s point of view.

All incoming data, both structured and unstructured, is defined generically as Threat Data, and it must not be considered intelligence. It can become intelligence only after specific processing and analysis steps which combine it with internal and dynamic information, as well as with Threat Data received from other external sources [THR15] [FLO16], in order to meet the above criteria, adding, if possible, contextual information, which could have more importance than the indicator itself [MOH17].

The OSINT-based components developed in the DiSIEM project use static information about the monitored infrastructures in their processing techniques; therefore, they are able to provide partial intelligence, especially in terms of relevance, accuracy, completeness and variety. With the Context-Aware Intelligence Integrator, we aim at converting threat information received from OSINT-based components, as well as from selected external feeds, into actionable TI, ready to be injected into the SIEMs, with the addition of a threat score that takes the above-mentioned criteria into account. More precisely, the improvements for each criterion will be:

• Relevance: OSINT-based components are able to infer relevance only by checking retrieved data against some high-level, static information provided by the monitored infrastructure. They do not consider any kind of dynamic information or specific past events detected in the infrastructure itself, and thus they are not able to provide a detailed relevance degree. This task is done by our component, and some of the enriched events will be provided to these components as feedback, to improve their analysis capabilities.

• Timeliness: OSINT-based components are able to collect and process cybersecurity information in real time, so they can meet this criterion. However, they are not able to consider possibly related past events, which implies that the criterion is only partially met. They cannot recognise that a detected event could be related to one already detected, by the infrastructure or by other OSINT-based components, and that these events could refer to the same threat but, for example, at a different level of intrusion. Our component aims at overcoming such limitations.

• Accuracy: the Context-aware Intelligence Integrator will consider the dynamic context of each monitored infrastructure (not considered by OSINT-based components). It will match the information received from them against the IoCs dynamically received from each monitored infrastructure. The result of the analysis will be sent not only to the SIEMs, but also to the OSINT-based components as feedback. In this way they can improve their processing phases and, at the same time, have a global view of the effectiveness of the enriched IoCs in the related infrastructure. Besides, if the same event is detected by more than one component/infrastructure, the accuracy could be higher.

• Completeness: similar to accuracy, if dynamic information related to assets or past events is not considered, the reachable level of completeness cannot be achieved, because the IoCs cannot be enriched with this kind of information.

• Ingestibility: in terms of the standards used for representing threat intelligence, OSINT-based components should be able to provide data represented through specific standards (e.g., STIX 2.0, MISP format, JSON).

• Variety: a single OSINT-based component is only able to consider the sources from which its information has been gathered (e.g., Twitter). Instead, our component has a more global view, considering a large number of external and internal sources, and it can check, for example, whether the same malicious event has been detected by, or gathered from, different sources.
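As an illustration of the Ingestibility criterion, the sketch below builds a STIX 2.0 Indicator object carrying a threat score. The property set (`type`, `id`, `created`, `modified`, `labels`, `pattern`, `valid_from`) follows the STIX 2.0 Indicator definition; the custom property name `x_disiem_threat_score`, the helper function and the example values are assumptions and not the format produced by the DiSIEM component.

```python
import json
import uuid
from datetime import datetime, timezone


def stix_indicator(pattern, labels, threat_score, now=None):
    """Build a STIX 2.0 Indicator as a plain dict (illustrative sketch)."""
    now = now or datetime.now(timezone.utc)
    ts = now.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return {
        "type": "indicator",
        "id": "indicator--" + str(uuid.uuid4()),
        "created": ts,
        "modified": ts,
        "labels": labels,            # STIX 2.0 requires labels on indicators
        "pattern": pattern,
        "valid_from": ts,
        # Hypothetical custom property; STIX 2.0 recommends the "x_" prefix
        "x_disiem_threat_score": threat_score,
    }


ind = stix_indicator(
    "[ipv4-addr:value = '203.0.113.7']",
    ["malicious-activity"],
    threat_score=0.8,
    now=datetime(2018, 3, 2, tzinfo=timezone.utc),
)
print(json.dumps(ind, indent=2))
```

Because the object is plain JSON, it can be serialised and passed to any consumer that understands STIX 2.0, while consumers unaware of the custom property can ignore it.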

4.2 Threat Intelligence Platforms comparison

In order to overcome the limitations of SIEMs, many companies have started relying on so-called Threat Intelligence Platforms (TIPs) [THR15]. They are in charge of retrieving both structured and unstructured data from diverse external sources and of performing various complex operations, such as filtering, aggregation, normalisation, detection, analysis and enrichment, injecting the result directly into the SIEM. However, their implementation and usage are still in their infancy and, as stated in [CLE17], many limitations have yet to be addressed, for example in terms of dynamic trust assessment of external sources and of advanced analysis capabilities, where too much manual work is still needed, especially for making the retrieved information effectively actionable.

TIPs are well suited to data collection, normalisation, storage and sharing, and to integration with SIEMs. Considering that the Context-Aware Intelligence Integrator must interact with both external (e.g., OSINT-based components) and internal (e.g., SIEMs) sources of threat information, and store this data in order to perform the heuristic analysis for the threat score evaluation, a TIP can represent a good solution as a starting point. More details will be given in Section 4.3, where the architecture of the component is described. Many TIPs have been developed so far; however, not many have been released under an open source licence. The open-source solutions that have been identified are the following:

• The Malware Information Sharing Platform (MISP) [MIS18],
• The Collective Intelligence Framework (CIF) [CSI18],
• The Collaborative Research Into Threats (CRITs) [MIT14], and
• Soltra Edge [SOL18] (only a limited version is available with this kind of license).

The comparison among them is based on the survey [WIE18], which considers the following criteria (some personal considerations have also been added):

• Import/Export format: MISP and CRITs appear to be the best at satisfying this criterion. They are able to work with a large number of formats (e.g., pdf, doc, xls, txt, JSON, xml, STIX). Moreover, the former also supports an ad-hoc standard for representing TI, the MISP format, which is a customized JSON, and has built-in capabilities for converting information represented through it into STIX 2.0 data. It also allows adding modules for ad-hoc importing/exporting without modifying the core functionalities. CIF is not as flexible as the previous two, especially if some specific standards are considered (e.g., STIX), while the free and limited version of Soltra Edge is not able to import non-STIX data (the Non-STIX Data Source Plugin is not available).

• Integration with/Export to standard security tools: MISP seems to be the best at satisfying this criterion. It allows easy interaction with IDS and SIEMs; in addition, it provides a very flexible REST API for integrating internal solutions with the platform. CIF is also a viable platform from this point of view, i.e., when integration with IDS and SIEMs is considered. However, it is less flexible than MISP. CRITs is a huge repository of TI, not specifically built for interacting with systems such as SIEMs and IDS; however, its flexibility allows building ad-hoc solutions for these purposes. Finally, the free version of Soltra Edge has many limitations from this point of view as well, especially in terms of API support for interacting with the platform.

• Support of collaboration: MISP supports both a centralized approach, where the same instance is shared among a trusted community, and a decentralized approach, where each entity possesses its own instance, with its private database, and the interaction among different instances is performed in a peer-to-peer way. CIF instead allows the usage of a private instance, as well as the implementation of a shared instance through a centralized service. CRITs and Soltra Edge follow an approach very similar to that of CIF, allowing the usage of a private instance, or of a shared one in the context of a trusted community. However, CRITs has very poor built-in sharing capabilities.

• Data Exchange Standards: MISP and CRITs are able to deal with many different standards, including very specific ones such as STIX and TAXII [OAS181]. Soltra Edge has been built especially for working with standards like STIX and TAXII, but the limited version has very poor capabilities for dealing with standards other than these. CIF, instead, is not so flexible from this point of view when capabilities for sharing, or interacting, with external entities are considered; it has been designed for working with other CIF instances, with no or only partial support for standards such as STIX and TAXII, using private solutions for meeting high performance requirements.

• Analysis capabilities: advanced analysis capabilities are a weak point of every current TIP, not only of those considered in this analysis, as stated in [CLE17]. Considering these four TIPs, CRITs and Soltra Edge, being large central repositories for collaborative analysis rather than pure sharing platforms, have better built-in analysis capabilities than MISP and CIF. However, for Soltra Edge this advantage is partially lost with the limited version.

• Graph generation: visualization capabilities are strictly related to the analysis features considered above, and the same considerations apply. More generally, this is another limitation of current TIPs [CLE17].

• License: if the limited version of Soltra Edge is considered, all the TIPs taken into account in this comparison are released with an open source license.

• Hardware requirements: MISP, CRITs and Soltra Edge have very similar requirements in terms of RAM and hard disk size. CIF, instead, requires a little more, especially in terms of processing capabilities.

In the end, the choice should be made considering what is really needed in DiSIEM. In this case, the platform that fits best is MISP, especially considering the integration with SIEMs and IDS, the high flexibility for easily integrating internal and custom solutions (that is, the heuristic analysis engine) and the good support of specific data exchange standards, such as STIX. Another advantage of MISP is the availability of very detailed online documentation [MIS], as well as a large and responsive online community, in case of development issues or any other doubts. Table 18 shows a high-level view of the TIP analysis.

Table 18 - TIP comparison.

Criterion                  MISP   CIF   CRITs   Soltra Edge (l.v.)
Import/Export Format       +      o     +       -
Integration Capabilities   +      +     o       o
Data Exchange Standards    +      o     o       o
Support of Collaboration   +      +     o       o
Analysis Capabilities      o      o     +       o
Graph Generation           o      o     +       o
License                    +      +     +       o
Hardware Requirements      o      -     o       o

Legend: + good, o average, - poor; l.v. = limited version.

In the next section, we describe the updated architecture of the Context-Aware Intelligence Integrator, which includes a MISP instance. This is an updated version of the one described in [DIS41].

4.3 Context-Aware Intelligence Integrator Architecture

The first proposed architecture for this component was described in [DIS41]. This document presents the final architecture, an updated version of the previous one, shown in Figure 35. The main innovation is the usage of the MISP platform: the Context-Aware Intelligence Integrator has been divided into two main modules: (i) a MISP Instance, in charge of gathering data from both the OSINT-based components and all the monitored infrastructures, and of sending the enriched IoCs to the related SIEMs; (ii) the Heuristic Module, in charge of performing the heuristic analysis, with the final aim of computing the Threat Score, enriching the data coming from the OSINT-based components, and sending it back to the MISP Instance. More details about the heuristic analysis are given in Section 4.4.

The integration among the SIEMs and the Context-Aware Intelligence Integrator will be easier thanks to the adoption of MISP. The objective is to use, as much as possible, the built-in sharing capabilities of the platform when this interaction takes place, such as a ZeroMQ publish-subscribe model [IMA14]. According to the DiSIEM principles stated in [DIS22], SIEMs should not be modified due to our extensions and no additional or significant manual work should be required to operate with them. The Context-Aware Intelligence Integrator must therefore provide the required flexibility when interacting with these systems.

Figure 35 - Context-Aware Intelligence Integrator Architecture.

Conveniently, MISP comes with so-called "MISP-modules", used both for ad-hoc import and export of threat information. For instance, for ArcSight [MIC18], one of the SIEMs considered in DiSIEM, a specific export module could be used in order to export internal events in the Common Event Format (CEF), supported by ArcSight itself. Moreover, if needed, new modules could be created from scratch and integrated with the MISP Instance without modifying the core functionalities of the platform.
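As an illustration of what such an export step produces, the following minimal sketch formats a single indicator as a CEF line. It is not the MISP export module itself: the function name, the vendor, product and version strings, and the choice of extension keys are assumptions made for this example. Only the general layout (a "CEF:0" prefix, pipe-separated header fields, then key=value extensions) follows the CEF convention.

```python
def _escape_header(value: str) -> str:
    # CEF header fields escape backslashes and pipes with a backslash.
    return value.replace("\\", "\\\\").replace("|", "\\|")


def _escape_extension(value: str) -> str:
    # CEF extension values escape backslashes and equals signs.
    return value.replace("\\", "\\\\").replace("=", "\\=")


def ioc_to_cef(signature_id: str, name: str, severity: int, extensions: dict) -> str:
    """Format one indicator as a CEF:0 line (illustrative sketch only)."""
    if not 0 <= severity <= 10:
        raise ValueError("severity is expected in the range 0..10")
    header = "|".join(
        _escape_header(field)
        for field in (
            "DiSIEM",          # device vendor (hypothetical)
            "ContextAwareII",  # device product (hypothetical)
            "1.0",             # device version (hypothetical)
            signature_id,
            name,
            str(severity),
        )
    )
    ext = " ".join(
        f"{key}={_escape_extension(str(val))}" for key, val in extensions.items()
    )
    return f"CEF:0|{header}|{ext}"


# Example indicator; the IP address comes from a documentation-only range.
line = ioc_to_cef(
    "ioc-ip-dst",
    "Blacklisted IP | OSINT",
    7,
    {"dst": "203.0.113.5", "msg": "score=1.67"},
)
print(line)
```

Keeping the formatting in a small standalone function mirrors the idea stated above: the export logic lives in a module of its own, so the core of the platform does not need to change.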


Another feature that must be provided, according to the integration plan in [DIS22], consists in letting the SIEMs retrieve specific events directly from the component database. The MISP database is a MySQL database where all the events, with the related information, are stored in a structured way. Each external component will be able to interact directly with it through the REST APIs provided by MISP, for example through a specific Python library called PyMISP [MIS]. Each component will have a specific authentication key, generated by MISP itself, which will be checked every time a request is received by the platform. Thanks to specific export modules, the stored events can be retrieved in various formats (e.g., JSON, XML, and STIX).

With the adoption of MISP, the interaction with the OSINT-based components will also be easier to implement. In fact, the FCiências.ID Threat Predictor will also rely on its own MISP instance. In this way, the information exchange between the two instances will be performed in an automated way, exploiting the built-in sharing capabilities of MISP itself. Regarding the communication with the listening247 platform, a specific import module will be implemented and integrated in the MISP instance of our component. Alternatively, the listening247 platform could inject data directly into our MISP instance through the REST APIs provided by the latter (in this case the MISP format should be used for representing incoming threat data).

As explained before, MISP will not deal directly with the heuristic analysis. In order not to modify the core of the platform, a good choice is to integrate it with another module in charge of performing this task, as can be seen in Figure 35.

The Heuristic Module will receive all the data coming from the monitored infrastructures through MISP; this could be dynamic data (e.g., IoCs detected in the infrastructures) as well as static and generic information about a specific infrastructure (e.g., used sensors, operating systems, specific lists of IP addresses). This data will not be stored in the MISP database; it could be represented through the JSON format (e.g., STIX, MISP events), or through simple documents related to some generic information. Besides, it will not be shared with other entities, and it will be used only by the heuristic analysis. For this reason, it could be easier to store this information in a different way, for example using a private non-relational database such as MongoDB⁶, which simplifies the retrieval of the information by the heuristic engine when needed and gives full control of how the analysis is performed.

The enriched IoC, with the computed Threat Score, will be sent back to MISP and stored in the MISP database, considering that it will be shared externally. Optionally, some IoCs received from the monitored infrastructures could be stored in the MISP database, in order to perform basic automated correlation steps when some OSINT data are received, before performing the heuristic analysis.
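The key-authenticated REST retrieval described in this section can be sketched as follows. This is only a sketch under stated assumptions: it uses the Python standard library rather than PyMISP, it prepares the request without sending it, and the instance URL, the key and the tag name are placeholders. The endpoint path and the filter names should be checked against the documentation of the deployed MISP version.

```python
import json
from urllib.request import Request


def build_event_search(base_url: str, auth_key: str, filters: dict) -> Request:
    """Prepare (but do not send) an authenticated event search request."""
    body = json.dumps(filters).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/events/restSearch",
        data=body,
        method="POST",
        headers={
            "Authorization": auth_key,  # per-component key issued by the MISP instance
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
    )


req = build_event_search(
    "https://misp.example.org",  # hypothetical instance URL
    "COMPONENT_API_KEY",         # placeholder, not a real key
    {"returnFormat": "json", "tags": ["threat-score"]},  # hypothetical tag name
)
print(req.get_method(), req.full_url)
```

Because each component holds its own key, revoking or rotating the access of one SIEM or OSINT component does not affect the others.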

6 https://en.wikipedia.org/wiki/MongoDB


Regarding the Heuristic Module, the tasks performed by the Heuristic Engine and by the Threat Score Agent are the same as the ones described in [DIS41], with the difference that the latter will receive only the computed threat score for directly updating the related event in the MISP database. It does not need to interact with the Heuristic Module database or to provide any interface for communicating with the SIEMs, considering that this job will be handled by MISP, as described before. The next section gives more details about the heuristic analysis.

4.4 Context-Aware Threat Score Analysis

This section reviews different aggregation operators that could be used as a function to evaluate and analyse the heuristics defined for the threat score computation.

Aggregation operators are mathematical functions that are used to combine information (e.g., N numerical values) into a single datum. There exists a large number of aggregation operators, which differ in the assumptions made about the data (data types) and in the type of information that can be incorporated in the model [VIC07]. The remainder of this section describes the most useful aggregation operators from the literature.

1. Arithmetic Mean (AM): It is the sum of a collection of values (X) divided by the total number of elements (t) in the collection [VIC07], [SRI09], as shown in Equation 1.

\[
AM = \frac{\sum_{i=1}^{t} X_i}{t} \tag{1}
\]

Having, for instance, five elements to evaluate, the arithmetic mean would be equal to AM = (X1 + X2 + X3 + X4 + X5)/5. One aspect worth noting in connection with the arithmetic mean is that all of the values should be on the same scale. It is not possible to compute the average of three litres, five centimetres, and four kilograms, because they cannot be converted to common units.

2. Geometric Mean (GM): It is a type of mean or average that indicates the central tendency or typical value of a set of numbers by using the product of their values (as opposed to the arithmetic mean, which uses their sum). The geometric mean is defined as the t-th root of the product of the t numbers, as expressed in Equation 2.

\[
GM = \left( \prod_{i=1}^{t} X_i \right)^{1/t} \tag{2}
\]

Having, for instance, five elements to evaluate, the geometric mean will be equal to GM = (X1·X2·X3·X4·X5)^(1/5). The geometric mean is more stable than the arithmetic mean, in the sense of being less affected by outlying values. However, when any of the values in a set is zero, the geometric mean over that set is also zero [SRI09].

3. Harmonic Mean (HM): It is another measure of central tendency that is typically used to combine rates, and for score aggregation. It is defined as the reciprocal of the arithmetic mean of the reciprocals of the given set of observations, as depicted in Equation 3.

\[
HM = \frac{t}{\sum_{i=1}^{t} \frac{1}{X_i}} \tag{3}
\]

Having, for instance, five elements to evaluate, the harmonic mean will be equal to HM = 5/(1/X1 + 1/X2 + 1/X3 + 1/X4 + 1/X5). The harmonic mean is undefined if any of the values in the set is zero [SRI09].

4. Adjusted Harmonic Mean (AHM): In order to avoid those cases in which the harmonic mean is undefined (i.e., when any value in the set is zero), an adjusted harmonic mean is proposed in Equation 4, with E = 0.01 [SRI09].

\[
AHM = \frac{t}{\sum_{i=1}^{t} \frac{1}{X_i + E}} - E \tag{4}
\]

Having, for instance, five elements to evaluate, the adjusted harmonic mean will be equal to AHM = 5/(1/(X1+0,01) + 1/(X2+0,01) + 1/(X3+0,01) + 1/(X4+0,01) + 1/(X5+0,01)) - 0,01.

5. Weighted Mean (WM): It is similar to the arithmetic mean, except that instead of each of the data points contributing equally to the final average, some data points contribute more than others. Equation 5 shows the mathematical computation of the weighted mean, where Pi is the weight and Xi represents the set of values, each with a non-negative weight [VIC17].

\[
WM = \sum_{i=1}^{t} X_i \cdot P_i \tag{5}
\]

Having, for instance, five elements to evaluate, the weighted mean will be equal to WM = X1·P1 + X2·P2 + X3·P3 + X4·P4 + X5·P5.

6. Ordered Weighted Averaging (OWA): It models an aggregation process in which a sequence A of n scalar values (a1, …, an) is ordered decreasingly and then weighted according to their ordered position by means of a weighting vector W (w1, …, wn), whose components are denoted Pi in Equation 6.

\[
OWA(a_1, \ldots, a_n) = \sum_{i=1}^{n} P_i \cdot B_i \tag{6}
\]

Where:
Pi corresponds to the weight of the i-th datum after ordering.
Bi corresponds to a permutation of the ai such that they are ordered from the largest to the lowest. Therefore, B1 will be the largest of the ai and Bn will be the lowest of the ai.

Weights should be positive and add up to one. In this way, weights allow expressing whether the importance is given to low, high or central data [VIC07], [SAD11], [CHR10]. Having, for instance, five elements to evaluate, the ordered weighted average will be equal to OWA = P1·AMax + P2·AMax-1 + P3·AMax-2 + P4·AMax-3 + P5·AMax-4, where AMax denotes the largest value, AMax-1 the second largest, and so on.
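The six operators above can be sketched in a few lines of Python. This is an illustrative sketch rather than the project's implementation: the function names are chosen for this example, and the input values and weights are those of attribute A1 in the worked example that follows.

```python
import math


def arithmetic_mean(x):
    # Equation 1: sum of the values divided by their count.
    return sum(x) / len(x)


def geometric_mean(x):
    # Equation 2: t-th root of the product; zero if any value is zero.
    return math.prod(x) ** (1 / len(x))


def harmonic_mean(x):
    # Equation 3: undefined (None here) when any value is zero.
    if any(v == 0 for v in x):
        return None
    return len(x) / sum(1 / v for v in x)


def adjusted_harmonic_mean(x, e=0.01):
    # Equation 4: shift every value by E, then subtract E from the result.
    return len(x) / sum(1 / (v + e) for v in x) - e


def weighted_mean(x, p):
    # Equation 5: each value multiplied by its own weight.
    return sum(v * w for v, w in zip(x, p))


def owa(x, p):
    # Equation 6: weights are applied to the values sorted in decreasing order.
    return sum(w * b for w, b in zip(p, sorted(x, reverse=True)))


weights = [0.10, 0.25, 0.40, 0.15, 0.10]
a1 = [3, 4, 3, 1, 5]
a2 = [5, 2, 2, 4, 0]  # contains a zero, so GM collapses and HM is undefined

for label, fn in [("AM", arithmetic_mean), ("GM", geometric_mean),
                  ("HM", harmonic_mean), ("AHM", adjusted_harmonic_mean)]:
    print(label, fn(a1), fn(a2))
print("WM", weighted_mean(a1, weights), "OWA", owa(a1, weights))
```

The only difference between `weighted_mean` and `owa` is the sort: WM ties each weight to a specific heuristic, while OWA ties it to a rank position, whichever heuristic happens to occupy it.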


Example: Consider, for instance, three attributes to be evaluated (i.e., A1, A2, A3), each of which has information from five heuristics (i.e., X1, X2, X3, X4, X5), and assume that the heuristics are assigned the following weights (P1 = 0,10; P2 = 0,25; P3 = 0,40; P4 = 0,15; P5 = 0,10). The results of each of the aforementioned score aggregation functions are summarized in Table 19.

Table 19 - Examples of Score Aggregation Functions

            Heuristics            Score Aggregation Functions
Attributes  X1  X2  X3  X4  X5    AM    GM    HM    AHM   WM    OWA
A1          3   4   3   1   5     3,20  2,83  2,36  2,37  3,15  3,25
A2          5   2   2   4   0     2,60  0,00  -     0,04  2,40  2,60
A3          1   1   2   3   3     2,00  1,78  1,58  1,58  1,90  2,10

As depicted in Table 19, individual heuristic values are always non-negative (X >= 0). Such values measure the reliability of the heuristic in identifying a given threat. In addition, most of the equations result in similar values, except for GM and HM, which do not work when an individual heuristic value is zero. We can therefore select the simplest equation (AM, WM) or the most complex one (OWA).

Selected aggregation operator: Weighted Mean

In particular, the weighted mean (WM) seems to be an appropriate function to measure several heuristics and assign a weight to each of them. We can have 5 or 6 groups of heuristics, considering the criteria discussed in Section 4.1 (i.e., relevance, accuracy, timeliness, completeness and variety). Each group of heuristics will be composed of one or more attributes and will be assigned a weighting factor based on its importance in computing the final threat score.

Ingestibility will not be considered, because all the received data will already be expressed in a structured way, and the reception will be handled directly by the MISP instance. This criterion would have been really meaningful in case of reception of unstructured information, but this scenario does not concern the Context-Aware Intelligence Integrator. The analysis will focus on the other five criteria mentioned in Section 4.1, with the possibility of adding new criteria in the future.

Considerations while selecting the Threat Score Function

1. Avoid indeterminate results and/or null values;
2. The function can be used if one or more individual scores are zero;
3. Individual scores are assumed to have different weights depending on the source and the relevance of the information;
4. Weights are percentages that must sum to one.

Weighting Factors: The following criteria have been used to give weight to each attribute and to compute the threat score for the heuristic:


• Relevance: This criterion evaluates if the information associated to a given attribute is useful to identify a threat. Relevance is computed as follows:
  Attribute with no data → 0
  Optional attribute → 1
  Attribute does not identify the threat but helps in the analysis → 5
  Attribute is useful to identify the threat → 7
  Mandatory attribute to identify the threat → 10

• Accuracy: Information coming from OSINT-based components will be compared to the information coming from the infrastructure; if there is a match of one or more attributes, a score will be computed. Accuracy is computed as follows:
  Attribute with no data → 0
  Attribute has some data with no match → 1
  There is a match of one source and the infrastructure → 5
  There is a match of two sources and the infrastructure → 10

• Timeliness: It evaluates if a detected event is related to an already detected one, by the infrastructure or by the OSINT-based components, and if, for instance, such events refer to the same threat but with a different level of intrusion. Timeliness is computed as follows:
  Attribute with no data → 0
  Attribute has never been seen → 1
  Attribute has been seen with the same value → 5
  Attribute has been seen with a different value → 10

• Variety: This criterion evaluates the sources from which the information originates or is detected, e.g., infrastructure, OSINT-based components. Variety is computed as follows:
  Attribute with no data → 0
  Data come from only one source → 1
  Data come from two sources → 5
  Data come from all sources → 10

• Completeness (Cp): This criterion can be used as an overall assessment of the heuristic and not for individual score evaluation of the attributes. Each heuristic is composed of one or more attributes (e.g., Indicator is composed of nine attributes: (I1) indicator_type, (I2) modified, (I3) created, (I4) valid_from, (I5) external_reference, (I6) kill_chain_phases, (I7) pattern, (I8) OSINT_source, (I9) source_type). Completeness is measured as the number of attributes with a non-empty value over the total number of attributes, as shown in Equation 7.

\[
Cp = \frac{\text{No. of attributes with non-empty values}}{\text{Total No. of attributes}} \tag{7}
\]

For instance, let us assume that, for a given event, we have information for six attributes belonging to the heuristic Indicator; the completeness value will therefore be Cp = 6/9 = 0.67.
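The scoring scales and the completeness ratio above can be captured as simple lookup tables and one function. The sketch below is only illustrative: the dictionary keys are shortened labels invented for this example, the event values are hypothetical, and treating None and the empty string as "empty" is an assumption.

```python
# Scores transcribed from the four criteria above; key names are shortened labels.
RELEVANCE = {"no_data": 0, "optional": 1, "helps_analysis": 5, "useful": 7, "mandatory": 10}
ACCURACY = {"no_data": 0, "no_match": 1, "one_source_match": 5, "two_sources_match": 10}
TIMELINESS = {"no_data": 0, "never_seen": 1, "seen_same_value": 5, "seen_different_value": 10}
VARIETY = {"no_data": 0, "one_source": 1, "two_sources": 5, "all_sources": 10}

INDICATOR_ATTRIBUTES = (
    "indicator_type", "modified", "created", "valid_from", "external_reference",
    "kill_chain_phases", "pattern", "OSINT_source", "source_type",
)


def completeness(event, attributes=INDICATOR_ATTRIBUTES):
    """Equation 7: share of attributes that carry a non-empty value."""
    filled = sum(1 for name in attributes if event.get(name) not in (None, ""))
    return filled / len(attributes)


# Hypothetical event with six of the nine Indicator attributes filled in.
event = {
    "indicator_type": "malicious-activity",
    "modified": "2018-02-01T10:00:00Z",
    "created": "2018-01-30T08:00:00Z",
    "valid_from": "2018-01-30T08:00:00Z",
    "kill_chain_phases": "delivery",
    "source_type": "OSINT",
    "pattern": "",  # present but empty, so it does not count
}
print(round(completeness(event), 2))
```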


Proposed Threat Score Function (TS)

The proposed function to compute the threat score is the sum of all heuristic values, each multiplied by its corresponding weighting factor. The latter considers the relevance (R), accuracy (A), timeliness (T) and variety (V) criteria. The sum is then multiplied by the completeness parameter (Cp), as depicted in Equation 8.

\[
TS = Cp \cdot \left( \sum_{i=1}^{t} X_i \cdot P_i \right) \tag{8}
\]

Example: Let us consider the Indicator heuristic, composed of nine attributes, some of which have provided the information shown in Table 20.

Table 20 - Evaluation of the Indicator Heuristic

Indicator                  R   A   T   V   Total  Weight (Pi)  Value (Xi)
Indicator_type (I1)        7   10  5   1   23     0,175        2
Modified (I2)              1   1   10  10  22     0,17         3
Created (I3)               1   10  1   10  22     0,17         2
Valid_from (I4)            1   5   5   5   16     0,12         2
External_reference (I5)    0   0   0   0   0      0            Empty
Kill_chain_phase (I6)      7   5   10  1   23     0,175        5
Pattern (I7)               0   0   0   0   0      0            Empty
OSINT_source (I8)          0   0   0   0   0      0            Empty
Source_type (I9)           10  5   5   5   25     0,19         1

The Threat Score for this heuristic is therefore computed as: TS(I) = 6/9 × (2×0,175 + 3×0,17 + 2×0,17 + 2×0,12 + 5×0,175 + 1×0,19) = 1,67.
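The computation of Equation 8 on the data of Table 20 can be reproduced with the short sketch below. One point is an assumption rather than something the text states: each weight Pi is derived here as the attribute's R+A+T+V total divided by the sum of all totals, which matches the weights listed in Table 20 after rounding. The function and variable names are chosen for this example.

```python
# (R, A, T, V, Xi) per attribute, transcribed from Table 20; None marks an empty value.
TABLE_20 = {
    "indicator_type": (7, 10, 5, 1, 2),
    "modified": (1, 1, 10, 10, 3),
    "created": (1, 10, 1, 10, 2),
    "valid_from": (1, 5, 5, 5, 2),
    "external_reference": (0, 0, 0, 0, None),
    "kill_chain_phases": (7, 5, 10, 1, 5),
    "pattern": (0, 0, 0, 0, None),
    "OSINT_source": (0, 0, 0, 0, None),
    "source_type": (10, 5, 5, 5, 1),
}


def threat_score(rows):
    """Equation 8: completeness times the weighted sum of heuristic values."""
    totals = {name: sum(row[:4]) for name, row in rows.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return 0.0  # no criterion scored anything, so there is nothing to weight
    filled = [name for name, row in rows.items() if row[4] is not None]
    cp = len(filled) / len(rows)  # Equation 7
    # Assumed weighting: Pi = Total_i / sum of all totals (not stated in the text).
    weighted_sum = sum(rows[name][4] * totals[name] / grand_total for name in filled)
    return cp * weighted_sum


print(round(threat_score(TABLE_20), 2))
```

With unrounded weights the weighted sum is 328/131 and the score is about 1.669, which rounds to the 1,67 obtained above with the rounded weights.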


5 Summary and Conclusions

This deliverable presents an analysis of related work regarding the application of machine learning techniques to process and select relevant security-related OSINT. This analysis shows that most work relies on a set of keywords, therefore being sensitive to the completeness of the keyword set. Research was also presented providing evidence that Twitter is an important aggregator of early available security-related OSINT information.

The architecture of a tool for the integration of public IP blacklists knowledge is presented. This tool was validated in the environment of one DiSIEM partner over a period of 5 months. This resulted in an increase of 80 security incidents per month that required research, and in an increase of 2.57% in precision when compared to the existing approach relying on public and private/paid blacklists.

A Twitter-based system is proposed for threat detection, which implements an information processing pipeline that collects tweets, filters them based on the specification of a monitored infrastructure, and classifies the remaining tweets as either relevant or not. Considering the classification stage, a large number of experiments were conducted to find the best model architecture and parameterization. This included model design variables and model/learning algorithm hyper-parameters, considering Support Vector Machines (SVM), Multi-Layer Perceptron Neural Networks and Convolutional Neural Networks (CNN). From the comparison of the first two, an SVM classifier was obtained that achieves high TPR and TNR (~90%) when classifying tweets posted after those in the training data set and by additional users. The CNN-based classifier, evaluated on the same data sets, further improved the results with significantly higher TNR (94%-97%) and slightly better TPR. A clustering methodology is proposed to decrease the amount of information that is produced at the output of the classifier stage. Altogether, the classifiers followed by the clustering approach are able to maximize the relevant information (TPR of 90%), minimize irrelevant information (false positive rate of 6%), and aggregate related information (only 20% of the relevant tweets are presented after clustering).

An architecture based on the commercial listening247 platform is also presented for security-related OSINT discovery. It leverages the combined strength of ontologies and machine-learning approaches. Experimental results are given for its noise-filtering and meta-information prediction stages, showing that an optimised voting ensemble approach is effective for noise filtering and that a bagging ensemble of SVMs is effective for predicting meta-information that is typically used by cybersecurity professionals.

Finally, the architecture of the Context-Aware Intelligence Integrator component is presented. This component gathers data in a standardised way from the OSINT discovery components and from the monitored infrastructure and computes a threat score that enriches IoCs before sharing them with the SIEMs. This enables SOC analysts to prioritize the analysis of incidents.


6 References

[ARB13] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, “An extensive comparative study of cluster validity indices,” Pattern Recognition, vol. 46, no. 1, 2013.
[CAM13] Rodrigo Campiolo, Luiz Arthur F. Santos, Daniel Macêdo Batista, and Marco Aurélio Gerosa. 2013. Evaluating the Utilization of Twitter Messages As a Source of Security Alerts. In 28th Annual ACM Symposium on Applied Computing.
[CHR10] C. Cornelis, P. Victor, E. Herrera-Viedma, “Ordered Weighted Averaging Approaches for Aggregating Gradual Trust and Distrust,” in XV Spanish Congress on Technologies and Fuzzy Logic ESTYLF, 2010.
[CLE17] C. Sauerwein, C. Sillaber, A. Mussmann, R. Breu, “Threat Intelligence Sharing Platforms: An Exploratory Study of Software Vendors and Research Perspectives,” in International Conference on Wirtschaftsinformatik, 2017.
[CMY17] Chao Zhao, Min Zhao, Yi Guan. Classification of entities via their descriptive sentences. ArXiv:1711.10317, 2017.
[CSI18] CSIRT Gadgets Foundation, “CSIRT Gadgets Making the Internet a better place,” [Online]. Available: http://csirtgadgets.org/. [Accessed February 2018].
[COR95] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, 1995.
[DiS21] DiSIEM Consortium, “In-depth analysis of SIEMs extensibility”. DiSIEM Project Deliverable 2.1, February 2017.
[DiS22] DiSIEM Consortium, “Reference architecture and integration plan”. DiSIEM Project Deliverable 2.2, August 2017.
[DiS41] DiSIEM Consortium, “Techniques and Tools for OSINT-based Threat Analysis”. DiSIEM Project Deliverable 4.1, August 2017.
[ENI14] ENISA, “Actionable Information for Security Incident Response,” 2014.
[FLO16] F. Skopik, G. Settanni, R. Fiedler, “A problem shared is a problem halved: A survey on the dimensions of collective cyber defense through security information sharing,” Computers & Security, vol. 60, pp. 154-176, 2016.
[GEN15] X. Geng and K. Smith-Miles, “Incremental learning,” Encyclopedia of biometrics, 2015.
[GUY09] I. Guyon, U. Von Luxburg, and R. C. Williamson, “Clustering: Science or art,” in NIPS 2009 workshop on clustering theory, 2009.


[HOV12] Hovsepyan, A., Scandariato, R., Joosen, W., & Walden, J. (2012). Software vulnerability prediction using text analysis techniques. Proceedings of the 4th International Workshop on Security Measurements and Metrics - MetriSec ’12, 7. https://doi.org/10.1145/2372225.2372230
[IMA14] iMatix Corporation, “Distributed Messaging - zeromq,” 2014. [Online]. Available: http://zeromq.org/. [Accessed February 2018].
[JAI10] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters, vol. 31, no. 8, Jun. 2010.
[KIM14] Yoon Kim. Convolutional neural networks for sentence classification. ArXiv:1408.5882, 2014.
[KÜH14] Kührer, M., Rossow, C., & Holz, T. (2014). Paint it black: Evaluating the effectiveness of malware blacklists. In: Stavrou A., Bos H., & Portokalidis G. (eds) Research in Attacks, Intrusions and Defenses. RAID 2014, pages 1-21. Lecture Notes in Computer Science, vol 8688. Springer, Cham.
[LES14] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of massive datasets. Cambridge University Press, 2014.
[LIA16] X. Liao, K. Yuan, X. Wang, Z. Li, L. Xing, and R. Beyah. Acing the IOC game: Toward automatic discovery and analysis of open-source cyber threat intelligence. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 755–766. ACM, 2016.
[MAC67] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[MAT12] M. L. Mathews, P. Halvorsen, A. Joshi, and T. Finin. A collaborative approach to situational awareness for cybersecurity. In Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2012 8th International Conference on, pages 216–222. IEEE, 2012.
[MIC18] Micro Focus, “SIEM, Enterprise Security Information and Event management System,” [Online]. Available: https://software.microfocus.com/en-us/products/siem-security-information-event-management/overview. [Accessed February 2018].


[MIK13] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
[MIS] MISP, “MISP - User Guide A Threat Intelligence Platform”.
[MIS18] MISP, “MISP - Open Source Threat Intelligence Platform & Open Standards For Threat Information Sharing,” [Online]. Available: http://www.misp-project.org. [Accessed February 2018].
[MIT14] MITRE Corporation, “Collaborative Research Into Threats,” 2014. [Online]. Available: https://crits.github.io/. [Accessed February 2018].
[MIT16] S. Mittal, P. K. Das, V. Mulwad, A. Joshi, and T. Finin, “CyberTwitter: Using Twitter to generate alerts for Cybersecurity Threats and Vulnerabilities,” in International Symposium on Foundations of Open Source Intelligence and Security Informatics. IEEE Computer Society, 2016.
[MIT17] Mittal, S., Joshi, A., & Finin, T. (2017). Thinking, Fast and Slow: Combining Vector Spaces and Knowledge Graphs. Retrieved from http://arxiv.org/abs/1708.03310
[MOH17] A. Mohaisen, O. Al-Ibrahim, C. Kamhoua, K. Kwiat, L. Njilla, “Rethinking information sharing for threat intelligence,” 2017.
[NIC15] Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2015). A Review of Relational Machine Learning for Knowledge Graphs, 1–23. https://doi.org/10.1109/JPROC.2015.2483592
[NUN16] Nunes, E., Diab, A., Gunn, A., Marin, E., Mishra, V., Paliath, V., … Shakarian, P. (2016). Darknet and Deepnet Mining for Proactive Cybersecurity Threat Intelligence, 1–6. Retrieved from http://arxiv.org/abs/1607.08583
[OAS18] OASIS, “Introduction to STIX,” OASIS, 2017. [Online]. Available: https://oasis-open.github.io/cti-documentation/stix/intro. [Accessed January 2018].
[OAS181] OASIS, “Introduction to TAXII,” [Online]. Available: https://oasis-open.github.io/cti-documentation/taxii/intro. [Accessed February 2018].
[QUE17] Queiroz, A., Keegan, B., & Mtenzi, F. (2017). Predicting software vulnerability using security discussion in social media. 16th European Conference on Cyber Warfare and Security, ECCWS 2017, 628–634. Retrieved from https://www.scopus.com/inward/record.uri?eid=2-s2.0-85027995432&partnerID=40&md5=7c9399c8e40b02319c09b9f127ccdcd2
[RIT15] A. Ritter, E. Wright, W. Casey, and T. Mitchell, “Weakly supervised extraction of computer security events from twitter,” in Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.


[ROS58] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain.” Psychological review, vol. 65, no. 6, 1958.
[RUM85] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” DTIC, Tech. Rep., 1985.
[SAB15] C. Sabottke, O. Suciu, and T. Dumitras, “Vulnerability disclosure in the age of social media: exploiting twitter for predicting real-world exploits,” in 24th USENIX Security Symposium (USENIX Security 15), 2015.
[SAD11] S. Derakhshandeh, N. Mikaeilvand, “Fuzzy Method for Identification of Aggregate Weights in Ordered Weighted Averaging Operators,” Middle-East Journal of Scientific Research, vol. 7 (3), pp. 293-295, 2011.
[SAP17] Sapienza, A., Bessi, A., Damodaran, S., Shakarian, P., Lerman, K., & Ferrara, E. (2017). Early Warnings of Cyber Threats in Online Discussions. 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 667–674. https://doi.org/10.1109/ICDMW.2017.94
[SIN08] Sinha, S., Bailey, M., & Jahanian, F. (2008). Shades of Grey: On the effectiveness of reputation based black-lists. Proceedings of the International Conference on Malicious and Unwanted Software (Malware), pages 57–64.
[SOL18] Soltra Edge, “Soltra | Cyber Threat Intelligence & Data | Cyber Defense Platform,” [Online]. Available: https://www.soltra.com/en/. [Accessed February 2018].
[SRI09] S. D. Ravana, A. Moffat, “Score Aggregation Techniques in Retrieval Experimentation,” in Twentieth Australasian Database Conference, 2009.
[TRA15] Trabelsi, Slim, et al. "Mining social networks for software vulnerabilities monitoring." New Technologies, Mobility and Security (NTMS), 2015 7th International Conference on. IEEE, 2015.
[THR15] ThreatConnect, “THREAT INTELLIGENCE PLATFORMS Everything You’ve Ever Wanted to Know But Didn’t Know to Ask,” 2018. [Online]. Available: https://www.threatconnect.com/wp-content/uploads/ThreatConnect-Threat-Intel-Platform-ebook.pdf. [Accessed February 2018].
[THR17] ThreatConnect, “SIEM + Threat Intelligence: Quickly Identify the Threats that Matter to You,” [Online]. Available: https://www.threatconnect.com/wp-content/uploads/ThreatConnect-SIEM-Threat-Intelligence-Whitepaper.pdf. [Accessed December 2017].
[THR18] ThreatConnect, “Threat Intelligence, Analytics, and Orchestration in One Platform,” [Online]. Available: https://www.threatconnect.com/. [Accessed February 2018].


[TIB01] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of clusters in a data set via the gap statistic,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 63, no. 2, 2001.
[VIC07] V. Torra, Y. Narukawa, Modeling Decisions: Information Fusion and Aggregation Operators, Springer-Verlag Berlin Heidelberg, 2007.
[VIC17] V. Torra, “Aggregation functions and information fusion. Modeling decisions,” 2017. [Online]. Available: http://www.mdai.cat/ifao/slides/transparencies.SFLA.2017.pdf. [Accessed February 2018].
[WEI09] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, “Feature hashing for large scale multitask learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, 2009.
[WIE18] W. Tounsi, H. Rais, “A survey on technical threat intelligence in the age of sophisticated cyber attacks,” Computers & Security, vol. 72, pp. 212-233, 2018.
[ZAK14] M. J. Zaki and W. Meira Jr, Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press, 2014.
[ZHU16] Z. Zhu and T. Dumitras, “FeatureSmith: Automatically Engineering Features for Malware Detection by Mining the Security Literature,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.


List of Acronyms

Acronym        Description
ACDC           Advanced Cyber Defence Center
ANN            Artificial Neural Networks
CVE            Common Vulnerabilities and Exposures
CSIRTs         Computer Security Incident Response Teams
CVSS           Common Vulnerability Scoring System
CTI            Cyber Threat Intelligence
DDOS           Distributed Denial of Service
ENISA          European Agency for Network and Information Security
EM             Expectation-Maximization
GRU            Gated Recurrent Neural Networks
HUMINT         Human Intelligence
IODEF          Incident Object Description and Exchange Format
IoC            Indicators of Compromise
IPS            Intrusion Prevention Systems
IPA            Japanese Information-technology Promotion Agency
LDA            Latent Dirichlet Allocation
LSTM           Long Short-Term Memory
NVD            National Vulnerability Database
ISAC council   National Council of Information Sharing and Analysis Center
NLP            Natural Language Processing
OpenDXL        Open Data eXchange Layer
OSINT          Open Source Intelligence
SOC            Security Operation Center
SOCMINT        Social Media Intelligence
SaaS           Software as a Service
SDO            STIX Domain Objects
SRO            STIX Relationship Objects
STIX           Structured Threat Information eXpression
SVM            Support Vector Machines
TTPs           Tactics, Techniques and Procedures
TC             Technical Committee
TF             Term Frequency
TABI           Trust Assessment of Blacklists Interface
TAXII          Trusted Automated eXchange of Indicator Information
DHS            U.S. Department of Homeland Security
