Post on 18-Jul-2020
Controlling Leakage and Disclosure Risk in Semantic Big Data Pipelines
Ernesto Damiani (joint work with Paolo Ceravolo)
Outline
• Introduction
• Prerequisites and Vision
• New Big Data Threats
• Some ideas for a KNOW, PREVENT, DETECT, COUNTER paradigm to counter them.
BIG DATA INITIATIVE
Drive open research & innovation collaboration with UAE and international institutes and organisations to carry out world-leading research and deliver tangible value, training, knowledge transfer and skills development in line with the UAE strategic priorities in the areas of: smart enterprise, smart infrastructure & smart society
Security Research Center
SECURITY OF THE GLOBAL ICT INFRASTRUCTURE
• Network and Communications Security
• Business Process Security and Privacy
• Security and Privacy of Big Data Platforms
SECURITY ASSURANCE
• Security Risk Assessment and Metrics
• Continuous Security Monitoring and Testing
DATA PROTECTION AND ENCRYPTION
• High Performance Homomorphic Encryption
• Lightweight Cryptography and Mutual Authentication
SESAR LAB
• Secure Software Architectures and Knowledge-based systems lab (SESAR) http://sesar.dti.unimi.it
• Located on the new campus in Crema, 40 km south-east of Milan
• Industry collaborations: SAP, British Telecom, Nokia Siemens, Cisco, Telecom Italia
• Part of the BigData Community
Some activities
• Big Data is not just a technological advance but represents a paradigm shift in extracting value from complex multi-party processes
Vision
From classic data warehouse to Big Data
Internal vs. External data sources
Processing Models
• Batch vs streaming
• Hash vs sketch
Data Models
• DATA MODELS:
– Non-relational (attribute-value)
– Extended relational (column or row-partitioned)
– Neo-relational (hybrid)
• Large Data Sharing Infrastructure to feed MR computations
Designing Data Representations for Big Data Applications
• Design as they teach you at school
• Scale up -> Denormalize Instance (drop indexes, triggers)
• Solve problems with read/write precedence -> Create write-to and read-from data replicas (keep consistency periodically)
• Memcache the Denormalized Instance -> (lose ACID)
Relational denormalization refresher
• Simple concept: flatten a repeating group in a single table
• Instead of EMP (E#, D#, Ename) - DEPT (D#, DEPT, Address)
• Use EMP (E#, Ename, DEPT, Address)
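The flattening step can be sketched in a few lines of Python. This is only an illustration: the sample rows and the helper name `denormalize` are assumptions and do not come from the slides.

```python
# Illustrative sample data (assumed): EMP(E#, D#, Ename) and DEPT(D#, DEPT, Address)
emp = [
    {"E#": 1, "D#": 10, "Ename": "Alice"},
    {"E#": 2, "D#": 10, "Ename": "Bob"},
    {"E#": 3, "D#": 20, "Ename": "Carol"},
]
dept = [
    {"D#": 10, "DEPT": "Research", "Address": "Crema"},
    {"D#": 20, "DEPT": "Sales", "Address": "Milan"},
]


def denormalize(emp_rows, dept_rows):
    """Flatten the repeating group: copy DEPT and Address into every EMP row."""
    dept_by_id = {d["D#"]: d for d in dept_rows}
    flat = []
    for e in emp_rows:
        d = dept_by_id[e["D#"]]
        flat.append({"E#": e["E#"], "Ename": e["Ename"],
                     "DEPT": d["DEPT"], "Address": d["Address"]})
    return flat


flat_emp = denormalize(emp, dept)
```

In the flattened table the department name and address are repeated once per employee, which is the redundancy that later has to be kept consistent on update.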
Denormalization drawbacks
• Makes rows longer -> longer data transfers
• Needs more RAM for in-memory processing
• Redundant relationships improve performance at the expense of update overhead
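Memcache
Typical usage:
public Data readData(String query) {
    // Try the cache first
    Data answer = memcache.execute(query);
    if (answer == null) {
        // Cache miss: read from the database, then store the result under the query key
        answer = database.read(query);
        memcache.write(query, answer);
    }
    return answer;
}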
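The same read path can be tried out without a memcached server. The sketch below is an assumption-laden stand-in: a plain dictionary plays the role of the cache, and `CachedReader` and `CountingDatabase` are hypothetical names introduced only for the illustration.

```python
class CachedReader:
    """Cache-aside read path: try the cache, fall back to the database."""

    def __init__(self, database):
        self.database = database   # any object exposing read(query)
        self.cache = {}            # a dict standing in for a memcached client

    def read_data(self, query):
        answer = self.cache.get(query)
        if answer is None:
            # Cache miss: read from the database and remember the answer
            answer = self.database.read(query)
            self.cache[query] = answer
        return answer


class CountingDatabase:
    """Hypothetical backing store that counts how often it is hit."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def read(self, query):
        self.reads += 1
        return self.rows.get(query)


db = CountingDatabase({"emp:1": "Alice"})
reader = CachedReader(db)
first = reader.read_data("emp:1")
second = reader.read_data("emp:1")
```

The second call is served from the cache, so the database is read only once. Nothing here invalidates the cached copy when the database changes, which is where the ACID guarantees are given up.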
Low-level representation
• Key-value data-stores
• Persistent, distributed (key, value) maps
• Organized in regions held by different servers
• Every entity is a set of key-value pairs
Key-value reminder
• A key has multiple components, specified as an ordered list.
– The major key identifies the entity and consists of the leading components of the key.
– The subsequent components are called minor keys. This organization is similar to a directory path specification in a file system (/Major/minor1/minor2/).
• The "value" part of the key-value pair is an uninterpreted string of bytes of arbitrary length
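A minimal sketch of the major/minor key layout, assuming keys are serialized as slash-separated paths as in the file-system analogy. The helper names `make_key` and `major_of` and the sample entries are hypothetical.

```python
def make_key(major, minor=()):
    """Build a path-like key from ordered major and minor components."""
    return "/" + "/".join(list(major) + list(minor)) + "/"


def major_of(key, major_len):
    """Return the leading components that identify the entity."""
    return tuple(key.strip("/").split("/")[:major_len])


# Values are uninterpreted byte strings
store = {}
store[make_key(["Employee", "42"], ["Data", "Photo"])] = b"\x89PNG..."
store[make_key(["Employee", "42"], ["Data", "DeptID"])] = b"D7"
store[make_key(["Employee", "43"], ["Data", "DeptID"])] = b"D9"

# All pairs of one entity share the same major key
employee_42 = {k: v for k, v in store.items()
               if major_of(k, 2) == ("Employee", "42")}
```

Because every pair of an entity starts with the same major key, a store can route all of them to the same region.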
Example
"Employee" : {
    "Data" : {
        "EmpID" : "anyByteArray"
        "Photo" : "anyByteArray"
        "DeptID" : "anyByteArray"
    }
}
REGION 1
"Department" : {
    "DeptID" : "anyByteArray"
    "DeptDescription" : "anyByteArray"
}
REGION 2
This is a key, not a column name!
Denormalized Example
"Employee" : {
    "EmpData" : {
        "Photo" : "anyByteArray"
        "EmpID" : "anyByteArray"
    }
    "DeptData" : {
        "Description" : "anyByteArray"
        "DeptID" : "anyByteArray"
        "DeptLocation" : "anyByteArray"
    }
}
REGION 1
Consensus
• Since data items are replicated, operations can be attempted concurrently on replicas
• Synchronization using leader election (Paxos)
• Features
– Reliability and availability
– easy-to-understand semantics
– performance, throughput, acceptable latency
• http://labs.google.com/papers/chubby-osdi06.pdf
Data Batch processing: Map/Reduce
• Map/Reduce is a programming model for efficient distributed computing
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
– Data routing based on keys, reducing seeks
– Pipelining
• A good fit for a lot of applications
– Log processing
– Web index building
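The Input | Map | Shuffle & Sort | Reduce | Output stages can be mimicked on a single machine. The sketch below is a toy, single-process rendering of the programming model and not Hadoop code; the function names and the sample log lines are assumptions.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (key, 1) pair for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle & Sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's values into one count."""
    return [(key, sum(values)) for key, values in grouped]


log_lines = ["GET /index", "GET /about", "POST /index"]
counts = reduce_phase(shuffle_and_sort(map_phase(log_lines)))
```

In a real deployment the grouping step is what routes all pairs with the same key to the same reducer.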
Practical MapReduce = HDFS + Hadoop
Locality optimizations
• Map-Reduce queries HDFS for locations of input data
• Map tasks are scheduled close to the inputs when possible
Risk and Threats
Risk Components
Big Data Threats: Breach
• In terms of the ISO 15408 model, a data breach occurs when "a digital information asset is stolen by attackers by breaking into the ICT systems or networks where it is held/transported"
• Big Data Breach: theft of a Big Data asset executed by breaking into the ICT infrastructure of a collector, transformer, processor or user who holds it.
– Many attacks documented in the field can be classified as Big Data Breaches involving Data Source assets
– 2014 Target data breach involved 40 million debit and credit card numbers.
• A Big Data Breach requires pro-active hostile behavior (the break-in)
Big Data Threats: Leak
• Big Data Leak can be defined as the (total or partial) disclosure of a Big Data Asset at a certain stage of its lifecycle.
– A Big Data Leak can happen when Big Data are (unwillingly) disclosed by the owner to the provider of an outsourced process, e.g. computing data analytics.
• In terms of the attacker model, Big Data Leak can be exploited even by an honest-but-curious attacker.
Big Data Threats: Degradation
• Big Data Degradation can be defined as the injection of a doctored version of a Big Data Asset at a certain stage of its lifecycle.
– Big Data Degradation can happen when Big Data are poisoned by the provider of an outsourced process, e.g. computing data analytics.
• In terms of the attacker model, Big Data Degradation requires pro-active hostile behavior (the injection).
Big Data Threats as APTs
• An advanced persistent threat (APT) is a set of stealthy and continuous processes, often orchestrated by human(s) targeting a specific entity.
– "Advanced": signifies sophisticated techniques using malware to exploit vulnerabilities in systems.
– "Persistent": continuously monitoring and extracting data from a specific target.
The Silos problem
• Different data are held by different departments
• Representation and processing choices were made independently and may conflict
• Regulatory differences in collection and usage may make merging a challenge
• Early merge, late merge or never merge?
Data representation
• The implications for data modelling and semantics have been masterfully discussed in several works...
• However, less attention has been devoted to aspects that are usually secondary in centralized approaches
• One of these aspects is the implication of pre-injection of Join for Data Loss Prevention...
ESWC2016
Breaking the Silos
Tradeoffs
• At ingestion time two tradeoffs must be made:
– I/O per request vs Total Data Volume: denormalization, if done well, brings more locality to data and the amount of I/O per request decreases.
• A normalized relational store has to query multiple tables to fulfill each request, leading to non-localized fetches
• Non-localized fetches lead to more I/O, as each fetch requires a read and each read has a "block size" minimum.
– Processing Complexity vs Total Data Volume:
• Non-localized fetches are followed by assembling operations that require CPU time.
• Denormalized data processing is simpler, but at the cost of increased total data volume in the store.
Transparent De-normalization
• Big Data tools support transparent denormalising at data ingestion time.
• The user of a Big Data computation may well be unaware of
1. the number of replicas at run time
2. the Region border crossings generated at ingestion time for efficiency reasons.
De-normalization gray area
[Figure: Analytics Algorithms, Available Observation Space, Context, Gray Area]
Degradation via faulty values
Source: [13] with thanks
Need for a de-normalization index
• The "gray area" is a critical issue for Big Data Leak prevention, especially for Big-Data-as-a-Service
• The global likelihood of exposure of data in the gray area can be estimated via a Big Data storage's degree of de-normalization, or D-index [11]
– (Normalized) median of the number of replicas per data item held in the Big Data storage during a reference time interval Δ
• Measurable via trusted probes [12], more later
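A rough sketch of how such an index could be computed from probe output. The slides only say the median is normalized, so dividing by a maximum replica count (`max_replicas`) is an assumption made here for illustration, as are the function name `d_index` and the sample counts.

```python
from statistics import median


def d_index(replica_counts, max_replicas):
    """Median number of replicas per data item, scaled into [0, 1].

    Scaling by max_replicas is an assumption; the slides do not say
    how the median is normalized.
    """
    if not replica_counts or max_replicas <= 0:
        raise ValueError("need at least one item and a positive max_replicas")
    return min(1.0, median(replica_counts.values()) / max_replicas)


# Hypothetical probe output: replicas observed per data item during the interval
observed = {"emp:1": 1, "emp:2": 3, "dept:10": 4, "dept:20": 2, "emp:3": 3}
score = d_index(observed, max_replicas=4)
```

With these made-up counts the median is 3 replicas, so the score is 0.75 under the assumed scaling.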
From D-index to disclosure probability (1)
• The D-index is seen as a "propensity factor" to disclosure
• Intuitively, it measures the overall "unrequested trips" that data item values have made to the neighborhoods of other related data items just because the "fuel price", i.e. the storage cost in the Big Data system, is low.
From D-index to disclosure probability (2)
• The D-index itself cannot be directly identified as a probability
• Although, being normalized, its value falls in the [0,1] interval, it lacks some formal properties we would expect from a likelihood (for instance, there is no relation linking (1 - D-index) and the integrity of the data space).
• A formal mapping procedure can be devised to turn the D-index into a rigorous probability or possibility measure [6], [7].
Need for an accrual consensus index
• For each data item i, Φi is the normalized number of updates that originated each value of i currently held in the Big Data storage
• Measures the basis for the consensus that originated each value
– Small consensus basis -> higher likelihood of the data item degradation
• Inspired by the Cassandra failure index [14]
• Measurable via a trusted detector that outputs a value, Φi, associated with each item.
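One possible reading of Φi, sketched under explicit assumptions: an update supports the stored value when it wrote that same value, and counts are scaled by the largest count over all items. Neither choice is stated in the slides, and the function name `consensus_index` and the sample log are hypothetical.

```python
from collections import Counter


def consensus_index(update_log, current_values):
    """Per-item count of updates supporting the stored value, scaled to [0, 1].

    Scaling by the largest count over all items is an assumption.
    """
    support = Counter(
        item for item, value in update_log
        if current_values.get(item) == value
    )
    largest = max(support.values(), default=0)
    return {item: (support[item] / largest if largest else 0.0)
            for item in current_values}


# Hypothetical update log: (data item, value written)
log = [("temp", 21), ("temp", 21), ("temp", 21), ("temp", 21),
       ("speed", 90), ("speed", 50)]
phi = consensus_index(log, {"temp": 21, "speed": 50})
```

Here the stored speed value rests on a single update, so its index is low, which would flag it as more likely to have been degraded.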
Independent interpretations of Φ
Source: [14]
Leak vs Breach vs Degradation revisited
• Big Data Breach: adversary breaks into the system and sees (a) all available data sources and (b) the internal state of the Big Data system.
– No silos boundaries: full playground!
• Big Data Leak: adversary collaborates in the computation of analytics and takes advantage of de-normalization to attract information in regions
• Big Data Degradation: honest-but-curious adversaries will just peek, but a malicious attacker could doctor her own or other people's data, leading to wrong decisions which may cause permanent damage.
Some ideas
• Systematic study of Big Data Security practices is still in its infancy.
• Organize best practices around the work on top-level cybersecurity functions ongoing at NIST (available at http://www.nist.gov/itl/upload/draft_framework_core.pdf)
– Closely based on functions suggested by public comments.
• These functions are Know, Prevent, Detect, Respond, and Recover.
A practical example
Data structure
From https://en.wikipedia.org/wiki/K-anonymity
• This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes there are always at least 2 rows with those exact attributes.
• The attributes available to an adversary are called "quasi-identifiers". Each "quasi-identifier" tuple occurs in at least k records for a dataset with k-anonymity.
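The definition translates directly into a small check: group the rows by their quasi-identifier tuple and take the smallest group. The rows below are made up for illustration and are not the table shown on the slide; `k_anonymity` is a hypothetical helper name.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing one quasi-identifier tuple."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Made-up rows for illustration
rows = [
    {"age": "20-30", "gender": "Female", "state": "Kerala", "disease": "Cancer"},
    {"age": "20-30", "gender": "Female", "state": "Kerala", "disease": "TB"},
    {"age": "20-30", "gender": "Male", "state": "Kerala", "disease": "Flu"},
    {"age": "20-30", "gender": "Male", "state": "Kerala", "disease": "Flu"},
]
k = k_anonymity(rows, ["age", "gender", "state"])
```

Adding an attribute to the quasi-identifier list can only keep k the same or lower it, since groups are split further.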
Data structure
From https://en.wikipedia.org/wiki/K-anonymity
Our dataset in neo4j
Achieving the desired K-anonymity
• There are two common methods for achieving k-anonymity for some value of k:
– Suppression: in our example we remove name
– Generalisation: in our example age values can be substituted with a range
• But the k-anonymity level of a given subset of data selected by a query depends on two factors:
– the Obfuscation created by Suppression and Generalisation of some attributes in the original dataset
– the Segmentation of the query result
Segmentation
• Suppose we submit a query which specifies the gender and a range for the age:
MATCH (s:User), (d:Domicile), (r:Religion), (e:Disease),
      (s)-[q2:REL]->(r), (s)-[q1:REL]->(d), (s)-[q3:REL]->(e)
WHERE toInt(s.age) < 25 AND s.gender = "Female"
RETURN (s)-[]-();
• The result has 2-anonymity w.r.t. Domicile and 1-anonymity w.r.t. Religion and Disease: guessing the value of the latter attributes will identify the patient.
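The effect of segmentation can be reproduced on a toy dataset: an attribute that is shared by at least two rows in the full data may become unique inside a query result. The patients below are invented for the illustration and do not reproduce the neo4j dataset; `anonymity_wrt` is a hypothetical helper.

```python
from collections import Counter


def anonymity_wrt(rows, attribute):
    """Smallest number of rows sharing one value of the attribute."""
    counts = Counter(row[attribute] for row in rows)
    return min(counts.values()) if counts else 0


# Made-up patients (assumed data)
patients = [
    {"age": 22, "gender": "Female", "domicile": "Milan", "disease": "Flu"},
    {"age": 24, "gender": "Female", "domicile": "Milan", "disease": "Asthma"},
    {"age": 41, "gender": "Female", "domicile": "Crema", "disease": "Flu"},
    {"age": 37, "gender": "Male", "domicile": "Crema", "disease": "Asthma"},
]

# Same kind of selection as the query: gender plus an age range
segment = [p for p in patients if p["age"] < 25 and p["gender"] == "Female"]

whole_k = anonymity_wrt(patients, "disease")
segment_k_domicile = anonymity_wrt(segment, "domicile")
segment_k_disease = anonymity_wrt(segment, "disease")
```

In the full toy dataset every disease value is shared by two patients, but inside the segment each disease value points to exactly one patient.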
Problem
• In Big Data storage, a malicious user extracting/inspecting a region = selecting a subset of data
• Segmentation of Big Data regions is difficult to control
• Inferences are possible
A possible countermeasure: Redundant Relations
• By adding redundant relations we can limit the effect of Segmentation
MATCH (s:User), (e:Disease)
WITH COLLECT(e) AS Disease, s
FOREACH (e2 in Disease |
  CREATE (s)-[q3:REL {context: "4321"}]->(e2))
RETURN (s)-[]-();
Secret
• The secret is a contextualization index that countersigns the relationships that originated from the true data source and were not added for redundancy. In our example:
MATCH (s:User), (d:Domicile), (r:Religion), (e:Disease),
      (s)-[q2:REL]->(r), (s)-[q1:REL]->(d), (s)-[q3:REL {context: "1234"}]->(e)
WHERE toInt(s.age) < 25 AND s.gender = "Female"
RETURN (s)-[q3]-(e), (s)-[q1]-(d), (s)-[q2]-(r);
Not a panacea w.r.t. distribution checks
• An attacker can study the distributions among the contextual relationships
MATCH (s:User)-[q:REL {context: "1234"}]->(e:Disease)
RETURN id(s), Count(e) AS Relationships;
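The distribution check can be simulated without a graph database. In the sketch below the edge list is invented: context "1234" marks genuine relations and "4321" the redundant ones, mirroring the values used in the queries, and every user is assumed to have one genuine disease.

```python
from collections import Counter

# Hypothetical edges: (user, context, disease)
diseases = ["Flu", "Asthma", "Diabetes"]
edges = [("u1", "1234", "Flu"), ("u2", "1234", "Asthma")]
edges += [(u, "4321", d) for u in ("u1", "u2") for d in diseases]


def relationships_per_user(edges, context):
    """Count, for one context value, how many diseases each user points to."""
    return Counter(user for user, ctx, _ in edges if ctx == context)


by_context = {ctx: relationships_per_user(edges, ctx) for ctx in ("1234", "4321")}
```

An attacker who sees that one context yields a single relation per user while the other links every user to every disease can guess which context is the genuine one, without knowing the secret in advance.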
Hashing
• All relationships are marked with the same context
• A hash index is created over the triple: (s)-[REL]-(e)
• The hash function is the secret
– Given (s) and (e) nodes we know if the relation is in the original dataset
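A sketch of the idea with one substitution stated plainly: instead of keeping the hash function itself secret, as the slide puts it, the code uses a public keyed hash (HMAC-SHA256 from the Python standard library) with a secret key. The key, node names and helper names are hypothetical.

```python
import hashlib
import hmac


def relation_tag(secret, source, rel_type, target):
    """Keyed hash over the (s)-[REL]-(e) triple; HMAC-SHA256 is an assumed choice."""
    message = "\x1f".join([source, rel_type, target]).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def build_index(secret, genuine_relations):
    """Hash index holding a tag for every relation from the true data source."""
    return {relation_tag(secret, s, r, e) for s, r, e in genuine_relations}


def is_genuine(secret, index, source, rel_type, target):
    """Given (s) and (e), tell whether the relation is in the original dataset."""
    return relation_tag(secret, source, rel_type, target) in index


secret = b"demo-secret"  # hypothetical key; a real one must stay outside the store
index = build_index(secret, [("user1", "REL", "Flu"), ("user2", "REL", "Asthma")])
```

Since all stored relationships carry the same context, counting them per context reveals nothing; only a holder of the key can recompute a tag and look it up in the index.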
Technology cannot do it alone (1)
• The opportunistic one-shot attacks typical of the early days of Big Data have been supplemented by leakages that are more persistent and, in some cases, more worrisome.
• We need to start designing Big Data systems not just to prevent attacks and recover from them, but also to detect successful attackers quickly and contain them so that any data leakage can be identified and countered.
References
[1] Hesman Saey, T., "Big Data, Big Challenges", Science News, February 7, 2015
[2] Chi, Guangqing, Jeremy R. Porter, Arthur G. Cosby, and David Levinson. 2013. "The Impact of Gasoline Price Changes on Traffic Safety: A Time Geography Explanation." Journal of Transport Geography 28 (1): 1–11.
[3] Bellandi V., Cimato S., Damiani E., Gianini G. and Zilli, A. "Towards Economics-Aware Risk Assessment on The Cloud", IEEE Security and Privacy, to appear in November 2015
[4] Demirkan, H., & Delen, D. (2013). Leveraging the capabilities of service-oriented decision support systems: Putting analytics and big data in cloud. Decision Support Systems, 55(1), 412-421.
[5] Damiani, E., Oliboni, B., & Tanca, L. (2001). Fuzzy techniques for XML data smushing. In Computational Intelligence. Theory and Applications (pp. 637-652). Springer Berlin Heidelberg.
[6] Damiani, E., Cimato, S., & Gianini, G. (2014). "A risk model for cloud processes". The ISC International Journal of Information Security, 6(2), 99-123.
[7] Bellandi, V., Cimato, S., Damiani, E., & Gianini, G. (2015). "Possibilistic assessment of process-related disclosure risks in the cloud". In W. Pedrycz et al., eds., Computational Intelligence and Quantitative Software Engineering. Springer-Verlag, 2014
[8] Chen, M., Mao, S., Zhang, Y., & Leung, V. C. (2014). Big data storage. In Big Data (pp. 33-49). Springer International Publishing.
[9] Forbes, "Big Data Breaches of 2014", available at http://www.forbes.com/sites/moneybuilder/2015/01/13/the-big-data-breaches-of-2014/, 2015.
[10] B. Biggio, B. Nelson, P. Laskov, "Poisoning Attacks against Support Vector Machines", Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012
[11] E. Damiani, Toward Big Data Leak Analysis, Proceedings of IEEE PSBD 2015, San Josè, CA, 2015
[12] Claudio Agostino Ardagna, Rasool Asal, Ernesto Damiani, Quang Hieu Vu: On the Management of Cloud Non-Functional Properties: The Cloud Transparency Toolkit. NTMS 2014: 1-4
[13] Santosh Aditham, Nagarajan Ranganathan, A Novel Framework for Mitigating Insider Attacks in Big Data Systems, Proceedings of IEEE PSBD 2015, San Josè, CA, 2015
[14] Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama, The ϕ Accrual Failure Detector, JSTIS-RR-2004-010