Big Data Learning in Prac - KU Leuven KULAK · Big Data: Technology and Chronology 2001-2010...
Big Data Learning in Practice
12th September 2016
Isaac Triguero
School of Computer Science
University of Nottingham, United Kingdom
Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/benelearn.html
Outline
• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
What is Big Data?
There is no standard definition!
“Big Data” involves data whose volume, diversity and complexity require new techniques, algorithms and analyses to extract valuable (hidden) knowledge.
Data-intensive applications
What is Big Data? The 5 V's definition
Big data has many faces
How to deal with data-intensive applications?
• Problem statement: scalability to big data sets.
• Example:
– Exploring 100 TB with 1 node @ 50 MB/sec = 23 days
– Exploration with a cluster of 1000 nodes = 33 minutes
• Solution → Divide-and-conquer
What happens if we have to manage 1000 or 10000 TB?
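The figures in the example follow from simple throughput arithmetic. The sketch below is a back-of-the-envelope check, assuming decimal units (1 TB = 10^6 MB), an even split of the data across nodes, and no coordination overhead; the function name `scan_seconds` is purely illustrative.

```python
def scan_seconds(total_tb: float, nodes: int, mb_per_sec: float = 50.0) -> float:
    """Time to read `total_tb` terabytes when the data is split evenly across `nodes`."""
    total_mb = total_tb * 1_000_000  # assumption: 1 TB = 10^6 MB (decimal units)
    return total_mb / (nodes * mb_per_sec)


one_node_days = scan_seconds(100, nodes=1) / 86_400   # 86,400 seconds per day
cluster_minutes = scan_seconds(100, nodes=1000) / 60

print(f"1 node:     {one_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} minutes")
```

Under these assumptions the sketch gives roughly 23 days and 33 minutes. Calling `scan_seconds(1000, nodes=1000)` for the 1000 TB case gives 20,000 seconds, about 5.6 hours, on the same cluster.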
MapReduce
• Parallel programming model
• Divide & conquer strategy
– divide: partition the dataset into smaller, independent chunks to be processed in parallel (map)
– conquer: combine, merge or otherwise aggregate the results from the previous step (reduce)
• Based on simplicity and transparency to the programmers, and assumes data locality.
• Became popular thanks to the open-source project Hadoop! (Used by Google, Facebook, Amazon, …)
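The divide and conquer steps can be sketched on a single machine. This is an illustration of the idea only, not Hadoop code: the helper names (`partition`, `map_chunk`, `combine`) are hypothetical, and the chunks are processed in a plain loop where a cluster would process them in parallel.

```python
from functools import reduce


def partition(data, n_chunks):
    """Divide: split `data` into up to `n_chunks` contiguous, independent chunks."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_chunk(chunk):
    """Map: compute a partial result (count, sum) using one chunk only."""
    return (len(chunk), sum(chunk))


def combine(a, b):
    """Reduce: merge two partial results into one."""
    return (a[0] + b[0], a[1] + b[1])


data = list(range(1, 101))                   # toy dataset: 1..100
chunks = partition(data, n_chunks=4)         # divide
partials = [map_chunk(c) for c in chunks]    # on a cluster, one task per chunk
count, total = reduce(combine, partials)     # conquer
mean = total / count
print(len(chunks), count, total, mean)
```

The mean is computed from (count, sum) pairs rather than from per-chunk means, because partial results must be mergeable in any order for the reduce step to be correct.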
Traditional HPC way of doing things
[Diagram: many worker nodes, each running its own OS, linked by a communication network (Infiniband) and connected to central storage through a separate network for I/O.]
• Input data: relatively small
• Limited I/O
• Lots of computations
• Lots of communication
Source: Jan Fostier. Introduction to MapReduce and its Application to Post-Sequencing Analysis
Data-intensive jobs
[Diagram: the same HPC cluster, with worker nodes on a fast communication network (Infiniband) pulling data blocks a to j from central storage over the network for I/O.]
• Input data: lots of it
• Low compute intensity
• Limited communication
• Lots of I/O: doesn't scale
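Data-intensive jobs
[Diagram: data blocks a to j are spread, with replicas, over the local disks of the worker nodes, which are linked by a communication network.]
• Input data: lots of it
• Low compute intensity
• Limited communication
Solution: store data on the local disks of the nodes that perform computations on that data (“data locality”)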
Hadoop
http://hadoop.apache.org/
• Hadoop is:
– An open-source framework written in Java
– Distributed storage of very large data sets (Big Data)
– Distributed processing of very large data sets
• This framework consists of a number of modules
– Hadoop Common
– Hadoop Distributed File System (HDFS)
– Hadoop YARN – resource manager
– Hadoop MapReduce – programming model
Hadoop MapReduce: Main Characteristics
• Automatic parallelization:
– Depending on the size of the input data → there will be multiple MAP tasks!
– Depending on the number of keys <k, value> → there will be multiple REDUCE tasks!
• Scalability:
– It may work over every data center or cluster of computers.
• Transparent for the programmer
– Fault-tolerant mechanism.
– Automatic communications among computers
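Data Sharing in Hadoop MapReduce
[Diagram: iterative jobs (iter. 1, iter. 2, ...) read their input from HDFS and write their output back to HDFS at every iteration; interactive queries (query 1, query 2, query 3, ...) each read the input from HDFS again to produce result 1, result 2, result 3, ...]
Slow due to replication, serialization, and disk I/O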
Paradigms that do not fit with Hadoop MapReduce
• Directed Acyclic Graph (DAG) model:
– The DAG defines the dataflow of the application, and the vertices of the graph define the operations on the data.
• Graph model:
– More complex graph models that better represent the dataflow of the application.
– Cyclic models → iteration.
• Iterative MapReduce model:
– An extended programming model that supports iterative MapReduce computations efficiently.
New platforms to overcome Hadoop's limitations
• GIRAPH (Apache project), http://giraph.apache.org/ (iterative graph processing)
• GPS - A Graph Processing System (Stanford), http://infolab.stanford.edu/gps/ (Amazon's EC2)
• Distributed GraphLab (Carnegie Mellon Univ.), https://github.com/graphlab-code/graphlab (Amazon's EC2)
• HaLoop (University of Washington), http://clue.cs.washington.edu/node/14, http://code.google.com/p/haloop/ (Amazon's EC2)
• Twister (Indiana University), http://www.iterativemapreduce.org/ (private clusters)
• PrIter (University of Massachusetts Amherst, Northeastern University-China), http://code.google.com/p/priter/ (private cluster and Amazon EC2 cloud)
• GPU-based platforms: Mars, Grex
• Spark (UC Berkeley), http://spark.incubator.apache.org/research.html
Big data technologies
What is Spark?
Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop
Efficient
• General execution graphs
• In-memory storage
Usable
• Rich APIs in Java, Scala, Python
• Interactive shell
2-5× less code
Up to 10× faster on disk, 100× in memory
Spark Goal
• Provide distributed memory abstractions for clusters to support apps with working sets
• Retain the attractive properties of MapReduce:
– Fault tolerance (for crashes & stragglers)
– Data locality
– Scalability
Initial solution: augment the dataflow model with “resilient distributed datasets” (RDDs)
RDDs in Detail
• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• There are two ways to create RDDs:
– Parallelizing an existing collection in your driver program
– Referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase.
• Can be cached for future reuse
Operations with RDDs
• Transformations (e.g. map, filter, groupBy, join)
– Lazy operations to build RDDs from other RDDs
• Actions (e.g. count, collect, save)
– Return a result or write it to storage
Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Parallel operations (return a result to driver): reduce, collect, count, save, lookupKey, …
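The difference between lazy transformations and actions can be illustrated with a toy, single-machine stand-in. The class `LazyCollection` below is hypothetical and is not Spark's API; it only mimics the behaviour described above: transformations record what to do, and nothing runs until an action is called.

```python
class LazyCollection:
    """Toy, single-machine stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source   # original data, never modified
        self._steps = steps     # recorded transformations, not yet applied

    # Transformations: return a new LazyCollection and compute nothing.
    def map(self, fn):
        return LazyCollection(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyCollection(self._source, self._steps + (("filter", pred),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger evaluation of the recorded steps.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)   # side effect only to show when work actually happens
    return x * x


numbers = LazyCollection(range(1, 6))
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # 0: no transformation has run yet

result = evens_squared.collect()   # action: the pipeline runs now
n = evens_squared.count()          # second action: the pipeline runs again
print(calls_before_action, result, n, len(calls))
```

In this toy version every action recomputes the whole pipeline (`square` is called ten times for two actions over five elements), which is the kind of repeated work that caching an RDD for reuse is meant to avoid.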
Spark vs. Hadoop
K-Means: iteration time (s) by number of machines [Zaharia et al., NSDI'12]
Number of machines     25     50    100
Hadoop                274    157    106
HadoopBinMem          197    121     87
Spark                 143     61     33
Lines of code for K-Means
Spark ~ 90 lines
Hadoop ~ 4 files, > 300 lines
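The relative speed-up can be worked out directly from the K-Means iteration times in the chart. The sketch below assumes the bar labels map to the three systems in legend order (Hadoop, HadoopBinMem, Spark); the `speedup` helper is illustrative only.

```python
# Iteration time in seconds, keyed by number of machines (values read off the chart).
iteration_time_s = {
    "Hadoop":       {25: 274, 50: 157, 100: 106},
    "HadoopBinMem": {25: 197, 50: 121, 100: 87},
    "Spark":        {25: 143, 50: 61, 100: 33},
}


def speedup(baseline: str, system: str, machines: int) -> float:
    """How many times faster `system` is than `baseline` per iteration."""
    return iteration_time_s[baseline][machines] / iteration_time_s[system][machines]


spark_vs_hadoop = {m: round(speedup("Hadoop", "Spark", m), 1) for m in (25, 50, 100)}
print(spark_vs_hadoop)
```

On these numbers Spark's per-iteration advantage over Hadoop grows from about 1.9× on 25 machines to about 3.2× on 100 machines.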
Apache Spark – new collections
DataFrame (Spark 1.3+)
- Equivalent to a table in a relational database (data frame in R/Python)
- Avoids the Java serialization performed by RDDs.
- API natural for developers who are familiar with building query plans (e.g. SQL expressions).
Datasets (Spark 1.6+)
- Best of both DataFrame and RDDs.
- Functional transformations (map, flatMap, filter, etc.)
- Spark SQL's optimised execution engine.
Flink
https://flink.apache.org/
Big Data: Technology and Chronology
2001-2010
• 2001: 3 V's, Gartner (Doug Laney)
• 2004: MapReduce, Google (Jeffrey Dean)
• 2008: Hadoop, Yahoo! (Doug Cutting)
2010-2016
• 2010: Spark, UC Berkeley (Matei Zaharia); Apache Spark, Feb. 2014
• 2009-2013: Flink, TU Berlin (Volker Markl); Apache Flink, Dec. 2014
• 2010-2016: Big Data analytics: Mahout, MLlib, …; Hadoop ecosystem; applications; new technology
Big Data Analytics
Potential scenarios:
• Clustering
• Recommendation systems
• Classification
• Association
• Real-time analytics / Big Data streams
• Social media mining / Social Big Data
Big Data Analytics: A 3 generational view
Mahout (Samsara)
http://mahout.apache.org/
• First ML library, initially based on Hadoop MapReduce.
• Abandoned MapReduce implementations from version 0.9.
• Nowadays it is focused on a new math environment called Samsara.
• It is integrated with Spark, Flink and H2O
• Main algorithms:
– Stochastic Singular Value Decomposition (ssvd, dssvd)
– Stochastic Principal Component Analysis (spca, dspca)
– Distributed Cholesky QR (thinQR)
– Distributed regularized Alternating Least Squares (dals)
– Collaborative Filtering: Item and Row Similarity
– Naive Bayes Classification
Spark Libraries
https://spark.apache.org/mllib/
As of Spark 2.0
FlinkML
https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/ml/
Demo
• In this demo I will show two ways of working with Apache Spark:
– Interactive mode with Spark Notebook.
– Standalone mode with Scala IDE.
• All the code used in this presentation is available at:
http://www.cs.nott.ac.uk/~pszit/benelearn.html
Demo with Spark Notebook in local mode
http://spark-notebook.io/
Advantages:
• Interactive.
• Automatic plots.
• It allows connection with a cluster.
• Tab completion
Disadvantages:
• Built-in for specific Spark versions.
• Difficult to integrate your own code.
Demo with Scala IDE
http://scala-ide.org/
Example: An Imbalanced Big Data problem
• Two main approaches to tackle this problem:
– Data sampling:
  • Undersampling
  • Oversampling
  • Hybrid approaches
– Algorithmic modifications
I. Triguero et al., Evolutionary Undersampling for Extremely Imbalanced Big Data Classification under Apache Spark. IEEE Congress on Evolutionary Computation (CEC 2016), Vancouver (Canada), 640-647, July 24-29.
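As a concrete instance of the data sampling approach, the sketch below shows plain random undersampling on a single machine. It is a swapped-in baseline for illustration, not the evolutionary undersampling of the cited paper and not the demo's Spark code; the function name `random_undersample` and the toy data are assumptions.

```python
import random
from collections import Counter


def random_undersample(rows, label_of, seed=0):
    """Keep every minority-class row and an equally sized random sample of each other class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_class.values())  # minority class size
    balanced = []
    for group in by_class.values():
        balanced.extend(group if len(group) == target else rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced data set of (features, label) pairs: 95 negatives, 5 positives.
rows = [((i,), 0) for i in range(95)] + [((i,), 1) for i in range(5)]
balanced = random_undersample(rows, label_of=lambda r: r[1])
counts = Counter(label for _, label in balanced)
print(counts[0], counts[1])
```

Undersampling discards majority-class examples, so it trades information for balance; oversampling and hybrid approaches make the opposite trade.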
Imbalanced Big Data Classification with Spark
Run examples from Scala IDE
Run examples from terminal
$ mvn package -Dmaven.test.skip=true
$ /opt/spark/bin/spark-submit --master local[*] \
    --class Undersampling.UndersamplingExample target/EUS-0.0.1-BETA.jar \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25.header \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tra100000.data \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tst10000.data \
    4 4 RUS DecisionTree /Users/pszit/outputRUS-DecisionTree
Conclusions
• We need new strategies to perform ML on big data sets
– Choosing the right technology is like choosing the right data structure in a program.
• The world of big data is rapidly changing. Being up-to-date is difficult but necessary.
• Interactive notebooks are very useful for a quick start and standard experiments.
Acknowledgments
Thank you
Big Data Learning in Practice
12th September 2016
Isaac Triguero
School of Computer Science
University of Nottingham, United Kingdom
Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/
Extra slides
What is Big Data? The 5 V's definition
Volume: data at rest
• Vast amounts of data generated every second
• Data sets are becoming too large to store using traditional database technology
• Big data technology stores these data sets using distributed systems
What is Big Data? The 5 V's definition
Velocity: data in motion
• Speed at which:
– Data is generated
– Data needs to be analyzed.
• Continuous data streams are being captured (e.g. from sensors or mobile devices) and produced
• Late decisions imply missed opportunities
What is Big Data? The 5 V's definition
Variety: data in many forms
• One application may generate many different kinds of data
• Several formats and structures:
– Structured data:
  • Tables, relational databases
– Unstructured data:
  • Text, images, audio, video.
What is Big Data? The 5 V's definition
Veracity: data in doubt
• Uncertainty about the quality of the data.
– E.g. natural language processing on social media: typos, abbreviations, colloquial speech.
• Data may be missing, ambiguous, or even completely wrong.
What is Big Data? The 5 V's definition
Value: data in use
• Most important motivation for big data
• Big data may result in:
– Better statistics/models
– Novel insights
– New opportunities for research and industry
Big Data: applications
• Science and research:
– E.g. physics, bioinformatics, astronomy.
• Healthcare and public health:
– Better personalized medicine
• Business and e-commerce
– Personalized advertisement.
• Financial services
– Insurance, banks.
MapReduce
• Based on functional programming (e.g. Lisp)
– Operates on <key, value> pairs
  • Web-based example: key = URL; value = web page
  • Graph-based example: key = nodes; value = adjacency list
– Users specify two functions:
  map: (k1, v1) → list[k2, v2]
  reduce: (k2, list[v2]) → list[k3, v3]
– Sorting of intermediate keys between the map and reduce phases
MapReduce
The dataflow in MapReduce is transparent to the programmers
WordCount using MapReduce
Input file (two splits):
– Split 1: Hello World Bye World
– Split 2: Hello Hadoop Goodbye Hadoop
Map (key-value splitting):
– Split 1: (Hello, 1) (World, 1) (Bye, 1) (World, 1)
– Split 2: (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)
Sort and shuffle:
– Bye, {1}
– World, {1, 1}
– Hello, {1, 1}
– Hadoop, {1, 1}
– Goodbye, {1}
Reduce (key-value pairs) and output:
– Hello, 2; World, 2; Bye, 1
– Hadoop, 2; Goodbye, 1
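The WordCount dataflow can be reproduced with a small single-process sketch. This illustrates the map, shuffle-and-sort, and reduce phases only; it is not Hadoop's API. The names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, chosen to mirror the signatures map: (k1, v1) → list[k2, v2] and reduce: (k2, list[v2]) → list[k3, v3].

```python
from collections import defaultdict


def map_fn(_key, line):
    """map: (k1, v1) -> list[(k2, v2)]; emits (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def reduce_fn(word, counts):
    """reduce: (k2, list[v2]) -> list[(k3, v3)]; sums the counts of one word."""
    return [(word, sum(counts))]


def run_mapreduce(records, mapper, reducer):
    # Map phase: every input record is processed independently.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))
    # Shuffle and sort: group the values by intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one call per key, in sorted key order.
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output


splits = [(0, "Hello World Bye World"), (1, "Hello Hadoop Goodbye Hadoop")]
word_counts = run_mapreduce(splits, map_fn, reduce_fn)
print(word_counts)
```

The programmer writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` stands in for the part of the dataflow that the framework handles transparently.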
Machine learning for Big Data
• Data mining techniques have proven to be very useful tools for extracting new valuable knowledge from data.
• The knowledge extraction process from big data has become a very difficult task for most of the classical and advanced data mining tools.
• The main challenges are to deal with:
– The increasing scale of data
  • at the level of instances
  • at the level of features
– The complexity of the problem.
– And many other points
MLlib: Spark machine learning library
https://spark.apache.org/docs/latest/mllib-guide.html
• MLlib (2010): is a Spark implementation of some common machine learning functionality, as well as associated tests and data generators.
• Includes:
– Binary classification (SVMs and Logistic Regression)
– Random Forest
– Regression (Lasso, Ridge, etc.)
– Clustering (K-Means)
– Collaborative Filtering
– Gradient Descent Optimization Primitive