Big Data Learning in Prac - KU Leuven KULAK · Big Data: Technology and Chronology 2001-2010...

Big Data Learning in Practice

12th September 2016

Isaac Triguero
School of Computer Science

University of Nottingham, United Kingdom

Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/benelearn.html

Outline

•  What is Big Data?
•  How to deal with data-intensive applications?
•  Big Data Analytics
•  A demo with MLlib
•  Conclusions

What is Big Data?

There is no standard definition!

"Big Data" involves data whose volume, diversity and complexity require new techniques, algorithms and analyses to extract valuable (hidden) knowledge.

Data-intensive applications

What is Big Data? The 5 V's definition

Big data has many faces


How to deal with data-intensive applications?

•  Problem statement: scalability to big data sets.
•  Example:
– Exploring 100 TB with 1 node @ 50 MB/sec = 23 days
– Exploration with a cluster of 1000 nodes = 33 minutes
•  Solution → Divide-and-Conquer

What happens if we have to manage 1000 or 10000 TB?
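The two figures in the example can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even split of the data across nodes; the function name `scan_seconds` is only illustrative.

```python
TB = 10**12  # bytes, decimal units (an assumption; the slide does not say)
MB = 10**6


def scan_seconds(data_bytes, nodes, bytes_per_sec_per_node):
    """Time to read the data once if it is split evenly across the nodes."""
    return data_bytes / (nodes * bytes_per_sec_per_node)


one_node = scan_seconds(100 * TB, 1, 50 * MB)
cluster = scan_seconds(100 * TB, 1000, 50 * MB)

print(f"1 node:     {one_node / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Under these assumptions, scanning 1000 TB with the same 1000 nodes would take ten times as long, which is why the number of nodes has to grow with the data.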

MapReduce

•  Parallel programming model
•  Divide & conquer strategy
§  divide: partition the dataset into smaller, independent chunks to be processed in parallel (map)
§  conquer: combine, merge or otherwise aggregate the results from the previous step (reduce)
•  Based on simplicity and transparency to the programmers, and assumes data locality.
•  Became popular thanks to the open-source project Hadoop! (Used by Google, Facebook, Amazon, …)
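The divide and conquer steps can be sketched on a single machine as follows. This is only an illustration of the idea, not Hadoop code: the chunks are processed in a plain loop rather than in parallel, and the helper names (`partition`, `map_chunk`, `combine`) are made up for this example.

```python
from functools import reduce


def partition(data, n_chunks):
    """Divide: split data into at most n_chunks contiguous, independent chunks."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_chunk(chunk):
    """Work done independently on each chunk (here: a partial sum and count)."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Conquer: merge two partial results into one."""
    return (a[0] + b[0], a[1] + b[1])


data = list(range(1, 101))
partials = [map_chunk(c) for c in partition(data, 4)]  # each call could run on its own node
total, count = reduce(combine, partials)
mean = total / count
```

Because `combine` is associative, the partial results can be merged in any order, which is what lets the map step run without coordination between chunks.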

Traditional HPC way of doing things

[Figure: many worker nodes (lots of computation) connected by an InfiniBand communication network (lots of communication), with a separate network for I/O to central storage. The input data is relatively small, so I/O is limited.]

Source: Jan Fostier. Introduction to MapReduce and its Application to Post-Sequencing Analysis

Data-intensive jobs

[Figure: the same HPC architecture running a job with low compute intensity and limited communication. The input data (lots of it, blocks a to j) sits in central storage, so the job needs lots of I/O over the I/O network, while the fast InfiniBand communication network is barely used. This doesn't scale.]

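Data-intensive jobs

[Figure: worker nodes with low compute intensity and limited communication over the communication network. The input data (lots of it) is split into blocks a to j, and each node keeps a few blocks on its local disk: (e, j, b, c), (g, j, a, c), (h, i, b, e), (g, i, d, f), (f, h, a, d).]

Solution: store data on the local disks of the nodes that perform computations on that data ("data locality")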
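A small sketch can make data locality concrete. It assumes the block layout in the figure reads as five nodes holding four blocks each, with every block a to j stored on two nodes. The greedy rule used here (give each block's task to the least-loaded node that already stores it) is only an illustration, not Hadoop's actual scheduler.

```python
# Block placement as read from the figure (an assumption about its layout).
placement = {
    "node1": ["e", "j", "b", "c"],
    "node2": ["g", "j", "a", "c"],
    "node3": ["h", "i", "b", "e"],
    "node4": ["g", "i", "d", "f"],
    "node5": ["f", "h", "a", "d"],
}


def assign_local_tasks(placement):
    """Assign one task per block to the least-loaded node that stores a replica."""
    replicas = {}
    for node, blocks in placement.items():
        for block in blocks:
            replicas.setdefault(block, []).append(node)
    load = {node: 0 for node in placement}
    assignment = {}
    for block in sorted(replicas):
        node = min(replicas[block], key=lambda n: load[n])
        assignment[block] = node
        load[node] += 1
    return assignment, load


assignment, load = assign_local_tasks(placement)
for block, node in assignment.items():
    print(f"block {block} -> {node}")
```

Every task reads its block from the local disk of the node it runs on, so no input data has to cross the network.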

Hadoop

http://hadoop.apache.org/

•  Hadoop is:
– An open-source framework written in Java
– Distributed storage of very large data sets (Big Data)
– Distributed processing of very large data sets
•  This framework consists of a number of modules
– Hadoop Common
– Hadoop Distributed File System (HDFS)
– Hadoop YARN – resource manager
– Hadoop MapReduce – programming model

Hadoop MapReduce: Main Characteristics

•  Automatic parallelization:
– Depending on the size of the input data → there will be multiple MAP tasks!
– Depending on the number of keys <k, value> → there will be multiple REDUCE tasks!
•  Scalability:
– It may work over every data center or cluster of computers.
•  Transparent for the programmer
– Fault-tolerant mechanism.
– Automatic communications among computers
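The automatic parallelization described above can be illustrated with two small helpers. This is a sketch under stated assumptions: one map task per input split, a 128 MiB split size, a user-chosen number of reducers, and CRC32 as the hash function. All of these are illustrative choices, not values taken from the slides.

```python
import math
import zlib


def num_map_tasks(input_bytes, split_bytes=128 * 1024**2):
    """One map task per input split; 128 MiB is an assumed split size."""
    return max(1, math.ceil(input_bytes / split_bytes))


def reducer_for(key, num_reducers):
    """Hash partitioning: every occurrence of a key goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


print(num_map_tasks(1024**3))            # a 1 GiB input
for word in ["Hello", "World", "Hello"]:
    print(word, "-> reducer", reducer_for(word, 4))
```

The property that matters is that the mapping from key to reducer is deterministic, so all values for one key end up in the same reduce task.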

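Data Sharing in Hadoop MapReduce

[Figure: an iterative job (iter. 1, iter. 2, ...) reads its input from HDFS and writes its output back to HDFS at every iteration. Interactive queries (query 1, query 2, query 3, ... producing result 1, result 2, result 3, ...) each read the input from HDFS again.]

Slow due to replication, serialization, and disk IO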

Paradigms that do not fit with Hadoop MapReduce

•  Directed Acyclic Graph (DAG) model:
– The DAG defines the dataflow of the application, and the vertices of the graph define the operations on the data.
•  Graph model:
– More complex graph models that better represent the dataflow of the application.
– Cyclic models -> Iterativity.
•  Iterative MapReduce model:
– An extended programming model that supports iterative MapReduce computations efficiently.

New platforms to overcome Hadoop's limitations

•  GIRAPH (Apache project), http://giraph.apache.org/ : iterative graph processing
•  GPS - A Graph Processing System (Stanford), http://infolab.stanford.edu/gps/ : Amazon's EC2
•  Distributed GraphLab (Carnegie Mellon Univ.), https://github.com/graphlab-code/graphlab : Amazon's EC2
•  HaLoop (University of Washington), http://clue.cs.washington.edu/node/14 and http://code.google.com/p/haloop/ : Amazon's EC2
•  Twister (Indiana University), http://www.iterativemapreduce.org/ : private clusters
•  PrIter (University of Massachusetts Amherst, Northeastern University-China), http://code.google.com/p/priter/ : private cluster and Amazon EC2 cloud
•  GPU-based platforms: Mars, Grex
•  Spark (UC Berkeley), http://spark.incubator.apache.org/research.html

Big data technologies

What is Spark?

Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop

Efficient
•  General execution graphs
•  In-memory storage
Up to 10× faster on disk, 100× in memory

Usable
•  Rich APIs in Java, Scala, Python
•  Interactive shell
2-5× less code

Spark Goal

•  Provide distributed memory abstractions for clusters to support apps with working sets
•  Retain the attractive properties of MapReduce:
– Fault tolerance (for crashes & stragglers)
– Data locality
– Scalability

Initial solution: augment the data flow model with "resilient distributed datasets" (RDDs)

RDDs in Detail

•  An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
•  There are two ways to create RDDs:
– Parallelizing an existing collection in your driver program
– Referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase.
•  Can be cached for future reuse

Operations with RDDs

•  Transformations (e.g. map, filter, groupBy, join)
– Lazy operations to build RDDs from other RDDs
•  Actions (e.g. count, collect, save)
– Return a result or write it to storage

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Parallel operations (return a result to driver): reduce, collect, count, save, lookupKey, …
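The difference between lazy transformations and actions can be shown with a toy class in plain Python. This is not the Spark API: `LazyCollection` is a hypothetical single-machine stand-in that only borrows the names map, filter, collect and count to show when work actually happens.

```python
class LazyCollection:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = ops  # tuple of ("map" | "filter", function)

    def map(self, fn):
        return LazyCollection(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyCollection(self._source, self._ops + (("filter", pred),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)  # record every real invocation
    return x * x


squares = LazyCollection(range(10)).map(square).filter(lambda x: x % 2 == 0)
assert calls == []          # transformations only describe the computation
result = squares.collect()  # the action triggers it
print(result)
```

Each action in this toy recomputes the whole chain from the source, which is the cost that caching an intermediate collection is meant to avoid.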

Spark vs. Hadoop

K-Means: iteration time (s) by number of machines [Zaharia et al., NSDI'12]

Machines   Hadoop   HadoopBinMem   Spark
25         274      197            143
50         157      121            61
100        106      87             33

Lines of code for K-Means

Spark ~ 90 lines

Hadoop ~ 4 files, > 300 lines
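K-Means is a useful benchmark here because every iteration re-reads the same points, so the cost of reloading the input is paid again and again. The following is a single-machine sketch of one iteration expressed as a map step (assign each point to its closest centroid) and a reduce step (average per cluster). It is not the code behind the figures above, and the sample points and starting centroids are invented for illustration.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_step(points, centroids):
    """One iteration: map each point to a cluster, then average per cluster."""
    sums = {}
    for point in points:  # "map": emit (cluster index, point)
        k = closest(point, centroids)
        acc, n = sums.get(k, ([0.0] * len(point), 0))
        sums[k] = ([a + p for a, p in zip(acc, point)], n + 1)  # "reduce" by key
    return [
        [a / sums[k][1] for a in sums[k][0]] if k in sums else list(centroids[k])
        for k in range(len(centroids))
    ]


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [[0.0, 0.0], [10.0, 10.0]]
for _ in range(5):  # every iteration scans the same points again
    centroids = kmeans_step(points, centroids)
print(centroids)
```

In a MapReduce job the `points` would be read from HDFS in every pass of the loop, while an in-memory engine can keep them cached between iterations.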

Apache Spark – new collections

DataFrame (Spark 1.3+)
-  Equivalent to a table in a relational database (data frame in R/Python)
-  Avoids the Java serialization performed by RDDs.
-  API natural for developers who are familiar with building query plans (e.g. SQL expressions).

Datasets (Spark 1.6+)
-  Best of both DataFrame and RDDs.
-  Functional transformations (map, flatMap, filter, etc.)
-  Spark SQL's optimised execution engine.

Flink

https://flink.apache.org/

Big Data: Technology and Chronology

2001-2010
•  2001: 3V's, Gartner (Doug Laney)
•  2004: MapReduce, Google (Jeffrey Dean)
•  2008: Hadoop, Yahoo! (Doug Cutting)

2010-2016
•  2010: Spark, UC Berkeley (Matei Zaharia); Apache Spark, Feb. 2014
•  2009-2013: Flink, TU Berlin (Volker Markl); Flink Apache, Dec. 2014
•  2010-2016: Big Data Analytics: Mahout, MLlib, …; Hadoop ecosystem; applications; new technology


Big Data Analytics

Potential scenarios:
•  Clustering
•  Classification
•  Association
•  Recommendation Systems
•  Real Time Analytics / Big Data Streams
•  Social Media Mining / Social Big Data

Big Data Analytics: A 3 generational view

Mahout (Samsara)

http://mahout.apache.org/

•  First ML library, initially based on Hadoop MapReduce.
•  Abandoned MapReduce implementations from version 0.9.
•  Nowadays it is focused on a new math environment called Samsara.
•  It is integrated with Spark, Flink and H2O
•  Main algorithms:
– Stochastic Singular Value Decomposition (ssvd, dssvd)
– Stochastic Principal Component Analysis (spca, dspca)
– Distributed Cholesky QR (thinQR)
– Distributed regularized Alternating Least Squares (dals)
– Collaborative Filtering: Item and Row Similarity
– Naive Bayes Classification

Spark Libraries

https://spark.apache.org/mllib/

As of Spark 2.0

FlinkML

https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/ml/


Demo

•  In this demo I will show two ways of working with Apache Spark:
– Interactive mode with Spark Notebook.
– Standalone mode with Scala IDE.
•  All the code used in this presentation is available at:

http://www.cs.nott.ac.uk/~pszit/benelearn.html

DEMO with Spark Notebook in local

http://spark-notebook.io/

Advantages:
•  Interactive.
•  Automatic plots.
•  It allows connection with a cluster.
•  Tab completion

Disadvantages:
•  Built for specific Spark versions.
•  Difficult to integrate your own code.

DEMO with Scala IDE

http://scala-ide.org/

Example: An Imbalanced Big Data problem

•  Two main approaches to tackle this problem:
– Data sampling:
§  Undersampling
§  Oversampling
§  Hybrid approaches
– Algorithmic modifications

I. Triguero et al., Evolutionary Undersampling for Extremely Imbalanced Big Data Classification under Apache Spark. IEEE Congress on Evolutionary Computation (CEC 2016), Vancouver (Canada), 640-647, July 24-29.
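Of the data sampling approaches, undersampling is the simplest to sketch. The example below is plain random undersampling (the talk's spark-submit example takes a RUS option), written for a single machine. It is not the evolutionary undersampling of the cited paper, and the function name, the toy rows and the seed are assumptions made for illustration.

```python
import random


def random_undersample(rows, label_of, seed=0):
    """Keep the smallest class whole and an equally sized random subset of the others."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(group if len(group) == target else rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# 95 negative rows against 5 positive ones.
rows = [(i, "neg") for i in range(95)] + [(i, "pos") for i in range(5)]
balanced = random_undersample(rows, label_of=lambda r: r[1])
print(len(balanced), "rows after undersampling")
```

The balanced set can then be passed to any standard classifier, at the price of discarding most of the majority class.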

Imbalanced Big Data Classification with Spark

Run examples from Scala IDE

Run examples from terminal

$ mvn package -Dmaven.test.skip=true

$ /opt/spark/bin/spark-submit --master local[*] \
    --class Undersampling.UndersamplingExample \
    target/EUS-0.0.1-BETA.jar \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25.header \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tra100000.data \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tst10000.data \
    4 4 RUS DecisionTree /Users/pszit/outputRUS-DecisionTree


Conclusions

•  We need new strategies to perform ML on big data sets
– Choosing the right technology is like choosing the right data structure in a program.
•  The world of big data is rapidly changing. Being up-to-date is difficult but necessary.
•  Interactive notebooks are very useful for a quick start and standard experiments.

Acknowledgments

Thank you

Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/

Extra slides

Volume: data at rest

•  Vast amounts of data generated every second
•  Data sets are becoming too large to store using traditional database technology
•  Big data technology stores these data sets using distributed systems

Velocity: data in motion

•  Speed at which:
– Data is generated
– Data needs to be analyzed.
•  Continuous data streams are being captured (e.g. from sensors or mobile devices) and produced
•  Late decisions imply missed opportunities

Variety: data in many forms

•  One application may generate many different kinds of data
•  Several formats and structures:
– Structured data:
§  Tables, relational databases
– Unstructured data:
§  Text, images, audio, video.

Veracity: data in doubt

•  Uncertainty about the quality of the data.
– E.g. natural language processing on social media: typos, abbreviations, colloquial speech.
•  Data may be missing, ambiguous, or even completely wrong.

Value: data in use

•  Most important motivation for big data
•  Big data may result in:
– Better statistics/models
– Novel insights
– New opportunities for research and industry

Big Data: applications

•  Science and research:
– E.g. physics, bioinformatics, astronomy.
•  Healthcare and public health:
– Better personalized medicine
•  Business and e-commerce
– Personalized advertisement.
•  Financial services
– Insurance, banks.

MapReduce

•  Based on functional programming (e.g. Lisp)
– Operates on <key, value> pairs
§  Web-based example: key = URL; value = web page
§  Graph-based example: key = nodes; value = adjacency list
– Users specify two functions:
map: (k1, v1) → list[k2, v2]
reduce: (k2, list[v2]) → list[k3, v3]
– Sorting of intermediate keys between the map and reduce phases
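The two signatures can be turned into a tiny single-machine driver to show how the pieces fit together. This is a sketch, not Hadoop's implementation: `run_mapreduce` is a hypothetical helper, and the three-node graph and the in-degree computation are an invented instance of the graph-based example (key = node, value = adjacency list).

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """records: iterable of (k1, v1); mapper yields (k2, v2);
    reducer(k2, [v2, ...]) yields (k3, v3)."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # intermediate keys are sorted before reducing
        output.extend(reducer(k2, groups[k2]))
    return output


# Graph-based example: key = node, value = adjacency list; compute in-degrees.
graph = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]


def mapper(node, neighbours):
    for target in neighbours:
        yield (target, 1)


def reducer(node, ones):
    yield (node, sum(ones))


in_degrees = run_mapreduce(graph, mapper, reducer)
print(in_degrees)
```

The user only writes `mapper` and `reducer`; grouping and sorting by intermediate key happen inside the driver, which is the part a real framework distributes across machines.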

MapReduce

The dataflow in MapReduce is transparent to the programmers

WordCount using MapReduce

Input file:
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop

Splitting and Map (key, value):
  Split 1 (Hello World Bye World): (Hello, 1) (World, 1) (Bye, 1) (World, 1)
  Split 2 (Hello Hadoop Goodbye Hadoop): (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)

Sort and Shuffle:
  Bye, {1}
  World, {1, 1}
  Hello, {1, 1}
  Hadoop, {1, 1}
  Goodbye, {1}

Reduce (key, value pairs) and Output:
  Hello, 2
  World, 2
  Bye, 1
  Hadoop, 2
  Goodbye, 1
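The WordCount dataflow can be reproduced in a few lines of plain Python. This is a single-machine sketch of the three phases, not Hadoop code, and the function names are chosen only to mirror the stages of the diagram.

```python
from itertools import groupby


def map_phase(line):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Sort and shuffle: group the emitted values by key."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {
        key: [value for _, value in group]
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    }


def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))


splits = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [pair for line in splits for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

Each call to `map_phase` depends on only one split and each call to `reduce_phase` on only one key, so both phases can be spread over many machines.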

Machine learning for Big Data

•  Data mining techniques have proven to be very useful tools for extracting new valuable knowledge from data.
•  The knowledge extraction process from big data has become a very difficult task for most of the classical and advanced data mining tools.
•  The main challenges are to deal with:
– The increasing scale of data
§  at the level of instances
§  at the level of features
– The complexity of the problem.
– And many other points

MLlib: Spark Machine learning library

•  MLlib (2010): a Spark implementation of some common machine learning functionality, as well as associated tests and data generators.
•  Includes:
– Binary classification (SVMs and Logistic Regression)
– Random Forest
– Regression (Lasso, Ridge, etc.)
– Clustering (K-Means)
– Collaborative Filtering
– Gradient Descent Optimization Primitive

https://spark.apache.org/docs/latest/mllib-guide.html
