Big Data Learning in Prac - KU Leuven KULAK · Big Data: Technology and Chronology 2001-2010...

Big Data Learning in Practice. 12th September 2016. Isaac Triguero, School of Computer Science, University of Nottingham, United Kingdom. Isaac.Triguero@nottingham.ac.uk http://www.cs.nott.ac.uk/~pszit/benelearn.html


Page 1: Big Data Learning in Practice

12th September 2016

Isaac Triguero
School of Computer Science
University of Nottingham
United Kingdom

Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/benelearn.html

Page 2: Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data Analytics
• A demo with MLlib
• Conclusions

Page 3: What is Big Data?

There is no standard definition!

"Big Data" involves data whose volume, diversity and complexity require new techniques, algorithms and analyses to extract valuable (hidden) knowledge.

Data-intensive applications

Page 4: What is Big Data? The 5 V's definition

Page 5: Big data has many faces

Page 6: Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data Analytics
• A demo with MLlib
• Conclusions

Page 7: How to deal with data-intensive applications?

• Problem statement: scalability to big data sets.
• Example:
  – Explore 100 TB by 1 node @ 50 MB/sec = 23 days
  – Exploration with a cluster of 1000 nodes = 33 minutes
• Solution → Divide-and-Conquer

What happens if we have to manage 1000 or 10000 TB?
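
The two figures in the example can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfect linear speed-up across nodes; the function name scan_seconds is made up for illustration.

```python
# Back-of-the-envelope scan time. Decimal units and ideal linear
# scaling across nodes are assumptions, not measurements.

def scan_seconds(data_tb, mb_per_sec, nodes=1):
    """Seconds needed to read data_tb terabytes at mb_per_sec per node."""
    total_bytes = data_tb * 10**12
    bytes_per_sec = mb_per_sec * 10**6 * nodes
    return total_bytes / bytes_per_sec

single = scan_seconds(100, 50)                # one node
cluster = scan_seconds(100, 50, nodes=1000)   # 1000 nodes, ideal scaling

print(f"1 node:     {single / 86400:.1f} days")     # 23.1 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")    # 33.3 minutes
```

Under the same assumptions, scan_seconds(1000, 50, nodes=1000) is 20000 seconds, roughly 5.6 hours, which gives a feel for the closing question of the slide.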

Page 8: MapReduce

• Parallel programming model
• Divide & conquer strategy
  – divide: partition dataset into smaller, independent chunks to be processed in parallel (map)
  – conquer: combine, merge or otherwise aggregate the results from the previous step (reduce)
• Based on simplicity and transparency to the programmers, and assumes data locality.
• Became popular thanks to the open-source project Hadoop! (Used by Google, Facebook, Amazon, …)
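
The divide and conquer steps can be sketched on a single machine. This is only an illustration of the idea, not Hadoop code: threads stand in for cluster nodes, and the helper names partition, map_chunk and combine are made up here.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(data, n_chunks):
    """Divide: split data into roughly equal, independent chunks."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_chunk(chunk):
    """Map: work done independently on one chunk (here, a sum of squares)."""
    return sum(x * x for x in chunk)

def combine(a, b):
    """Reduce: merge two partial results."""
    return a + b

data = list(range(1, 1001))
chunks = partition(data, n_chunks=4)

# Threads stand in for cluster nodes; each chunk is processed independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_chunk, chunks))

total = reduce(combine, partials)
print(partials, total)
```

Because the chunks do not depend on each other, the map step could run anywhere; only the small partial results need to travel to the reduce step.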

Page 9: Traditional HPC way of doing things

[Figure: worker nodes (lots of them), each running its own OS, linked by a communication network (Infiniband) and, through a network for I/O, to central storage. Input data is relatively small. Lots of computations, lots of communication, limited I/O.]

Source: Jan Fostier. Introduction to MapReduce and its Application to Post-Sequencing Analysis

Page 10: Data-intensive jobs

[Figure: the same architecture (worker nodes with their own OS, fast communication network (Infiniband), network for I/O, central storage) running a data-intensive job. Input data (lots of it, blocks a to j) sits in central storage. Low compute intensity, limited communication, lots of I/O … doesn't scale.]

Page 11: Data-intensive jobs

[Figure: input data (lots of it) split into blocks a to j, with several blocks stored on the local disk of each node. Low compute intensity, limited communication over the communication network.]

Solution: store data on local disks of the nodes that perform computations on that data ("data locality")

Page 12: Hadoop

http://hadoop.apache.org/

• Hadoop is:
  – An open-source framework written in Java
  – Distributed storage of very large data sets (Big Data)
  – Distributed processing of very large data sets
• This framework consists of a number of modules
  – Hadoop Common
  – Hadoop Distributed File System (HDFS)
  – Hadoop YARN – resource manager
  – Hadoop MapReduce – programming model

Page 13: Hadoop MapReduce: Main Characteristics

• Automatic parallelization:
  – Depending on the size of the input data → there will be multiple MAP tasks!
  – Depending on the number of keys <k, value> → there will be multiple REDUCE tasks!
• Scalability:
  – It may work over every data center or cluster of computers.
• Transparent for the programmer
  – Fault-tolerant mechanism.
  – Automatic communications among computers
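
The two parallelization rules can be sketched as follows. This is a simplified illustration, not Hadoop's code: the 128 MB split size and the number of reducers are assumptions (both are job or cluster settings in practice), and the function names are hypothetical.

```python
import math
import zlib

SPLIT_MB = 128  # assumed split size; the real value is a cluster setting

def num_map_tasks(input_mb, split_mb=SPLIT_MB):
    """One map task per input split, so the count grows with input size."""
    return max(1, math.ceil(input_mb / split_mb))

def reducer_for(key, num_reducers):
    """Hash partitioning: every occurrence of a key goes to the same reducer."""
    # crc32 is used because Python's built-in hash() of str varies between runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

print(num_map_tasks(10_000))  # 79 map tasks for a 10,000 MB input

for key in ["Hello", "World", "Bye", "Hadoop"]:
    print(key, "-> reducer", reducer_for(key, num_reducers=3))
```

The point of hashing the key is that all values of one key meet in a single reduce task without the programmer routing them by hand.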

Page 14: Data Sharing in Hadoop MapReduce

[Figure: iterative jobs (Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → ...) and interactive queries (Input → HDFS read → query 1, query 2, query 3, ... → result 1, result 2, result 3, ...).]

Slow due to replication, serialization, and disk IO
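
A toy model makes the cost of this pattern visible. The sketch below only counts how often the input is read from storage; CountingStore is a made-up stand-in for HDFS, and the writes between iterations are left out for brevity.

```python
class CountingStore:
    """Stand-in for a distributed file system: counts full reads of the data."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._data)

def iterate_rereading(store, iterations):
    """MapReduce style: every iteration loads its input from storage again."""
    total = 0
    for _ in range(iterations):
        total += sum(store.read())
    return total

def iterate_cached(store, iterations):
    """In-memory style: load once, then reuse the cached working set."""
    cached = store.read()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total

disk = CountingStore(range(10))
print(iterate_rereading(disk, 5), disk.reads)   # 225 5

memory = CountingStore(range(10))
print(iterate_cached(memory, 5), memory.reads)  # 225 1
```

Both variants compute the same result; the difference is that the first pays the storage cost once per iteration.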

Page 15: Paradigms that do not fit with Hadoop MapReduce

• Directed Acyclic Graph (DAG) model:
  – The DAG defines the dataflow of the application, and the vertices of the graph define the operations on the data.
• Graph model:
  – More complex graph models that better represent the dataflow of the application.
  – Cyclic models -> Iterativity.
• Iterative MapReduce model:
  – An extended programming model that supports iterative MapReduce computations efficiently.

Page 16: New platforms to overcome Hadoop's limitations

• GIRAPH (APACHE Project): http://giraph.apache.org/ (iterative graph processing)
• GPS - A Graph Processing System (Stanford): http://infolab.stanford.edu/gps/ (Amazon's EC2)
• Distributed GraphLab (Carnegie Mellon Univ.): https://github.com/graphlab-code/graphlab (Amazon's EC2)
• HaLoop (University of Washington): http://clue.cs.washington.edu/node/14, http://code.google.com/p/haloop/ (Amazon's EC2)
• Twister (Indiana University): http://www.iterativemapreduce.org/ (private clusters)
• PrIter (University of Massachusetts Amherst, Northeastern University-China): http://code.google.com/p/priter/ (private cluster and Amazon EC2 cloud)
• GPU-based platforms: Mars, Grex
• Spark (UC Berkeley): http://spark.incubator.apache.org/research.html

Page 17: Big data technologies

Page 18: What is Spark?

Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop

Efficient
• General execution graphs
• In-memory storage
Up to 10× faster on disk, 100× in memory

Usable
• Rich APIs in Java, Scala, Python
• Interactive shell
2-5× less code

Page 19: Spark Goal

• Provide distributed memory abstractions for clusters to support apps with working sets
• Retain the attractive properties of MapReduce:
  – Fault tolerance (for crashes & stragglers)
  – Data locality
  – Scalability

Initial Solution: augment data flow model with "resilient distributed datasets" (RDDs)

Page 20: RDDs in Detail

• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• There are two ways to create RDDs:
  – Parallelizing an existing collection in your driver program
  – Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase.
• Can be cached for future reuse

Page 21: Operations with RDDs

• Transformations (e.g. map, filter, groupBy, join)
  – Lazy operations to build RDDs from other RDDs
• Actions (e.g. count, collect, save)
  – Return a result or write it to storage

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Parallel operations (return a result to driver): reduce, collect, count, save, lookupKey, …
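
The split between lazy transformations and actions can be shown with a toy class. MiniRDD below is a hypothetical, single-machine stand-in written for this note; it is not the Spark API and has no partitioning or fault tolerance. It only shows that transformations record work while actions trigger it.

```python
class MiniRDD:
    """Toy stand-in for an RDD: records a chain of lazy transformations."""

    def __init__(self, source, ops=()):
        self._source = source      # any re-iterable collection
        self._ops = tuple(ops)     # pending transformations, not yet run

    # Transformations: return a new MiniRDD and compute nothing.
    def map(self, fn):
        return MiniRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._ops + (("filter", pred),))

    def _iterate(self):
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the whole chain and return a result to the caller.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []

def square(x):
    calls.append(x)   # lets us observe when the work really happens
    return x * x

rdd = MiniRDD(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
print(len(calls))     # 0: nothing has been computed yet
print(rdd.collect())  # [1, 9, 25]
```

Deferring the work until an action is called is what lets an engine see the whole chain and plan it before touching the data.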

Page 22: Spark vs. Hadoop

K-Means: iteration time (s) by number of machines [Zaharia et al., NSDI'12]

Number of machines    25     50     100
Hadoop                274    157    106
HadoopBinMem          197    121    87
Spark                 143    61     33

Lines of code for K-Means
• Spark ~ 90 lines
• Hadoop ~ 4 files, > 300 lines
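
K-Means is a useful benchmark here because it is iterative: every iteration scans the full data set again. The sketch below writes one iteration as a map step (assign each point to its nearest centroid) and a reduce step (average each group). It is plain single-machine Python on made-up points, not the benchmarked code.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def kmeans_step(points, centroids):
    # Map: emit (centroid index, point) for every point.
    assigned = [(nearest(p, centroids), p) for p in points]

    # Shuffle: group points by centroid index.
    groups = defaultdict(list)
    for idx, p in assigned:
        groups[idx].append(p)

    # Reduce: new centroid = mean of its group; keep the old one if empty.
    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if members:
            new_centroids.append(
                tuple(sum(dim) / len(members) for dim in zip(*members))
            )
        else:
            new_centroids.append(old)
    return new_centroids

# Made-up toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

# Every iteration reads all points again, so keeping them in memory pays off.
for _ in range(5):
    centroids = kmeans_step(points, centroids)
print(centroids)
```

With the data held in memory, only the small list of centroids changes between iterations, which is the access pattern that in-memory engines are designed for.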

Page 23: Apache Spark – new collections

DataFrame (Spark 1.3+)
- Equivalent to a table in a relational database (data frame in R/Python)
- Avoids the Java serialization performed by RDDs.
- API natural for developers who are familiar with building query plans (e.g. SQL expressions).

Datasets (Spark 1.6+)
- Best of both DataFrame and RDDs.
- Functional transformations (map, flatMap, filter, etc.)
- Spark SQL's optimised execution engine.

Page 24: Flink

https://flink.apache.org/

Page 25: Big Data: Technology and Chronology

2001-2010
• 2001: 3V's, Gartner (Doug Laney)
• 2004: MapReduce, Google (Jeffrey Dean)
• 2008: Hadoop, Yahoo! (Doug Cutting)

2010-2016
• 2010: Spark, UC Berkeley (Matei Zaharia); Apache Spark, Feb. 2014
• 2009-2013: Flink, TU Berlin (Volker Markl); Flink Apache, Dec. 2014
• 2010-2016: Big Data Analytics: Mahout, MLlib, …; Hadoop Ecosystem; Applications; New Technology

Page 26: Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data Analytics
• A demo with MLlib
• Conclusions

Page 27: Big Data Analytics

Potential scenarios
• Clustering
• Classification
• Association
• Recommendation Systems
• Real Time Analytics / Big Data Streams
• Social Media Mining, Social Big Data

Page 28: Big Data Analytics: A 3 generational view

Page 29: Mahout (Samsara)

http://mahout.apache.org/

• First ML library initially based on Hadoop MapReduce.
• Abandoned MapReduce implementations from version 0.9.
• Nowadays it is focused on a new math environment called Samsara.
• It is integrated with Spark, Flink and H2O
• Main algorithms:
  – Stochastic Singular Value Decomposition (ssvd, dssvd)
  – Stochastic Principal Component Analysis (spca, dspca)
  – Distributed Cholesky QR (thinQR)
  – Distributed regularized Alternating Least Squares (dals)
  – Collaborative Filtering: Item and Row Similarity
  – Naive Bayes Classification

Page 30: Spark Libraries

https://spark.apache.org/mllib/

Page 31: As of Spark 2.0

Page 32: FlinkML

https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/ml/

Page 33: Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data Analytics
• A demo with MLlib
• Conclusions

Page 34: Demo

• In this demo I will show two ways of working with Apache Spark:
  – Interactive mode with Spark Notebook.
  – Standalone mode with Scala IDE.
• All the code used in this presentation is available at:
  http://www.cs.nott.ac.uk/~pszit/benelearn.html

Page 35: DEMO with Spark Notebook in local

http://spark-notebook.io/

Pages 36-43: DEMO with Spark Notebook in local [no text on these slides]

Page 44: DEMO with Spark Notebook in local

Advantages:
✓ Interactive.
✓ Automatic plots.
✓ It allows connection with a cluster.
✓ Tab completion

Disadvantages:
• Built-in for specific Spark versions.
• Difficult to integrate your own code.

Page 45: DEMO with Scala IDE

http://scala-ide.org/

Page 46: Example: An Imbalanced Big Data problem

• Two main approaches to tackle this problem:
  – Data sampling:
    • Undersampling
    • Oversampling
    • Hybrid approaches
  – Algorithmic modifications

I. Triguero et al., Evolutionary Undersampling for Extremely Imbalanced Big Data Classification under Apache Spark. IEEE Congress on Evolutionary Computation (CEC 2016), Vancouver (Canada), 640-647, July 24-29.
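
The two basic data sampling options can be sketched in a few lines. The code below shows plain random undersampling and random oversampling on made-up rows; it is the simplest variant of each idea, not the evolutionary undersampling method of the cited paper, and the function names are hypothetical.

```python
import random
from collections import Counter

def _group_by_class(rows, label_of):
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    return by_class

def random_undersample(rows, label_of, seed=0):
    """Drop rows at random until every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, label_of)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

def random_oversample(rows, label_of, seed=0):
    """Duplicate rows at random until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, label_of)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Made-up toy data: 95 negative rows and 5 positive rows.
rows = [(i, "neg") for i in range(95)] + [(i, "pos") for i in range(95, 100)]

def label(row):
    return row[1]

print(Counter(map(label, random_undersample(rows, label))))
print(Counter(map(label, random_oversample(rows, label))))
```

Undersampling shrinks the training set, which also helps with scale, at the price of discarding majority-class information; oversampling keeps everything but makes the data set larger.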

Pages 47-50: Imbalanced Big Data Classification with Spark [no text on these slides]

Page 51: Run examples from Scala IDE

Page 52: Run examples from terminal

$ mvn package -Dmaven.test.skip=true

$ /opt/spark/bin/spark-submit --master local[*] \
    --class Undersampling.UndersamplingExample \
    target/EUS-0.0.1-BETA.jar \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25.header \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tra100000.data \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tst10000.data \
    4 4 RUS DecisionTree /Users/pszit/outputRUS-DecisionTree

Page 53: Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data Analytics
• A demo with MLlib
• Conclusions

Page 54: Conclusions

• We need new strategies to perform ML on big datasets
  – Choosing the right technology is like choosing the right data structure in a program.
• The world of big data is rapidly changing. Being up-to-date is difficult but necessary.
• Interactive notebooks are very useful for a quick start and standard experiments.

Page 55: Acknowledgments

Thank you

Page 56: Big Data Learning in Practice

12th September 2016

Isaac Triguero
School of Computer Science
University of Nottingham
United Kingdom

Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/

Page 57: Extra slides

Page 58: What is Big Data? The 5 V's definition

Volume: data at rest
• Vast amounts of data generated every second
• Data sets are becoming too large to store using traditional database technology
• Big data technology stores these data sets using distributed systems

Page 59: What is Big Data? The 5 V's definition

Velocity: data in motion
• Speed at which:
  – Data is generated
  – Data needs to be analyzed.
• Continuous data streams are being captured (e.g. from sensors or mobile devices) and produced
• Late decisions imply missed opportunities

Page 60: What is Big Data? The 5 V's definition

Variety: data in many forms
• One application may generate many different kinds of data
• Several formats and structures:
  – Structured data:
    • Tables, relational databases
  – Unstructured data:
    • Text, images, audio, video.

Page 61: What is Big Data? The 5 V's definition

Veracity: data in doubt
• Uncertainty about the quality of the data.
  – E.g. natural language processing on social media: typos, abbreviations, colloquial speech.
• Data may be missing, ambiguous, or even completely wrong.

Page 62: What is Big Data? The 5 V's definition

Value: data in use
• Most important motivation for big data
• Big data may result in:
  – Better statistics/models
  – Novel insights
  – New opportunities for research and industry

Page 63: Big Data: applications

• Science and research:
  – E.g. Physics, Bioinformatics, astronomy.
• Healthcare and public health:
  – Better personalized medicine
• Business and e-commerce
  – Personalized advertisement.
• Financial services
  – Insurance, banks.

Page 64: MapReduce

• Based on functional programming (e.g. Lisp)
  – Operates on <key, value> pairs
    • Web-based example: key = URL; value = web page
    • Graph-based example: key = nodes; value = adjacency list
  – Users specify two functions:
      map: (k1, v1) → list[k2, v2]
      reduce: (k2, list[v2]) → list[k3, v3]
  – Sorting of intermediate keys between map and reduce phase
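
The two signatures and the sort between the phases fit into one small higher-order function. The sketch below is a single-machine illustration of the programming model only; map_reduce and the graph data are made up, and it uses the graph-based convention from the slide (key = node, value = adjacency list) to count incoming edges.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Run mapper over (k1, v1) records, sort by k2, then run reducer per key."""
    # Map phase: each input pair may emit any number of (k2, v2) pairs.
    intermediate = [pair for k1, v1 in records for pair in mapper(k1, v1)]

    # Sort intermediate pairs by key so that equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: one call per distinct k2 with the list of its values.
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(k2, [v2 for _, v2 in group]))
    return output

# Made-up graph: key = node, value = adjacency list.
graph = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"]), ("d", ["c"])]

def emit_incoming(node, neighbours):
    # map: (k1, v1) -> list[(k2, v2)]; one (target, 1) pair per outgoing edge.
    return [(n, 1) for n in neighbours]

def count_incoming(node, ones):
    # reduce: (k2, list[v2]) -> list[(k3, v3)]
    return [(node, sum(ones))]

print(map_reduce(graph, emit_incoming, count_incoming))
# [('a', 1), ('b', 1), ('c', 3)]
```

The user only writes emit_incoming and count_incoming; grouping and ordering of the intermediate keys is handled by the framework function.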

Page 65: MapReduce

The dataflow in MapReduce is transparent to the programmers

Page 66: Word Count using MapReduce

Input file:
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop

Splitting:
  Split 1: Hello World Bye World
  Split 2: Hello Hadoop Goodbye Hadoop

Map (key, value):
  Split 1: Hello,1  World,1  Bye,1  World,1
  Split 2: Hello,1  Hadoop,1  Goodbye,1  Hadoop,1

Sort and Shuffle:
  Bye,{1}  World,{1,1}  Hello,{1,1}  Hadoop,{1,1}  Goodbye,{1}

Reduce (key, value pairs) and Output:
  Hello,2  World,2  Bye,1  Hadoop,2  Goodbye,1
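
The stages of the word count can be replayed step by step on the slide's two input lines. This is a plain single-machine sketch that keeps each stage in its own variable so the intermediate data can be inspected; it is not Hadoop code.

```python
from collections import defaultdict

# Splitting: one split per input line.
splits = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

# Map: one (word, 1) pair per word, produced independently for each split.
mapped = [[(word, 1) for word in line.split()] for line in splits]

# Sort and shuffle: bring all values of the same key together.
shuffled = defaultdict(list)
for pairs in mapped:
    for word, one in pairs:
        shuffled[word].append(one)

# Reduce: sum the list of ones for each word.
counts = {word: sum(ones) for word, ones in sorted(shuffled.items())}

print(mapped[0])       # [('Hello', 1), ('World', 1), ('Bye', 1), ('World', 1)]
print(dict(shuffled))
print(counts)          # {'Bye': 1, 'Goodbye': 1, 'Hadoop': 2, 'Hello': 2, 'World': 2}
```

Goodbye occurs once in the input, so its shuffled list holds a single 1 and its final count is 1.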

Page 67: Machine learning for Big Data

• Data mining techniques have proven to be very useful tools for extracting new valuable knowledge from data.
• The knowledge extraction process from big data has become a very difficult task for most of the classical and advanced data mining tools.
• The main challenges are to deal with:
  – The increasing scale of data
    • at the level of instances
    • at the level of features
  – The complexity of the problem.
  – And many other points

Page 68: MLlib: Spark Machine learning library

• MLlib (2010) is a Spark implementation of some common machine learning functionality, as well as associated tests and data generators.
• Includes:
  – Binary classification (SVMs and Logistic Regression)
  – Random Forest
  – Regression (Lasso, Ridge, etc.)
  – Clustering (K-Means)
  – Collaborative Filtering
  – Gradient Descent Optimization
  – Primitive

https://spark.apache.org/docs/latest/mllib-guide.html