CS 378 – Big Data Programming (dfranke/courses/2016fall/Lecture14.pdf)
CS 378 – Big Data Programming
Lecture 14: Join Patterns
CS 378 - Fall 2016, Big Data Programming
Review
• Assignment 6: Reduce-side join
  – User session and impression data
• Questions/issues?
• Review: info in syslog
• AvroMultipleInputs
Join Patterns
• Review: Suppose we want to join many sources, only one of which is large
  – User sessions (large)
  – Map from cities to DMA (demographic marketing area)
  – …
• This is called a replicated join
  – All the small files will be replicated to all machines
Replicated Join
• Can be done completely in mappers
  – No need for sort, shuffle, or reduce
  – Files are replicated with DistributedCache
• Restrictions:
  – All but one of the inputs must fit in memory
  – Can only accomplish an inner join, or
  – A left outer join where the large data source is the “left” part
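The mapper-side logic of a replicated join can be sketched in plain Java, without any Hadoop classes. This is a minimal illustration, not the course's assignment code: the class name `ReplicatedJoinSketch`, the `join` method, the record layout (city, session id), and the sample DMA labels are all hypothetical. The small data set is assumed to be already loaded into an in-memory map, which is what the DistributedCache copy makes possible on each machine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a replicated (map-side) join; names and data are made up.
class ReplicatedJoinSketch {

    // Joins each large-side record {city, sessionId} against a small in-memory
    // lookup table (city -> DMA). When leftOuter is true, large-side records
    // without a match are kept with "null" in place of the DMA.
    static List<String> join(List<String[]> sessions,
                             Map<String, String> cityToDma,
                             boolean leftOuter) {
        List<String> out = new ArrayList<>();
        for (String[] session : sessions) {       // stands in for one map() call per record
            String city = session[0];
            String dma = cityToDma.get(city);     // in-memory lookup, no shuffle needed
            if (dma != null) {
                out.add(session[1] + "," + city + "," + dma);
            } else if (leftOuter) {
                out.add(session[1] + "," + city + ",null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> cityToDma = new HashMap<>();
        cityToDma.put("Austin", "DMA-A");         // placeholder labels, not real DMA codes
        cityToDma.put("Dallas", "DMA-B");

        List<String[]> sessions = new ArrayList<>();
        sessions.add(new String[] {"Austin", "s1"});
        sessions.add(new String[] {"Waco", "s2"});

        System.out.println(join(sessions, cityToDma, false)); // inner join
        System.out.println(join(sessions, cityToDma, true));  // left outer join
    }
}
```

The sketch also suggests why the large source has to be the “left” side: each mapper sees only its own split of the large data, so it can tell when a large-side record has no match, but it cannot tell whether a small-side record goes unmatched across the whole data set.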
Replicated Join - Data Flow
Figure 5-2 from MapReduce Design Patterns
Join Patterns
• OK, so replicated join was interesting, but more than one of my data sources is large.
• Is there a way to do a map-side join in this case?
• Or is reduce-side join my only option?
• If we organize the input data in a specific way, we can do this on the map side.
Composite Join
• Hadoop class CompositeInputFormat
• Restricted to inner, or full outer join
• Input data sets must have the same # of partitions
  – Each input partition must be sorted by key
  – All records for a particular key must be in the same partition
• Seems pretty restrictive…
Composite Join
• These conditions might exist for data from other MapReduce jobs where:
• The jobs had the same # of reducers
  – Recall that input data sets must be partitioned in the same way
• The jobs had the same foreign key
• Output files aren’t splittable
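Why the reducer counts have to match can be seen from the partitioning arithmetic. The sketch below assumes hash partitioning and uses the same formula as Hadoop's default HashPartitioner, applied to a Java `String` key purely for illustration (real jobs typically use `Text` keys, whose hash values differ). The class name `PartitionSketch` and the key `"k0"` are hypothetical.

```java
// Hypothetical sketch: why both upstream jobs need the same number of reducers.
class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner, applied here to a
    // String key for illustration (an assumption; Text keys hash differently).
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String key = "k0";

        // Two jobs with 4 reducers each: the key lands in the same partition number,
        // so partition i of one output lines up with partition i of the other.
        System.out.println(partition(key, 4) == partition(key, 4)); // prints true

        // One job with 4 reducers, the other with 5: the partition numbers differ
        // for this key, so the two outputs can no longer be paired file by file.
        System.out.println(partition(key, 4) + " vs " + partition(key, 5)); // prints 1 vs 0
    }
}
```

The same reasoning applies to the key itself: if the two jobs partitioned on different keys, records that should join would not be guaranteed to sit in corresponding partitions.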
Composite Join
• If all those conditions are true, this join works
  – Map-side only, so it’s efficient if we can use it.
• If you find that you are preparing and formatting the data only to be able to use composite join, it’s probably not worth it.
• Just use a reduce-side join.
Composite Join – Data
Composite Join – Data Flow
Composite Join Input
• In the driver code (run() method)
  – Get the file names from the command line
  – Specify the input format, join type, and files
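// conf is the job's JobConf (old org.apache.hadoop.mapred API); file1 and file2 are the input paths
conf.setInputFormat(CompositeInputFormat.class);
// Build the join expression: join type, per-file input format, then the files to join
conf.set("mapred.join.expr",
    CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, file1, file2));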
Composite Join Input
• How might this implement inner join?
  – Outer join?
• Could we do any other join type?
  – Left outer? Anti-join?
• Output: TupleWritable
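One way to think about these questions is as a merge over two key-sorted partitions. The sketch below is a plain-Java illustration under simplifying assumptions, not Hadoop's actual implementation: keys are unique within each input, records are `{key, value}` string pairs, and the joined tuple is rendered as a string rather than a TupleWritable. The class name `MergeJoinSketch` and the `join` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a merge-style join over two sorted partitions.
class MergeJoinSketch {

    // Each input is a list of {key, value} pairs sorted by key, with unique keys
    // (a simplifying assumption). With outer == false only matching keys are
    // emitted (inner join); with outer == true unmatched keys from either side
    // are emitted too, with "null" for the missing side (full outer join).
    static List<String> join(List<String[]> left, List<String[]> right, boolean outer) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String[] l = left.get(i);
            String[] r = right.get(j);
            int cmp = l[0].compareTo(r[0]);
            if (cmp == 0) {                       // same key on both sides
                out.add(l[0] + "=[" + l[1] + "," + r[1] + "]");
                i++;
                j++;
            } else if (cmp < 0) {                 // key only on the left so far
                if (outer) {
                    out.add(l[0] + "=[" + l[1] + ",null]");
                }
                i++;
            } else {                              // key only on the right so far
                if (outer) {
                    out.add(r[0] + "=[null," + r[1] + "]");
                }
                j++;
            }
        }
        if (outer) {                              // drain whichever side has records left
            for (; i < left.size(); i++) {
                out.add(left.get(i)[0] + "=[" + left.get(i)[1] + ",null]");
            }
            for (; j < right.size(); j++) {
                out.add(right.get(j)[0] + "=[null," + right.get(j)[1] + "]");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = new ArrayList<>();
        left.add(new String[] {"a", "1"});
        left.add(new String[] {"b", "2"});
        left.add(new String[] {"d", "4"});
        List<String[]> right = new ArrayList<>();
        right.add(new String[] {"b", "20"});
        right.add(new String[] {"c", "30"});

        System.out.println(join(left, right, false)); // inner
        System.out.println(join(left, right, true));  // full outer
    }
}
```

In this sketch, other join types fall out of filtering the full outer result: keeping only tuples whose left side is present gives a left outer join, and keeping only tuples whose right side is missing gives an anti-join. In a real job that filtering would presumably happen in the mapper that receives each tuple.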
One More Join Pattern
• Suppose we wanted to compare all cars currently available (for sale) to all other cars
  – To identify “similar” cars
  – Usage: “I like this car, show me others like it”
• This join is called “Cartesian Product”
  – Comparing N items to M items requires N x M comparisons
  – Not straightforward to do with MapReduce
Cartesian Product
• Pairs every record with every other record
  – No keys needed
  – N x M results, for data sets of size N, M
• Map-only job
• But still expensive to compute
• Input format class: CartesianInputFormat (a custom input format)
Cartesian Product
• To accomplish this join, we’ll need to pair every record with every other record
• We can start with the approach for composite join
• For composite join, each mapper read two files
  – They had the same key set
  – The data was sorted by key
  – We don’t care about the keys, just the ‘two file input’
Composite Join – Data Flow
One Mapper, Two Inputs
• For composite join, the key order allowed us to:
  – Read each of the two files only once
  – Work very much like merge sort
• For Cartesian product
  – For each record in data set 1
  – We’ll read every record in data set 2
  – This pair of records is passed to the mapper
• We’d accomplish this with a custom input format
  – RecordReader resets data set 2 for each input of data set 1
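The read pattern described above can be sketched as a nested loop in plain Java. This is only an illustration of the idea, not a Hadoop RecordReader: the class name `CartesianSketch`, the `pairs` method, and the in-memory lists standing in for the two input files are all hypothetical. Creating a fresh iterator over data set 2 plays the role of the reset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the Cartesian product read pattern.
class CartesianSketch {

    // For each record in dataset1, re-reads all of dataset2 from the start and
    // emits one pair per combination, as a custom record reader would hand
    // pairs to the mapper.
    static List<String> pairs(List<String> dataset1, List<String> dataset2) {
        List<String> out = new ArrayList<>();
        for (String left : dataset1) {
            Iterator<String> reader2 = dataset2.iterator(); // "reset": start data set 2 over
            while (reader2.hasNext()) {
                out.add(left + "|" + reader2.next());       // this pair goes to the mapper
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> cars1 = Arrays.asList("a", "b");
        List<String> cars2 = Arrays.asList("x", "y", "z");
        List<String> result = pairs(cars1, cars2);
        System.out.println(result.size()); // N x M = 2 x 3 pairs
        System.out.println(result);
    }
}
```

Unlike the composite join, data set 2 is read once per record of data set 1 rather than once overall, which is where the N x M cost comes from even though no shuffle or reduce is involved.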
Cartesian Product – Data Flow