BACHELOR THESIS -...
Transcript of BACHELOR THESIS -...
BACHELOR THESIS
Evaluating Presto as an SQL on HadoopSolution
A Case at Truecaller
Sahir Ahmed2016
Bachelor of ArtsSystems Science
Luleå University of TechnologyDepartment of Computer science, Electrical and Space engineering
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
EvaluatingPrestoasanSQLonHadoopsolution
acaseattruecaller
SahirAhmed–VT2016
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
Abstract Truecallerisamobileapplicationwithover200millionuniqueusersworldwide.Everydaytruecallerstoresover1billionrowsofdatathattheyusetoanalyseforimprovingtheirproduct.ThedataisstoredinHadoop,whichisaframeworkforstoringandanalysinglargeamountsofdataonadistributedfilesystem.Inordertobeabletoanalysetheselargeamountsofdatatheanalyticsteamneedsanewsolutionformorelightweight,ad-hocanalysis.ThisthesisevaluatestheperformanceofthequeryenginePrestotoseeifitmeetstherequirementstohelpthedataanalyticsteamattruecallergainefficiency.Byusingadesign-sciencemethodology,Presto’sprosandconsarepresented.Prestoisrecommendedasasolutiontobeusedtogetherwiththetoolstodayforspecificlightweightusecasesforusersthatarefamiliarwiththedatasetsusedbytheanalyticsteam.Othersolutionsforfutureevaluationarealsorecommendedbeforetakingafinaldecision.
Keywords:Hadoop,BigData,Presto,Hive,SQLonHadoop
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
Table of contents Abstract.............................................................................................................................2
Tableofcontents...............................................................................................................3
Tableoffigures..................................................................................................................5
Tableoftables...................................................................................................................6
Glossary.............................................................................................................................7
1. Introduction................................................................................................................11.1Problemstatement................................................................................................................1
1.1.1Problemdescription...............................................................................................................11.1.2Researchquestions................................................................................................................11.1.3Purpose..................................................................................................................................1
1.2Motivationsforresearch........................................................................................................21.3Delimitations.........................................................................................................................2
2. Background.................................................................................................................32.1Truecaller...............................................................................................................................32.2BigData.................................................................................................................................3
2.2.1WhatisBigData?...................................................................................................................32.2.2DataAnalytics.........................................................................................................................52.2.3DataWarehouse.....................................................................................................................5
2.3Hadoop..................................................................................................................................72.3.1WhatisHadoop......................................................................................................................72.3.2TruecallerandHadoop...........................................................................................................72.3.3MapReduce............................................................................................................................82.3.4HDFS.....................................................................................................................................102.3.5YARN.....................................................................................................................................11
2.3.5.1Slider.............................................................................................................................................132.3.6SQLonHadoop.....................................................................................................................13
2.3.6.1Hive..............................................................................................................................................132.3.6.2Presto...........................................................................................................................................15
3. Literaturereview.......................................................................................................173.1Purposeofliteraturereview................................................................................................173.2Benchmarks.........................................................................................................................17
3.2.1Fast-data-hackathon............................................................................................................173.2.2RenminUniversity................................................................................................................183.2.3Commercialbenchmarks......................................................................................................19
3.2.3.1Pivotal(HAWQ)............................................................................................................................193.2.3.2Cloudera(Impala).........................................................................................................................20
3.3Hivenotmeantforlow-latencyquerying.............................................................................21
4. Researchmethodology..............................................................................................224.1Identifyproblems&Motivate..............................................................................................224.2DefineObjectivesforsolution..............................................................................................234.3Design&Development........................................................................................................23
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
4.4Demonstration.....................................................................................................................244.5Evaluation............................................................................................................................244.6Communication...................................................................................................................24
5. Result........................................................................................................................255.1Clusterdetails......................................................................................................................255.2Implementation...................................................................................................................255.3Usecasesandlatencyinseconds.........................................................................................255.4Resulttable..........................................................................................................................27
6. Discussion..................................................................................................................286.1Prestoattruecaller..............................................................................................................286.2Usecases.............................................................................................................................286.3Limitations...........................................................................................................................286.4Commercialalternatives......................................................................................................286.5Furtherresearch..................................................................................................................29
7. Conclusion.................................................................................................................30
8. References.................................................................................................................318.1Booksandarticles................................................................................................................318.2Websitesandblogposts......................................................................................................32
9. Appendices................................................................................................................339.1AppendixA–Prestoinstallation..........................................................................................339.2AppendixB–Usecasequeries.............................................................................................38
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
Table of figures Fig1.Exponentialgrowthofdata(Fishman,2014,July14).................................................4
Fig2.TheBigdataprocessandsubprocesses(Gandomi&Haider,2015)...........................5
Fig3.Starschema(Connolly&Begg,2005)........................................................................6
Fig4.SimplifiedMapReducedataflow(White,2010).........................................................7
Fig5.MapReduceflowwithonereducer(White,2010).....................................................9
Fig6.MapReducedataflowwithmultiplereducers(White,2010)...................................10
Fig7.HDFSArchitecture(Srinivasa&Mupalla,2015)........................................................11
Fig8.ApplicationsrunningonYARN(White,2010)...........................................................12
Fig9.YARNcomponents(ApacheHadoop,2016)..............................................................13
Fig10.HiveArchitecture(Thusooetal.,2009)..................................................................15
Fig11.PrestoArchitecture(Teradata,2016).....................................................................16
Fig12.fast-data-hackathonresults(TreasureDataBlog,2015,March20).........................18
Fig13.RenminUniversityresultsfor100-nodecluster(Chenetal.,2014).........................19
Fig14.Hawq’sspeedupratiovs.Impala(Solimanetal.,2014).........................................19
Fig15.Cloudera’sImpalabenchmark(Cloudera,2016,February11)................................20
Fig.16DesignScienceResearchMethodologyPeffers,Tuunanen,Rothenberger&Chatterjee,2007)..............................................................................................................22
Fig17.Queryexecutiontimeinseconds...........................................................................27
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
Table of tables Table1.Glossary................................................................................................................7
Table2.Whatisbigdata?(Cervone,2015).........................................................................4
Table3.RDBMScomparedtoMapReduce(White,2010)...................................................8
Table4.Requirements......................................................................................................23
Table5.Clusterdetails......................................................................................................25
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
Glossary Hadoop OpensourceJavaframeworkfor
distributeddatastorageHDFS HadoopdistributedfilesystemYARN (Yetanotherresourcenegotiator)
resourcemanagerforHadoopRDBMS Relationaldatabasemanagement
systemNode Commoditycomputer(hardware)used
byHDFSCluster Agroupofconnectednodesusedby
HDFSHive SQLdatawarehouseforHadoopMapReduce Programmingmodelforprocessingdata
onHadoopPresto SQLontopofHadoopsolutionSlider Applicationtorunnon-Yarn
applicationsonYARN
Table 1. Glossary
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
1
1. Introduction Thissectionintroducesthereadertotruecaller,thecompanywherethisstudyisconductedandgivessomebackgroundinformationaboutthecurrentsituationregardingthedatathatisstoredandanalysed.Followingthatinformationabackgroundregardingthedatawarehouseandtoolsusedfordataanalyticswithintheanalyticsteamoftruecallerisexplained.Hadoopisoneofthemaintoolsusedanddescribedonafairlyhigh-level.Truecallerisamobileappusedbyover200millionusersworldwide.Attruecallertoday,thedailyincomingdata(abouthalfabillionrows)isstoredinasystemdistributedfilesystemusingtheopensourceJavaframeworkHadoop.Thecollecteddataisusedbyallthedepartmentsofthecompanyforimprovingtheproductandtakingnewdecisions.Inordertoanalysethelargeamountsofdata,theData-AnalyticsteamislookingforanewlowlatencysolutionthatcanbeusedasacomplementtotheirstandardHadooptools,whichinsomecasesareconsideredtooslow.Thegoalofthisresearchistoidentify,testandevaluateanumberoftoolsthatfitstheteam’srequirementsandcouldpotentiallybeimplemented.
1.1 Problem statement “Runningad-hocqueriesonHiveistooslowfordatadiscovery”–BjörnBrinne,HeadofData-AnalyticsatTruecaller
1.1.1 Problem description Theproblemwithinthedata-analyticsorganisationattruecalleristhattheuseofHiveisnotfastenoughforad-hocqueryingandon-the-flydatadiscovery.TheusersruntheirqueriesinHiveandlongerquerieswithmultiplejoinsandaggregationscantakeuptoseveralhourstoexecute.Afterthequeryhasrun,theusermightnotbesatisfiedwiththeoutcomeandwillhavetoadjustandrerunthequery,sometimesinmultipleiterations.Thismakesthewholeprocessineffectiveandholdsbackthedatascientistsandotherusersfromusingtheircreativitytoaddmorevaluetotheirjobandcompanybecauseofpoorperformanceofthetools.Italsolimitsthemtojustfollowwhatisprioritizedatthattimeandleadstosomeworknotbeingdoneorpushedreallyfarbackinpriorityand.Thiscanalsomaketheanalyticsorganisationlessinnovative.
1.1.2 Research questions 1.HowmuchfasterthanHiveisPrestoontruecaller’sHadoopcluster?2.WhichtypeofusecasesisPrestosuitable/notsuitablefor?
1.1.3 Purpose Thepurposeofthisprojectistoidentify,testandimplementasolutiontosupportdata-discoveryandad-hocqueryingofdatainthetruecallerdatawarehouse.ThesolutionisaimedtocomplementHadoopandwillhopefullybeimplementedandusedbytheanalytics-teamattruecaller.Thereareseveralcommercialandopensourceproductsaresupposedtoprovidethatfunctionalitytotheusers.Thoseproductsoftenvaryinperformanceanddependontheenvironmentthattheyaresetupin.Therefortheaimandpurposeofthisresearchisonlymeanttosolvetheissuesanduse-casesattruecallerandisnotintendedtobeusedasageneralrecommendationorsolutionforothercompaniestorelyupon.However,theoutcomescanprovidethereaderswithanoverviewthatcan
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
2
becomparedtootherarticlesonthesubject.Themaingoalistoadvicetruecalleronwhichoftheidentifiedsolutionstoimplementanduseasananalyticaltool.
1.2 Motivations for research ThemainmotivationistotestandevaluatenewertechniqueswithinthegrowingBigDatafieldandtocomparethemwiththeindustrystandardtoolHive.Asecondarymotivationistoprovidesupportandhelptothedata-analyticsteamattruecaller,whichisundergoingrapidgrowth,andwishestofindamoresuitabletoolforfasterqueryingandanalysisoftheirdailyincomingdata.
1.3 Delimitations Theprojectwilltakeplacehalftimeunder4monthsandwillbebasedonpre-specifiedusecasesfortruecaller.Onesolutionistobetested.Firstonatestenvironmentlikee.g.avirtualonenodeclusterandasasecondstageintheproductionenvironment.Thesolutionisfreeopensourcetokeepthecostsdown.Thisprojectwillmostlyfocusonhowfastthedataispresentedandwillnotgoindepthonhowthetechniquesworkbehindthescenes,duetothesubjectofInformationSystemsScience.Thedatasetswillbepredefinedandtheaimsotofinallyusethedatawarehouseanddailyincomingdata.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
3
2. Background
2.1 Truecaller TruecallerisaSwedishmobileapplicationcompanybasedinStockholm.Theirflagshipapplication;truecallerisdevelopedtoprovideuserswithinformationaboutwantedandunwantedcallsfromunknownnumbers.Bydoingthattruecallerhelpstheusersidentifyingspamcallsandmessages,givingthempossibilitiestoblockthosenumbersandalsosharethisinformationwiththerestoftheuserbase(Truecaller,2016).
Truecallerhastodayover200millionusersandissteadilygrowing.ThelargestusergroupislocatedinIndiawheretherearemorethan100millionusers.InIndiaalone,truecallerhelpsidentifyabout900millioncallsand120millionspamcallseverymonth.Thecompanystoresover2billionnamesandnumbersinitsdatawarehousewhereitislateranalysedforfurtheractions.Workingtogetherwithseveraltelecomprovidershelpstruecallertoreachouttonewusersindifferentcountries(Choudury,2015,October12).
InIndiawheretruecallerisverypopular,thenumberofunwanted/spamcallsisabigproblem.Thegovernmenthasspecialregulationsconcerningspamcalls(Rukmini,2014,December13)andthiscanbeonefactortothepopularityofthetruecallerapplication.
Truecallerstoresover1billiondailyeventsintheirdatawarehouse,wherethisdataislaterusedforanalysis.Theseanalysiscanbebasedonuserdata,applicationperformance,A/Btestingetc.andareusedforfurtherdevelopmentoftheapplication.
2.2 Big Data “Therewere5Exabyte’sofinformationcreatedbetweenthedawnofcivilizationthrough2003,butthatmuchinformationisnowcreatedeverytwodays.”-EricSchmidt,SoftwareEngineerandExecutiveChairmanofGoogle(Walker,2015)
2.2.1 What is Big Data? ThedefinitionofBigDataisoftendiscussedandisapopularandrecurringsubjectwithintheITworld.Somechoosetodefine“What”BigDataiswhileotherschoosetolookatwhatit“Does”(Gandomi&Haider,2015).Thedefinitionisthereforenotcrystalclearanduseof“thethreeV:s(Volume,VarietyandVelocity)”hasemergedandisbeingusedtodescribeBigData.VolumehastodowiththesizeandisoftenthefirstthingthatcomesinmindwhentalkingaboutBigDataseefig.2.WhentalkingaboutsizesinBigData,sizesovermultiplePetabytesoftencomestomind(Walker,2015)Keepinmindthatthesesizesincreaseallthetime.ItincreasesevenfasterthanMoore’sLawthatmeansitwoulddoubleinsizeeverytwoyears.Theincreaseofdataactuallydoublesinsizeaboutevery18months(NationalInstituteofStandardsandTechnology,2015).Varietyconcerningthevaryingtypesofdata,whichshowsthatthemajority(over90%)ofthedataisunstructuredduetothedifferenttypesofdatathatisuploadedallthetimethroughdifferentchannelssuchas,photos,videos,tweetsetc.Structureddataisstoredinrelationaldatabasesintables;thisisnotthecasefordatastoredinHadoop.Velocityreferstothespeedofwhichthedataisproduced,storedandanalysedseefig.1.Overthelatestyearswheresmartphoneshavebecomewidelyused,therateofdatavelocityhasexplodedandtheneedsforreal-timeanalyticsareinhigh-demand.Usingdatafromendusersfromasmartphonesvendorscananalysedatacollectedfromthehandhelddevices
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
4
suchasgeospatiallocation,phonenumbers,demographicsetc.(Gandomi&Haider,2015).Thesearethethree“main”V:sbutovertimeadditionalV:ssuchasVariability,Value,Validity,Veracityetc.havemadeitintodiscussionindefiningBigDataanddataanalytics(NationalInstituteofStandardsandTechnology,2015).
Fig 1. Exponential growth of data (Fishman, 2014, July 14).
Table 2. What is big data? (Cervone, 2015). HereisanexampleofoneofmanydefinitionsofBigData:“BigDataconsistsofextensivedatasets-primarilyinthecharacteristicsofvolume,variety,velocity,
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
5
and/orvariability-thatrequireascalablearchitectureforefficientstorage,manipulation,andanalysis.”(NationalInstituteofStandardsandTechnology,2015)
Sowhat“Does”BigDatado,whenisitusedandbywho?Wetakealookatitinnextparagraph.
2.2.2 Data Analytics BigDataisuselessifitisjuststoredandneverused.Themainreasonforstoringallthedataistoanalyseitandsomehowmakeuseofit.Withthehelpofdataanalysis,companiescanmakeimportantdecisionsfortheirprocesses(Walker,2015).Thewaysofanalysingdataisnotsodifferentfromthepast,usingstatisticalmodelsandformulas.Thedifferenceismostlyaboutthesizeofthedatasetsandthespeedofanalysinglargeramountsofdata.Also,withthehelpoftoolssuchasHadoop,companiescanfocusmoreoncausationbecauseoftheavailabilityofdatathatcanbeprovided.Theuseofcausationhelpsustodeterminewhysomethingishappeningwhichleadstomakingbetter(NationalInstituteofStandardsandTechnology,2015).AccordingtoLambrinisandJagadish(2012)(Walker,2015),theprocessforextractinginsightsofdataisdividedintofivestagesthatarepartofthetwosubprocesses,data-managementandanalyticsseefig.3.Thefirstthreestagesarewithinthemorebackendprocessdata-managementandcoversacquiringandstoringthedata,cleaningandpreparingitandthenpresenting/deliveringittotheanalysisprocesswherethedataisactuallyanalysed,interpretedandvisualized(Gandomi&Haider,2015).
Fig 2. The Big data process and sub processes (Gandomi & Haider, 2015).
2.2.3 Data Warehouse Adatawarehouseisadataset/databasedatacollectedfromoneormultipledatasourcesusedtoanalysehistoricaldatatohelporganisationswithdecision-making.In1990BillInmonintroducedthetermdatawarehouse,andin1993hestatedthatdatawarehouseisdefinedas:
- Subject-oriented:Meaningthatthedatafocusesonspecificsubjectsregardingthebusinesse.g.customers,productsetc.
- Integrated:Becausethedataoftencomesfrommultiplesourcesthedatahastobere-structuredtoholdaconsistentpatternofthedatawarehouse.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
6
- Timevariant:Thedataisfetchedindifferenttimesanddatesithastobeshownsothatitcanmakesensefordifferenttimeintervalsandtimerelatedmatters.
- Non-volatile:Thedatagrowsincrementallyanddoesnotreplacetheolddatabutgrowssteadilyovertime(Connolly&Begg,2005).
DatawarehousesareoftenconnectedtoOLAP(OnlineAnalyticalProcessingTools)toolsforanalysis,miningandmachinelearning.TheprocesswhenadatawarehouseretrievesnewdatafromitsdatasourcesiscalledETL(extracttransformload).Datawarehousesareoftenmodelledusingastar-schema,whichdiffersfromanormalrelationaldatabasemodel(Jukić,Sharma,Nestorov&Jukić,2015).Astar-schemaisamultidimensionalmodelseefig.4usingfacttablesthatrepresentsabusinessobjecte.g.apurchase.Thefacttablesareconnectedtomultipledimensiontableswithspecificgrainsthattogethermakeupthefactsinthefacttablese.g.aproductdimension,timedimensionetc.Togethertheycanshowwhatproductwaspurchasedatwhattimeanddateetc.Datawarehousescanbemodelledwithdifferentschemastooe.g.snowflake-schema,OLAPcubeetc.(Connolly&Begg,2005).
Fig 3. Star schema (Connolly & Begg, 2005).
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
7
2.3 Hadoop 2.3.1 What is Hadoop HadoopwascreatedbyDougCuttingbetween2003and2006.CuttingwasworkingonanopensourcesearchenginecalledNutchwherehewasinspiredbyGoogle’stechnologiesGFS(GoogleFileSystem)andMapReducethattogethermadeupaplatformforprocessingadataefficientlyonalargescale(White,2010).FromthosetechnologiesCuttingstarteddevelopingHadoopthatquicklybecameaprioritizedprojectwithinApacheopensourcefoundation.YahoowhosoonhiredCuttingstronglysupportedtheprojectwithhelpandresources.ThenameHadoopcomesfromCuttingsson’syellowtoyelephantandisalsowhythelogoforHadooplooksthewayitdoes.By2008,HadoopwasbeingusedbyseverallargecompanieslikeAmazon,Facebooketc.(Turkington,2013).
HadoopisaJavabasedframeworkfordistributeddatacomputationandstoring.It’slinearscalabilityandcheapinfrastructuremakesiteasytomaintainandupgrade,hencemakesitthemainactorforBigDataprocessing.Hadoopismadeupoftwomaintechnologies,HDFSandMapReduce,whicharecompletelydifferentbutcomplementeachother.ByusingHadoop,largesetsofdatacanbecomputedinparallelonallthenodesofthecluster.Thiswayallcomputingpowerandresourcescanbeutilizedefficiently(White,2010).
2.3.2 Truecaller and Hadoop TruecallerstoresallitsdatainHadoop.HadoopprocessesthedatabyMapReducejobsrunninginparalleloveraclusterofdistributednodesusingHDFS(HadoopDistributedFileSystem).MapReduceisaprogrammingmodelthatsplitsthedataprocessesinaquerywheretheMapstagesortsandpairsthedatainkey-valueformatandthenthedataissummarizedandpresentedintheReducephaseseefig.5.
Fig 4. Simplified MapReduce dataflow (White, 2010).
ThemainadvantageofusingHadoopinsteadofarelationaldatabaseforstoringandanalysingBigDataisbecauseofcostandperformance.HadoopisrelativelycheapbecausethenodesarenormalcommoditycomputersthatinteractinHDFS.Itisalsolinearlyscalablewhichmeansthatwhenitisneededtoaddmorecapacityorspace,addingnewnodestotheclustercaneasilysolvetheissue(White,2010).WhenusingaRDMSthedatahastobestoredondiskandinmemory,whichcanbeproblematicwhenitcomestolargeamountsofdatae.g.Petabytes.Therefore,spreadingtheworkloadonmultiplenodeshelpssolvethisissue(Chenetal.,2014).However,therearesomeotherdifferenceswhereHadooplosesagainsttheRDBMSseetable.1.ThetypesofqueriesthataremainlysupposedtobeexecutedinHadoopareMapReducebatchjobs.Hadoopwasnotmeanttobeused
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
8
forinteractiveon-the-flyad-hocqueriesordatadiscovery.RDBMSarebetterforad-hocqueriesandprovideamuchmoreinteractiveuserinterfacethanHadoopwasintendedtodo.
Table 3. RDBMS compared to MapReduce (White, 2010).
InrecentyearsmanynewSQLonHadoopenginesthatletstheuserhaveanSQLinterfaceandlanguagetostructureandquerythedatawhileusingMapReduce,havebeendevelopedtosupportandprovidead-hocquerying.HiveisnowmoreorlessanindustrystandardtoolforSQLonHadoop(White,2010)(Chenetal.,2014).Hiveisusedasthemaindatawarehousetoolattruecaller.
2.3.3 MapReduce MapReduceisaprogrammingmodelusedforprocessingbigdatadistributedandinparallelacrosstheHadoopcluster.MapReduceisoneofthetwomainpartsofHadoopandcanbewritteninmanydifferentprogramminglanguages.Asnotedearlier,theMapReducemodelconsistsoftwodifferentphasesthe;
a. Map,whichtakestheinputdataand,b. Reduce,whichoutputsthefinalresult.
InMapstagetheinputdataisdividedintosplits,whichhaveonemaptaskforeachsplit.Themapfunctionthenrunsforeachrecordinthesplit,andYARN,whichisHadoop’sresourcemanagerschedulesthetasksanddividesthemamongthenodesinthecluster.IfajobfailsYARNseestoitthatitisre-runonanothermachine.HavingtherightsizeofsplitshelpstheperformanceoftheMapReducejob,becausewhenthesizeofthesplitsaretoobigthejobwillbedividedintofewermachinesandwilltakelongertoprocess,whileiftherearemanynodesandthesplitsaresmallthejobcanbedividedamongmanymachineswhichcanprocesstheminparallel(Maitrey&Jha,2015).Thisissomethingtheuserhastosetuprightfortheusedenvironment.Normally,asplitwillbe128MB,likethesizeofaHDFSblock,whichisthedefaultsize.Ablockandasplitisnotthesamething.Ablockisaharddivisionofthedata,whichispre-set.Becausearecordofdatadoesnotalwaysfitinoneblockorcanoverlaptoanotherblock,theMapReducejobusessplitskeeptrackofthedifferentrecordstoprocess.ThenumberofsplitsforaMapReducejobcanthereforenotbepredefinedbutisentirelydependentofthesizeofthedatatobeprocessedinthejob(White,2010).
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
9
Forbestresults,HadooptriestorunthemapphaseonthesamenodeaswheretheinputdataisstoredonHDFS.Thisway,networkbandwidthisnotwasted.IfthatnodeisbusyrunningothermaptasksHadoopchoosesanothernodeintheclusterwiththereplicateddata.Thenameforthatprocessiscalleddatalocalityoptimization.
Asmentioned,foreveryinputsplittherewillbeonemapperandthemapperoutputsakey-valuepairasaresult.Thesekey-valueoutputsarewrittentothelocaldiskonthenodeandnotonHDFS.Thisisbecausetheresultofthemaptaskisonlyfortemporaryuse,andoncethefinalresultofthereducetaskiscompleted,thatdatawillbedeletedfromthelocaldisk.Thereducephasedoesnothavethedatalocalityoptimizationbecauseareducerreceivesdatafrommultiplemapperswhichhavedatafromdifferentnodes.Afterthemapphasethekey-valuepairforeveryrecordisgroupedbythekeywithalistwiththevaluesbeforefilteredtotherightoutputinthereducephase(Turkington,2013).ThisistheprocedureforaMapReducejobwithonlyonereducerseefig.6.Ifdealingwithalargesetofdata,itisadvisabletousemorethanonereducertoavoidperformanceissuesandbottlenecks.
Thephasebetweenmapandreduceiscalledtheshufflephase.Theshufflephaseisveryimportantbecauseitiswherethekey-valuepairsaresortedandmergedtogether,creatingthekeywiththelistofvaluesthatareusedinthereducephasebeforethefinaloutput(White,2010).Itcomesintofocuswhenthejobsareusingmorethanonereducerseefig.7.
Fig 5. MapReduce flow with one reducer (White, 2010).
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
10
Whenusingmorethanonereducer,themapperspartitionsthedata,onepartitionforeveryreducer.Thisway,intheshufflephasewhendataissortedandmergedtherightkey-valueoutputgoestotherightreducer.Withoutthepartitioningtheshufflephasewouldnotknowwhichreducerthedatashouldbesentto,whichwouldresultinhavingsamekeysspreadoutondifferentreducerswhichwouldoutputaresultforeachreducer.Thiswouldresultinanoutputthatwillbeprocessedineffectively,andwouldquitelikelybewrong(Maitrey&Jha,2015).
Fig 6. MapReduce data flow with multiple reducers (White, 2010).
2.3.4 HDFS TheHadoopdistributedfilesystemorHDFSisthesecondandotherimportantpartofHadoop,togetherwithMapReduce.HDFSisthefilesystemusedonaHadoopclusterandisusedforstoringlargesetsofdataonaclusterofdistributednodes.OneofthekeyfeaturesofHDFSisthatitisrunoncommodityhardwarenodes,whichmeansthatisnormalpricedhardwarethatcanbeboughtinmultiplevendors,andnotspeciallydesignedforHadoop.HDFSisdesignedtokeeprunningifonenodefailsbecausethatissomethingthathappensfrequentlyandmustbeaccountedfor.
Thedataprocessingisstreamprocessedwhichmeansitiswrite-once,read-many-times.Thisconceptismeantthatthedatasetsusedarewrittenonceandthenusedforprocessingovertimeandtherefortheimportanceofreadingthedatasetsisbiggerthanlowlatencyfortheoutputresults.Thehighthroughputofthedatacomeswiththecostoflatency.ThelimitforthenumberoffilesstoredinHDFSisdependantofthememorysizeofthemasternode(s),alsoknownasthenamenode.ThenamenodestorestheMetadataofthefilesysteminitsmemory(location,typeetc.)andisusedtolocatethefilesonthedifferentworkernodes(White,2010).
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
11
Thefilesystemworksbyusingnamenodesanddatanodes(master/slave)seefig.8.InitiallythereusedtobeonenamenodeinHDFS,butinlaterversionsofHadoop,theuseofmultiplenamenodeshavebeenintroducedforstabilityandoff-loadpurposes.Todaynormally,thereareprimaryandsecondarynamenode(whichareinstandbymode).ThenamenodemanagesthefilesystemtreeandmetadataandkeepstrackofallthefilesanddirectoriesontheHDFS,whicharestoredontherestofthenodes,whicharethedatanodes.ThelimitationforstoringfilesinHDFSisthememoryonthenamenode,whichholdsthereferencetoallthefilesinthesystem.Aslongasthememoryonthenamenodeisnotfull,nodesanddiskspacecanbeaddedtothecluster,whichmakesitscalable.ThedatanodesaretheworkersintheHDFSsystemandallthedataonHDFSisstoredinblocksonthedatanodes.Assoonasachangeoccurstodataonadatanodethatisfoundinthenamespace,itisrecordedonthenamenode(Maitrey&Jha,2015).Theuseofstoringdatainblockshastheadvantagesthatfileslargerthanthedisksonthenodesdoesnotbecomeaproblem,sinceitcanbedistributedamongmultipleblocksinHDFS.Theblocksarereplicatedanddistributedontheclusterforbackupandfailureconcerns.Thedefaultreplicationfactoris3,whichmeansthatablockwillbereplicatedandspreadthreetimesonthecluster.ThismakesHDFSveryreliablewhenanodefailsorgoesdown(Apache,2016)(Srinivasa&Mupalla,2015).
Fig 7. HDFS Architecture (Srinivasa & Mupalla, 2015)
2.3.5 YARN ApacheYARNor(YetAnotherResourceNegotiator)isHadoop’sresourcemanagementsystemandwasincludedwithHadoop2.0.YARNisusedforbothresourcemanagementacrosstheclusterandalsofornodemanagement,managingtheresourcesforeverysinglenodebyusingcontainers.The
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
12
containersareusedforrunningapplicationsontopofYARNandhavetheirownallocatedresourcessuchasCPUandmemory.ApplicationscanbebuiltonYARNusingdifferentAPIs,andYARNisoftennotsomethingthattheenduserhastobeconcernedabout.ApplicationsthatuseYARNasanintermediatelayerbetweenthemandthestoragelayerarecalledYARNapplicationsseefig.9.ThiswayYARNmakesiteasiertorunapplicationsontheclusterandseestoitthattheresourcesareallocatedright(White,2010).
Fig 8. Applications running on YARN (White, 2010) YARNconsistsof:
a. Theresourcemanagerb. Nodemanager,andc. Theapplicationmaster.
Theresourcemanagerlooksaftertheclustersresourceswhilethenodemanagerhandlesthenodesandcontainers.Theapplicationmaster(oneperapplication)negotiatesresourcesfromtheresourcemanagersoandtogetherwiththenodemanagersithelpstheapplicationrunandcomputetasks.
Theresourcemanageritselfconsistsof:
a. Thescheduler,andb. TheApplicationManager.
Thescheduleristhecomponentthatschedulesandallocatesalltheresourcesneededforrunningapplicationsbasedonwhatresourcesareneededandwhichcanbefoundinthecontainers.ApplicationManagerensuresthattheapplicationgetstherightcontainersfromtheschedulersothattheapplicationcanrun.Whilerunningitisalsoresponsibleforrestartingwhenfailuresoccur,andtotrackandmonitorthestatusoftheapplicationseefig.10(ApacheHadoop,2016).
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
13
Fig 9. YARN components (Apache Hadoop, 2016).
2.3.5.1 Slider SliderisaYARNapplicationdevelopedtorunnon-YARNapplicationsonYARNandtosimplifythisprocess.Withslider,differentusers,usingdifferentinstancesoftheapplication,canrunanapplicationonYARN.Applicationscanalsobestopped,killedandrestartedetc.Anexampleofanon-yarnapplicationthatcanberunwithslideristheHadoopqueryenginePresto(ApacheSliderIncubator,2016)(White,2010).
2.3.6 SQL on Hadoop SQL(StructuredQueryLanguage)hasbeenusedforanalysingdatainrelationaldatabasesforalongtimeandisanindustrywidestandardlanguagewithinBusinessIntelligencetoo.InordertohelpuserswithoutsoftwareengineeringskillstouseHadooptoanalyseandqueryitsdata,differentSQLontopofHadoopsolutionshavebeencreated.Moreandmorecompetingsolutionsemergenowandthentotrytogivetheuserstheeasiestandfastestsolutiononthemarket(Turkington,2013).AccordingtoT.T.MaposaandM.Seth(2015)thefutureofSQLonHadoopisverybrightbecauseofthesimplicityandwideuserrangeofSQLandthepoweroftheHDFS,togetherformingaverysoughtaftersolution(Maposa&Sethi,2015).
2.3.6.1 Hive HiveisanopensourcedatawarehousingsolutionforHadoopandhasanSQLlikequeryinglanguageHiveQL.HivewasdevelopedatFacebookin2007andisaverybroadlyusedtoolthathasbecomeanindustrywidestandard.HivecompilestheHiveQLstatementsandintoMapReducejobsandreturnstheresultstotheuserinaSQLlikemanner.TheHiveQLlanguagehasmanyofthestandardSQL
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
14
functionsanduserscanaddcustomizedfunctions(UDFs)toaddmorefunctionality.Primitives,strings,arrays,mapsandstructsaremanyofthedatatypessupportedinthetables,whichisoneofmanyreasonsforHive’spopularity.Hivecanberunbothinteractivelylikeanormaldbmse.g.MySQLandthroughbatchjobsinthebackgroundorafteraschedule.ThestandardinterfacesarecommandlineandawebGUIcalledhue.
ThedatamodelofhiveisorganizedintableswithacorrespondingdirectoryinHDFS,partitionsthataredividedintosubdirectoriesinthetabledirectory,andbuckets,whicharethenextlevelofpartitioningperpartitionforfasterprocessingofdata.
ThearchitectureofHiveconsistsofseefig.11:
a. Theclientforuserinteractionb. TheHiveServer2(Thrift)serverthatinterpretsthedifferentexternallanguagesagainstHive.c. TheMetastorethatisthesystemcataloguethatstorestheMetadataaboutallthe
databases,tablesandpartitionscreatedbyinHive.d. TheDriverhandlestheHiveQLstatementduringcompilationandexecutionandeverystep
betweentoandfromtheclientgoesthroughtheDriver.e. TheCompilerthattranslatestheHiveQLstatementfromtheDriverintoaplanthatsimply
speakingconsistsoftheMapReducejobsthataresubmittedtothe(Hadoop)executionengine.(Thusooetal.,2009)(White,2010)(Turkington,2013).
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
15
Fig 10. Hive Architecture (Thusoo et al., 2009).
2.3.6.2 Presto PrestoisanSQLenginethatalsowasdevelopedbyFacebookanditsmainpurposeislow-latencyad-hocquerying.PrestousesANSISQLassyntax.ItcanrunonHadoopbutalsoonmanyotherdatabaseengines.WhenconnectedtoHadoop,PrestousestheHiveMetastoretofindthedataonHDFS.ThemaindifferencebetweenPrestoandHiveisthatPrestodoesnotprocessthequeriesusingMapReducebutinsteadallocatesallthetaskstotheworkers(nodes)memory,andschedulesnewtaskstotheworkersonceataskisfinished(Teradata,2016).ThismakesPrestomoresuitableforinteractivequerying,especiallywhentheworkersrunoutofmemorybecauseofthesizeofthedataprocessedinaquery.ThearchitectureisbuiltoftheclientthatsendsthequerytoacoordinatorthatparsestheSQLstatement,plansandschedulestheexecution.Theexecutionisthenpipelinedanddistributedtotheworkerslocatedclosestthedata.Theexecutionrunsmultiplestagessimultaneouslyanddataisstreamedbetweenallthestagesonlyusingmemoryforlow-latencyoutput(Presto,2016).
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
16
Fig 11. Presto Architecture (Teradata, 2016) .
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
17
3. Literature review
3.1 Purpose of literature review TherehavebeenmanySQLonHadoopbenchmarkscarriedout,totrytofigureoutwhichoneofthemanyprojectstogofor.Manyofthesebenchmarksaredonebyspecificvendorstryingtomarkettheirownproduct.Forthatreasontheyarenotalwaystrustworthywhenlookingtoimplementasolutionatanorganisation.Thevaryingresultsoftendependonthetestedinfrastructure,architectureandoptimizationsmadeforthespecificsolutions.Anotherfactorisalsowhenintimethetestswereconducted.Theopensourceprojectsareinconstantdevelopmenttoimproveandbecomemarketleaders,andnewversionsofthesolutionwillthereforebemoreoptimizedthanthepreviousones.Thiswillbeclearerafterreadingtheresultsfromthedifferentbenchmarkingattemptsthatwillbepresented.Mostofthetestedenginesinthebenchmarksbelowwillnotbeincludedinthebenchmarkingattruecallerbutwillshowthediversityofresults.ThereforIthinkitisimportanttoshowthedifferencesintheresultsandhowtheseresultscannotbetheonlysourceforchoosingandtrustingasolutionbyonlyreadingabenchmark.OnethingthatiscommoninthebenchmarksinthischapteristhatthattheyallbeatHiveinperformancebecausetheydon’tuseMapReducefordataprocessing,butmostlyprocessthedatainmemory.IhaveonlychosenPrestofromthemanyenginesthatarementionedinthissectionbecauseitmetalltherequirementsthattruecallerhad.
3.2 Benchmarks 3.2.1 Fast-data-hackathon Theallegro.techblochreportsresultsfromthefast-data-hackatonwhereagroupofuserscomparedvariousSQLonHadoopsolutionsona4nodetestclusterinJune2015.Theyused11differentqueriesintheirtests(AllegroTech,2015,June25).Thetestedsolutionswere:
- HiveonTez- Presto- Impala- Drill- Spark
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
18
Fig 12. fast-data-hackathon results (Treasure Data Blog, 2015, March 20).
IntheresultsaboveImpalaanddrillweretheclearwinnerswhenitcomestoexecutiontimebutaccordingtothegroupconductingtheteststheybothlackedsomefunctionalitythatcouldbefoundinothersolutionssuchasHiveandPresto.
3.2.2 Renmin University In2014attheRenminUniveristyinBeijing,ChinaagroupofresearcherstestedfivedifferentSQLonHadoopsolutionstocomparethemandbecauseofHivebeingthestandardtoolandnotbeingefficientenoughforinteractivequerying.Theyused11differentquerieswithdifferentcomplexityon25,50and100nodeclusters.Thesolutionstestedwere:
- Impala- Presto- Hive(toshowdifference)- Stinger- Shark
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
19
Fig 13. Renmin University results for 100-node cluster (Chen et al., 2014).
Theresultsshownotonlythedifferentresponsetimes,butalsothatsomequerieswerenotabletorunonallsolutionsexceptHive.Accordingtotheteststheresearchersconcludedthatnoneofthesolutionswerematureenoughtobeusedonadailybasis(Chenetal.,2014).
3.2.3 Commercial benchmarks
3.2.3.1 Pivotal (HAWQ) PivotalthatdevelopedtheSQLonHadoopsolutionHAWQbenchmarkedtheirproductagainstCloudera’sImpala,PrestoandStingerona8nodeclusterinJune2015.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
20
Fig 14. Hawq’s speed up ratio vs. Impala (Soliman et al., 2014).
TheresultsinthegraphshowshowHAWQbeatsImpalainthemajorityofthequeriesruninthetestcases(Solimanetal.,2014)(Pivotal,2014,June25).ItisworthhighlightingthatImpalahasbeentheclearwinnerwhenitcomestospeedinallthegraphsabove.
3.2.3.2 Cloudera (Impala) InFebruary2016,ClouderareleasedtheirlatestImpalabenchmarkcomparingImpalatoHiveonTezandSparkona21nodecluster.
Fig 15. Cloudera’s Impala benchmark (Cloudera, 2016, February 11).
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
21
ThegraphaboveisoneofmanygraphsonCloudera’sblogshowinghowImpalabeatsitscompetitorsinallways(Cloudera,2016,February11).Itisworthnotingthatthisisanormalscenariocomingfromthevendorordeveloperstryingtoprovetheefficiencyoftheirproducts.
3.3 Hive not meant for low-latency querying BecausehiveisbasedonHDFSandHadoop,itismeantforbatchjobsfocusingonhighthroughputandscalabilityandcanbeusedforbothanalysisandETL.BecauseofHive’shighfaulttoleranceandwideusability,ithasbecomeandindustrywidestandardasadatawarehousetoolforHadoop(Turkington,2013).Italsothereforpreferredforbeingreliableandstableandotherlow-latencysolutionsareusedonthesideforad-hocquerying(Cloudera,2016).
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
22
4. Research methodology ThisthesiswillbethestartofaprocesstoexploreandtestSQLonHadoopqueryengines.ThescopeofthisthesiswillbetheevaluationofthequeryenginePresto.Prestowillbeevaluatedagainsthiveandthemainfocuswillbeonthelatencytimeoftheusecasequeries,Secondlyitwillalsoevaluatethetypeofusecasesitissuitableforandidentifyingandcomparingtheprosandconsagainstthebaselinesolutionhive.
Themethodologyusedinthisthesiswillbedesign-scienceandthedesign-scienceresearchprocessinterpretedbyPeffersetal.(2007)asthestepsinvolvedaresuitedforidentifying,implementingandevaluatingmychosensolution.Themainfocususingthismethodologywillbeontheevaluationofthesolutionratherthanbuildinganartefact,focuswillbeondeployingandconfiguringPrestothatisanalreadydevelopedopensourcequeryengine.Definingtherightusecaseswillalsobeabigpartofthedesign.
Thestepsindesign-scienceresearchprocessare:
1. Identifyproblem&Motivate.2. DefineObjectivesforsolution.3. Design&Developement.4. Demonstration.5. Evaluation.6. Communication.
Fig. 16 Design Science Research Methodology Peffers, Tuunanen, Rothenberger & Chatterjee, 2007).
4.1 Identify problems & Motivate Thefirststepinthedesign-scienceresearchprocessistoidentifyandmotivatetheproblems.Attruecallerthemainproblemisthelatencyandexecutiontimeofhivewhenrunningad-hocqueries.
Problems:
1. Hive/MapReduceistooslowfordatadiscovery/ad-hocqueries.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
23
Asnotedearlier,theproblemwiththecurrentsetupattruecalleristhatthedatawarehouseisaccessedthroughHiveusingMapReduceonHadoop.Runningcomplexqueriestakesoftenunnecessarylongtimeandthisisbothbadfortheanalyticsteamandforthestakeholdersthatrequestad-hocanalysisonaregularbasis.MapReduceisbetterforbatchjobsthatcanbescheduledforstandardanalysisratherthanad-hocqueryinganddatadiscovery.HiveandMapReducetakestimebecauseoftheprocessingisdonebywritingtheoutputondisk.
4.2 Define Objectives for solution Theobjectivesforthesolutionwillinthiscasebetherequirementsforidentifyingthenewsolution.
Requirements:
Table 4. Requirements
Beforetryingtoidentifythedifferentsolutions,discussionswillbeheldwiththeendusersattheanalyticsteamtogatherrequirementsregardingthesolution.Therearemanyopensourcesolutionsthatcanbetested,andwiththerightrequirements,thelistcanbenarroweddown.
Heretheusecaseswillbedefinedbasedoneverydayad-hocanalysiscarriedoutbytheanalyticsteam.Theusecaseswillincrementallybecomemorecomplextocomparethetimeandresourcesusedbythesolutionswitheachother.AninitialbaselinewheretheusecasesareruninHivewouldbethesecondstepsothatthesolutionscanbecomparedwiththecurrentsituation.AlltheusecaseswillbetestedinsolutionssupportingSQL-syntaxandthequerieswillbethesameorvaryinnativesyntax.(Eg.HiveqldoesnotsupportallstandardSQLfunctionsandSQLcanalsovarydependingontheengine/dialect.)
4.3 Design & Development Inthisstepdesigninganddevelopingandartefactisthecommonpracticeindesign-science.Inthiscase,researchingandIdentifyingthesolutionwillbethepre-stepbeforedeployingthesolutiononfirstatest,andafterontheproductionenvironment(Hadoopcluster).
PrestowasthechosensolutionbecauseitmetalltherequirementsthatwasmentionedaboveandwaseasyenoughtodeployusingtheYARNapplicationversion.
Thefirststepwillbetotestthesolutiononasmallertestenvironmentbeforeimplementingittotheproductionenvironment.ForPresto,thetestenvironmentwillbeadockercontainer,whichisatype
1.Opensource2.SQLinterface3.SupportstoragefiletypeORC4.SupportsamedatatypesasHivee.g.structs,array,Map5.UDF/UDA(UserDefinedFunctions/Aggregations)6.Standardanalyticalfunctions7.Windowfunctions
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
24
ofvirtualenvironment.ThecontainerwillhaveaonenodeHadoopclusterinstalledandreadytoteston.Thetestsetupwasinstallingthesolution,importingadatasetandtestrunningsomerandomqueries.
4.4 Demonstration Thisstepwillbeforpreparingtheevaluationofthesolution.Inthiscasedefiningtheusecasesfortheevaluationisthemaingoalofthisstep.
Theimplementationofthesolutionwillnotbeheavilydocumented,eventhoughabigpartofthetimeinthisprojectwillbespentondeployingandconfiguringthesolutiontothecurrentenvironmentattruecaller.Thetestingintheproductionenvironmentwillbebasedontheusecasesthatwerepre-defined.ThisstepwillshowhowthesolutiondiffersfromHive,andwhattheprosandconsfortheparticularqueriesare.Theusecaseswillberun4times.2timesduringworkhourswhenthereisnormalloadontheclusterand2timesduringtheevening.Theaveragetimeinsecondswillbeusedtoshowtheresultofthelatencytime.
4.5 Evaluation Thisstepistoevaluatetheresultsofthesolution.
Theresultsoftheusecaseswillberecordedandvisualizedusingbarchartstoshowthedifferenceinlatencytimebetweenthedifferentsolutions.Theresultchapterofthethesiswillbemainlyfocusingonthisstepofthemethodology.Otherconclusionsfromtheusecaseswillalsobedrawnfromthebaselineandthenewsolutione.g.prosandconsandsuitableusecases.
4.6 Communication Thethesisitselfrepresentsthisstep.Togetherwiththethesis,theconclusionsandrecommendationsforfurthertestingattruecaller,apresentationatLTUscampuswillbeheldforsupervisorsandotherstudents.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
25
5. Result
5.1 Cluster details TheHadoopclusterattruecallerconsistsof14nodeswithfollowingcomponents:
OS Linux(debian-kernel)3.2.0.4-amd64CPUmodel Intel(R)Xeon(R)[email protected] 32RAM 128GBHDD ~20TB
Table 5. Cluster details
5.2 Implementation Prestowasdeployedontruecaller’sHadoopclusterusingaversionthatallowsittorunasaYARNapplicationusingall14nodes.OnenodesharedasbothCoordinatorandWorkerandtherestasWorkers.Thewholesetupandconfigurationcanbefoundunderappendices7.1Prestoinstallation.
5.3 Use cases and latency in seconds 1.Select*withrestrictionandLimitationtype1.Tablesize14Billionrows.
Hive:~20sec
Presto:~5sec
Conclusion:
Presto4xfaster.
2.Select*withrestrictionandLimitationtype2.Tablesize14Billionrows.
Hive:~40sec
Presto:~7sec
Conclusion:
Presto~6xfaster.
3.Select*withrestrictionandLimitationtype3.Tablesize14Billionrows.
Hive:~200sec
Presto:~15sec
Conclusion:
Presto~13xfaster.
4.Selectcountwithrestriction.Tablesize14Billionrows.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
26
Hive:~90sec
Presto:~23sec
Conclusion:
Presto~4xfaster.
5.Select*withrestriction,limitationandOrderBy.Tablesize14Billionrows.
Hive:~340sec
Presto:RunsoutofmemorybecauseofOrderBy,tooexpensive
Conclusion:
Notsuitableusecaseforpresto.
6.Selectwith2aggregatefunctions.Tablesize14Billionrows.
Hive:~600sec
Presto:~120sec
Conclusion:
Presto~5xfaster.
7.Selectwithaggregates,groupbyandlimitationtype1.Tablesize14Billionrows.
Hive:~120sec
Presto:~40sec
Conclusion:
Presto~3xfaster.
8.Selectwithaggregates,groupbyandlimitationtype2.Tablesize14Billionrows.
Hive:~120sec
Presto:~50sec
Conclusion:
Presto~2,5xfaster.
9.Selectwithmultipleaggregates,groupby,limitationsandJointable3times.Tablesize14Billionrows.
Hive:~6000sec
Presto:~7000sec
Conclusion:
Hive1,2xfaster.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
27
5.4 Result table
Fig 17. Query execution time in seconds
20 5 40 7 200 15 90 23 340 600 120 120 40 120 50 600070000
50
100
150
200
250
300
350
400
450
500hive
presto
hive
presto
hive
presto
hive
presto
hive
presto
hive
presto
hive
presto
hive
presto
hive
presto
q1 q2 q3 q4 q5 q6 q7 q8 q9
Second
s
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
28
6. Discussion
6.1 Presto at truecaller LookingatthequerieswherePrestoperformedfasterthanHive,theaveragegainoflatencytimeisaround5timesfaster.Thisisabitlowerthansomeexamplebenchmarksthatcanbefoundonline.Howeveritisstillabiggainwhenneededtorunqueriesoverandoveragainduringwork.Prestoshowedclearlythatitismoresuitedforreturningfastinsightsonasmallerscaleandissuitedforad-hocanalysisforsomeonewhoisfamiliarwiththedatasetsandknowswhattoandwheretolookfor.Itislesssuitedformoreexpensivequerieswheretheworkerneedstostorewithlargeamountsofmetadata,andriskstorunoutofmemory.Intheconfigurationandtesting,allocating70GBofmaximumquerymemorywasnotenoughforquery5whichcontainedanorderbyoperation,andranoutofmemory.Prestoisalsonotsuitedformultiplejoinsbetweenlargetables,whereitperformedslowerthanHive.Intruecaller’scasePrestoisfineformanyofthecommonday-to-dayad-hocusecases.Attruecallermultipletoolsareusedwithintheanalyticsteamandknowingtheprosandconsofthetoolshelpstheuserstochoosewhichonetogofor.ForsomeETLjobs,Sparkisbeingusedandtherefortheteamwillnotsufferfromaddinganothertooltotheirtoolbox.Prestohassomeminorsyntaxdifferencesinbuttheyareeasytolearnandshouldnotbeseenasanobstacle.AnotherfindingfromthisresearchisthatPrestocanbeconfiguredandtunedonverydetailedlevel.Theresultspresentedinthegraphareafternumerousiterationsofconfigurationsandtuning.Thereisalsothestandardversion,whichrequiresmoreconfigurationsonanoperatingsystemlevel.TheversiontestedattruecallerwastheYARNapplicationversion.ThereforethereismoreroomfortuningandtestingPrestobeforechoosingthefinalsettingsfortruecaller’sHadoopclusterandenvironment.MyrecommendationwouldbeaftertuningandconfiguringPresto,tousePrestoforsomemorelightweightqueriestogetherwithHiveforthebiggerjobs(ETL,batchetc.).
6.2 Use cases TheusecaseschosenfortheevaluationarebasedonworkIdidfortruecalleronthesideofthethesiswork.Thequeriesusedareactualad-hocrequestsfortheanalyticsteam,whichItookupontolearnmoreaboutthedatathatwasanalysedonadailybasis.ThatwayIgotagoodpictureaboutwhattypesofquerieswereoftenrunandwillberuninanewsolutionifimplemented.
6.3 Limitations Thegoalfromthebeginningwastoevaluate2-3solutionsandcomparethemtoHivelikeIdidwithPresto.Duetosomeresourcelimitationsandthecomplexityoftheproject,Icouldnotproceedwithtestingmoresolutions.Thishadmostlytodowithmenotriskingcompromisingthedailyworkbytryingtoimplementnewsolutionsontheproductionclusterwithoutseniorsupervision.
6.4 Commercial alternatives Thereasontotestopensourcesolutionswasasearliermentionedtokeepthecostsdown.ThereareotherwisecommercialalternativesthatperformverywellandrunstablewithHadoop.AnexampleisExasolanin-memoryanalyticsdatabase.IhavetestedandworkedonExasolinotherprojectsandwouldrecommenditasanlow-latencysolutionforcompaniesthatcanconsiderpayingforsuchasolution.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
29
6.5 Further research TuningandconfiguringPrestowithprofessionalhelpwouldbeafirststeptoseehowmuchfasterPrestocouldbecome.Itwouldalsobeinterestingtoseeifthereareanygainsregardingheavierqueriesandjobscomparedtothecurrentconfigurationsattruecaller.AnextstepwouldbetoseeifgainsinexecutiontimewouldbepossiblebytryingtotunetheparametersoftheORCfile.Thiswasanotherintentiontotestduringthisresearchbutcouldnotbedoneduetotimelimitations.
ApacheDrillandApacheTajoaretheothertwosolutionsthatweremeanttobetestedandevaluatedinthisthesiswork.Tajowasactuallyimplementedbutdidnotsupportsomeoftherequiredfiletypesanddatatypes.IhavebeeninclosecontactwiththedevelopersofTajoanditshouldmeetalloftruecaller’srequirementsinthenextrelease,whichwasinitiallyplannedforsometimeinApril2016,butwasdelayed.Afterthereleaseandifthereistimeandresources,lookingintobothofthesesolutionscouldbeofinterestfortruecaller.
Ifthereareadditionalresourcesandtime,IwouldrecommendtestingImpalabychangingfileformatsfromORCtoParquetanddorunthesametypeoftests.Becauseoftheresultsfrommostofthebenchmarksreadduringthisresearch,ImpalaisproventobeastrongcandidateforalightweightSQLonHadooptool.Theonlyproblemisthedifferenceinthefiletypes(ORCvs.Parquet)thattruecallerhaschosentostore.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
30
7. Conclusion TherearemanySQLonHadoopsolutionsonthemarkettoday.PrestoisonoftheknownnamesofopensourceSQLonHadoopsolutionsandwaschosenasthetooltobetestedataftermeetingalltherequirements.SinceonlyPrestowastested,duetolimitations,itcannotbesaidthatitistheonlyand/orperfectsolutionfortruecaller’scase.Howeverithasansweredtheresearchquestionsofthisthesisbyprovingtobeonaverage5timesfasterthanHive.Prestowasprovenworkbestforlighterad-hocanalysisrunningonsmallerdatasetsorsampledatatakenformlargersetstoproveapoint.Prestodidnotworkwellwithlargerdatasetsandheavierqueriescontainingmultiplejoins/selfjoins.Prestocanbetunedandoptimizedtofittheparticularenvironmenttoperformatitsbest.Theuse-casesandalltestingweredoneattruecallerinStockholmandtheresultsarealldependingontheconfigurationandhardwareofthetruecaller’sHadoopcluster.Prestoisnowinusebytheanalyticsteamattruecallerforspecificad-hocusecasesandwillnotpermanentlyreplaceHiveoranyothertoolatthecompany.User’sthatknowthedatawillbeabletomaketherightdecisionforwhentoandnottousePresto.UsingPrestofortherightuse-caseswillsavetruecaller’sanalyticsteamtimeandwilladdefficiencytotheirwork,whichiswasthemainpurposeforthemtowanttotestanothersolution.BydeployingPrestoattruecaller,thisgoalhasbeenreachedfornow.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
31
8. References
8.1 Books and articles • White,T.(2010).Hadoop:thedefinitiveguide.(2nded.)Farnham:O'Reilly.• Chen,Y.,Qin,X.,Bian,H.,Chen,J.,Dong,Z.,Du,X.,&Zhang,H.(2014).AStudyofSQL-on-
HadoopSystems.LectureNotesInComputerScience,(8807),154-166.• Walker,R.(2015).FromBigDatatoBigProfits:SuccesswithDataandAnalytics.Oxford
UniversityPress.• Gandomi,A.,&Haider,M.(2015).Beyondthehype:Bigdataconcepts,methods,and
analytics.InternationalJournalOfInformationManagement,35137-144.doi:10.1016/j.ijinfomgt.2014.10.007
• NationalInstituteofStandardsandTechnology.(2015)BigDataInteroperabilityFramework:Volume1,DefinitionsU.S.DepartmentofCommerce.
• Turkington,G.(2013).HadoopBeginner'sGuide.Birmingham:PacktPublishing.• Cervone,F.(2015).BIGDATAANDANALYTICS.OnlineSearcher,39(6),22-27.• Maitrey,S.,&Jha,C.(2015).MapReduce:SimplifiedDataAnalysisofBigData.Procedia
ComputerScience,57(3rdInternationalConferenceonRecentTrendsinComputing2015(ICRTC-2015),563-571.doi:10.1016/j.procs.2015.07.392
• Jukić,N.,Sharma,A.,Nestorov,S.,&Jukić,B.(2015).AugmentingDataWarehouseswithBigData.InformationSystemsManagement,32(3),200-209.doi:10.1080/10580530.2015.1044338
• Thusoo,A.,Sarma,J.,Jain,N.,Shao,Z.,Chakka,P.,Anthony,S.,&...Murthy,R.(2009).Hive-Awarehousingsolutionoveramap-reduceframework.ProceedingsOfTheVLDBEndowment,2(2),1626-1629.
• Connolly,T.M.&Begg,C.E.(2005).Databasesystems:apracticalapproachtodesign,implementationandmanagement.(4.,[rev.]ed.)Harlow:Addison-Wesley.
• Maposa,T.T.,Sethi,M.(2015).SQL-on-Hadoop:TheMostProbableFutureinBigDataAnalytics.AdvancesinComputerScienceandInformationTechnology(ACSIT),2(13),9-14.
• Srinivasa,K.G.,Mupalla,A.K.(2015).GuidetoHighPerformanceDistributedComputing,casestudieswithHadoop,ScaldingandSpark,Springer
• Soliman,M.,Petropoulos,M.,Waas,F.,Narayanan,S.,Krikellas,K.,&Baldwin,R.etal.(2014).Orca.ProceedingsOfThe2014ACMSIGMODInternationalConferenceOnManagementOfData-SIGMOD'14,337-348.http://dx.doi.org/10.1145/2588555.2595637
• Choudhury,K.(2015,October12).Truecaller'sIndiauserbasegrowsto100mn,BusinessStandard,NewDelhi.Retrievedfromhttp://www.business-standard.com/article/companies/truecaller-s-india-user-base-grows-to-100-mn-115101200037_1.html
• Rukmini,S.(2014,December13).Indialosingwaronspam,TheHindu.Retrievedfromhttp://www.thehindu.com/sci-tech/technology/10-lakh-complaints-about-spam-calls-in-the-last-three-years/article6689376.ece
• Peffers,K.,Tuunanen,T.,Rothenberger,M.A.,&Chatterjee,S.(2007).ADesignScienceResearchMethodologyforInformationSystemsResearch.JournalOfManagementInformationSystems,24(3),45-77.
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
32
8.2 Websites and blog posts • Truecaller.(2016).MAKINGYOURPHONESMARTER.Retrievedfrom
https://www.truecaller.com/about• Apache.(2016).HadoopWiki,Projectdescription.Retrievedfrom
http://wiki.apache.org/hadoop/ProjectDescription• ApacheHadoop.(2016).ApacheHadoopYARN.Retrievedfrom
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html• ApacheSliderIncubator.(2016).ApacheSliderArchitecture.Retrievedfrom
https://slider.incubator.apache.org/design/architecture.html• Teradata.(2016).Presto,aFaster,MoreScalable,MoreFlexibleOpenSourceSQLonHadoop
Engine.Retrievedfromhttp://assets.teradata.com/resourceCenter/downloads/Datasheets/EB8901.pdf?processed=1
• Presto.(2016).Overview.Retrievedfromhttps://prestodb.io/overview.html• Cloudera.(2016).Whatishive?Retrievedfrom
https://archive.cloudera.com/cdh4/cdh/4/hive/• AllegroTech.(2015,June10).FastDataHackathon[Blogpost].Retrievedfrom
http://allegro.tech/2015/06/fast-data-hackathon.html• Pivotal.(2014,June25).PivotalHAWQBenchmarkDemonstratesUpTo21xFaster
PerformanceonHadoopQueriesThanSQL-likeSolutions[Blogpost].Retrievedfromhttps://blog.pivotal.io/big-data-pivotal/products/pivotal-hawq-benchmark-demonstrates-up-to-21x-faster-performance-on-hadoop-queries-than-sql-like-solutions
• Cloudera.(2016,February11).NewSQLBenchmarks:ApacheImpala(incubating)UniquelyDeliversAnalyticDatabasePerformance[Blogpost].Retrievedfromhttps://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/
• Fishman,J.(2014,July14).AppleSolarFrenzy:BenefitingShareholdersandthePlanetAlike[Blogpost].Retrievedfromhttp://solarstockideas.com/apple-solar-frenzy-benefiting-shareholders-planet-alike/
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
33
9. Appendices
9.1 Appendix A – Presto installation
Guide to install presto on hadoop cluster.
On docker container with cloudera one node cluster for test: Prerequisits:
§ HDFS§ YARN§ ZooKeeper§ Java§ Maven
Step 1. Install Slider to run presto application on yarn:
1. Create yarn user on containers hdfs with:
hdfs dfs -mkdir -p /user/yarn hdfs dfs -chown yarn:yarn /user/yarn
1. Downloadslider0.80.0foryarn(http://apache.mirrors.spacedump.net/incubator/slider/0.80.0-incubating/slider-assembly-0.80.0-incubating-all.tar.gz)
2. Untarfilewithcommand:tar -xvf slider-0.80.0-incubating-all.tar.gz
3. Opennewdirectoryandgotoslider-assembly/targetdirectoryandcopy0.80.0-incubating-all.ziptoanewfolderwhereyouwanttoinstallslider.
4. Unzipfileinfolder,thesliderfolderwillnowbecreated.5. Gotosliderfolder/confandopenfileslider-env.shsetenvvariables:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/ export HADOOP_CONF_DIR=/etc/hadoop/conf
6. Open file slider-client.xml change set values:
<property> <name>hadoop.registry.zk.quorum</name> <value>localhost:2181</value> </property> <property> <name>yarn.resourcemanager.address</name> <value>localhost:8032</value> </property> <property> <name>yarn.resourcemanager.scheduler.address</name> <value>localhost:8030</value> </property> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:8020</value> </property>
7. Test slider with command :
${slider-install-dir}/slider-0.80.0-incubating/bin/slider version
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
34
If you get output "Compiled against Hadoop 2.6.0", slider is installed right.
Step 2. Install presto:
1. Download presto-yarn package from github:(gitclonehttps://github.com/prestodb/presto-yarn.git)
2. Buildpackagewithmaven:mvn clean package
3. Go to folder created and folder presto-yarn-package/target and copy presto-yarn-package-1.0.0-SNAPSHOT-0.130.zip into slider folder.
4. Gobacktopresto-yarn-packagefolderandto/src/main/resourcesandcopyfilesappConfig.jsonandresources-singelnode.jsonintosliderfolder.
5. Insliderolderrenameresources-singlenode.jsontoresources.jsonandopenitandeditsettings:{"schema":"http://example.org/specification/v2.0.0","metadata":{},"global":{"yarn.vcores":"1"},"components":{"slider-appmaster":{},"COORDINATOR":{"yarn.role.priority":"1","yarn.component.instances":"1","yarn.memory":"1500"}}}
6. OpenappConfig.jsonandeditsettings:{"schema":"http://example.org/specification/v2.0.0","metadata":{},"global":{"site.global.app_user":"yarn","site.global.user_group":"hadoop","site.global.data_dir":"/var/lib/presto/data","site.global.config_dir":"/var/lib/presto/etc","site.global.app_name":"presto-server-0.130","site.global.app_pkg_plugin":"${AGENT_WORK_ROOT}/app/definition/package/plugins/","site.global.singlenode":"true","site.global.coordinator_host":"${COORDINATOR_HOST}","site.global.presto_query_max_memory":"50GB","site.global.presto_query_max_memory_per_node":"512MB","site.global.presto_server_port":"8080",
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
35
"site.global.catalog":"{'hive':['connector.name=hive-cdh5','hive.metastore.uri=thrift://${NN_HOST}:9083'],'tpch':['connector.name=tpch']}","site.global.jvm_args":"['-server','-Xmx1024M','-XX:+UseG1GC','-XX:G1HeapRegionSize=32M','-XX:+UseGCOverheadLimit','-XX:+ExplicitGCInvokesConcurrent','-XX:+HeapDumpOnOutOfMemoryError','-XX:OnOutOfMemoryError=kill-9%p']","application.def":".slider/package/PRESTO/presto-yarn-package-1.1-SNAPSHOT-0.130.zip","java_home":"/usr/lib/jvm/java-8-oracle/"},"components":{"slider-appmaster":{"jvm.heapsize":"128M"}}}Testdifferentsettingsforenvironmentnodememory1024MBwastoomuchforusandwechangedto:site.global.presto_query_max_memory_per_node":"512MB"
7. Insliderfolderchangeusertoyarn(suyarn)andruncommandtoinstallpresto:
bin/slider package --install --name PRESTO --package presto-yarn-package-*.zip
8. Run presto with command:
bin/slider create presto1 --template appConfig.json --resources resources.json
If exited with status 0 presto app shoul now be running.
9. Go to YARN resourcemanager web UI to check if presto is running (http://science2.truecaller.net:8088) : (See first row application presto1 running).
10. Create new folder and download presto client jar file (https://repository.sonatype.org/service/local/artifact/maven/content?r=central-proxy&g=com.facebook.presto&a=presto-cli&v=RELEASE)
11. Rename jar file to to only presto (mv *.jar presto). Run command to make it executable:
chmod +x presto
12. To test presto go to hive client and create a table with data. Now go back to folder with presto as yarn user (su yarn) and type command to run presto:
./presto --server localhost:8080 --catalog system --schema default (or which schema wanted to use)
type show tables; if table is showing, tryout a query.
On hdp3 cluster:
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
36
Step 1. Install Slider to run presto application on yarn:
1. Create yarn user on containers hdfs with:
hdfs dfs -mkdir -p /user/yarn hdfs dfs -chown yarn:yarn /user/yarn
1. Downloadslider0.80.0foryarn(http://apache.mirrors.spacedump.net/incubator/slider/0.80.0-incubating/slider-assembly-0.80.0-incubating-all.tar.gz)
2. Untarfilewithcommand:tar -xvf slider-0.80.0-incubating-all.tar.gz
3. Opennewdirectoryandgotoslider-assembly/targetdirectoryandcopy0.80.0-incubating-all.ziptoanewfolderwhereyouwanttoinstallslider.
4. Unzipfileinfolder,thesliderfolderwillnowbecreated.5. Gotosliderfolder/confandopenfileslider-env.shsetenvvariables:
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-x64/ export HADOOP_CONF_DIR=/etc/hadoop/conf
6. Open file slider-client.xml change set values:
<property> <name>hadoop.registry.zk.quorum</name> <value>---------------- </property> <property> <name>yarn.resourcemanager.address</name> <value>-----</value> </property> <property> <name>yarn.resourcemanager.scheduler.address</name> <value> </value> </property> <property> <name>fs.defaultFS</name> <value>hdfs---8020</value> </property>
7. Test slider with command :
${slider-install-dir}/slider-0.80.0-incubating/bin/slider version
If you get output "Compiled against Hadoop 2.6.0", slider is installed right.
Step 2. Install presto:
1. Download presto-yarn package from github:(gitclonehttps://github.com/prestodb/presto-yarn.git)
2. Buildpackagewithmaven:mvn clean package
3. Go to folder created and folder presto-yarn-package/target and copy presto-yarn-package-1.0.0-SNAPSHOT-0.130.zip into slider folder.
4. Gobacktopresto-yarn-packagefolderandto/src/main/resourcesandcopyfilesappConfig.jsonandresources-singelnode.jsonintosliderfolder.
5. Insliderolderrenameresources-singlenode.jsontoresources.jsonandopenitandeditsettings:{"schema":"http://example.org/specification/v2.0.0","metadata":{},
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
37
"global":{"yarn.vcores":"8"},"components":{"slider-appmaster":{},"COORDINATOR":{"yarn.role.priority":"1","yarn.component.instances":"1","yarn.component.placement.policy":"1","yarn.memory":"4000"},"WORKER":{"yarn.role.priority":"2","yarn.component.instances":"13","yarn.component.placement.policy":"1","yarn.memory":"8000"}}
6. OpenappConfig.jsonandeditsettings:{"schema":"http://example.org/specification/v2.0.0","metadata":{},"global":{"site.global.app_user":"yarn","site.global.user_group":"hadoop","site.global.data_dir":"/var/lib/presto/data","site.global.config_dir":"/var/lib/presto/etc","site.global.app_name":"presto-server-0.142","site.global.app_pkg_plugin":"${AGENT_WORK_ROOT}/app/definition/package/plugins/","site.global.singlenode":"true","site.global.coordinator_host":"${COORDINATOR_HOST}","site.global.presto_query_max_memory":"50GB","site.global.presto_query_max_memory_per_node":"4GB","site.global.presto_server_port":"8080","site.global.catalog":"{'hive':['connector.name=hive-cdh5','hive.metastore.uri=thrift://hdp3.truecaller.net:9083'],'tpch':['connector.name=tpch']}","site.global.jvm_args":"['-server','-Xmx8192M','-XX:+UseG1GC','-XX:G1HeapRegionSize=32M','-XX:+UseGCOverheadLimit','-XX:+ExplicitGCInvokesConcurrent','-XX:+HeapDumpOnOutOfMemoryError','-XX:OnOutOfMemoryError=kill-9%p']","application.def":".slider/package/PRESTO/presto-yarn-package-1.1-SNAPSHOT-0.142.zip","java_home":"/usr/lib/jvm/jdk-8-oracle-x64/"},"components":{"slider-appmaster":{"jvm.heapsize":"128M"
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
38
}}}
7. Insliderfolderchangeusertoyarn(suyarn)andruncommandtoinstallpresto:
bin/slider package --install --name PRESTO --package presto-yarn-package-*.zip
8. Run presto with command:
bin/slider create presto1 --template appConfig.json --resources resources.json
If exited with status 0 presto app shoul now be running.
9. Go to YARN resourcemanager web UI to check if presto is running (See first row application presto1 running).
10. Create new folder and download presto client jar file (https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.142/presto-cli-0.142-executable.jar)
11. Rename jar file to to only presto (mv *.jar presto). Run command to make it executable:
chmod +x presto
12. To test presto go to hive client and create a table with data. Go to the resource manager web interface.find presto1 under running applications and find URL to coordinator node.
13. Now go back to folder with presto as yarn user (su yarn) and type command to run presto:
./presto --server h[URL to coordinator]:8080 --catalog hive --schema default (or which schema wanted to use)
typeshowtables;iftableisshowing,tryoutaquery
9.2 Appendix B – Use case queries
1.SELECT*withrestrictionANDLIMIT1:
SELECT*
FROMapp_events
WHEREUPPER(user.country_code)='IN'
ANDdt=20160221
ANDh=02
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
39
LIMIT10000;
2.SELECT*withrestrictionANDLIMIT2:
SELECT*
FROMapp_events
WHEREdt=20160221
ANDappsearch.typeIN('1','2')
LIMIT10000;
3.SELECT*withrestrictionANDLIMIT3:
SELECT*FROMapp_events
WHEREuser.register_id=2107422
ANDappsearch.type='2'
ANDdtbetween20160123AND20160125
LIMIT50;
4.SELECTCOUNT(*)withrestriction
SELECTCOUNT(*)FROMapp_events
WHEREupper(user.country_code)='IN'
ANDappsearch.type='2'
ANDdt=20160123;
5.SELECT*withrestrictionLIMITANDORDERBY:
SELECT*
FROMapp_events
WHEREdt=20160221
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
40
ANDuser.country_codeISNOTNULL
ANDappsearch.typeIN('1','2')
ORDERBYUPPER(user.country_code)
LIMIT10000;
6.SELECT*withrestrictionLIMITANDSORTBY:
SELECT*
FROMapp_events
WHEREdt=20160221
ANDh=02
ANDuser.country_codeISNOTNULL
ANDappsearch.typeIN('1','2')
SORTBYUPPER(user.country_code);
7.SimpleSELECTwithaggregates:
SELECT
SUM(appsearchv2.attempts[0].latency)sum_latency,
AVG(appsearchv2.attempts[0].latency)avg_latency
FROMdata.app_events
WHEREsize(appsearchv2.attempts)>0
ANDdtbetween20160130AND20160202
ANDupper(user.country_code)='IN'
ANDappsearchv2.attempts[0].endpointnotlike'%localhost%';
8.SELECTwithaggregatesANDGROUPBY:
SELECTuser.register_id,
appsearch.searchrequest.normalizedphonenumberasincoming_nr,
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
41
SUM(if(appsearch.type=2,1,0))incoming_calls,
SUM(if(appsearch.type=5,1,0))sms
FROMdata.app_events
WHEREdt=20160121
ANDappsearch.searchrequest.searchcountrycode='SE'
ANDappsearch.searchrequest.normalizedphonenumberISnotNULL
GROUPBYuser.register_id,
appsearch.searchrequest.normalizedphonenumber
LIMIT10000;
9.SELECTwithaggregateswithgroupANDorderby:
SELECTuser.register_id,
appsearch.searchrequest.normalizedphonenumberasincoming_nr,
SUM(if(appsearch.type=2,1,0))incoming_calls,
SUM(if(appsearch.type=5,1,0))sms
FROMdata.app_events
WHEREdt=20160121
ANDappsearch.searchrequest.searchcountrycode='SE'
ANDappsearch.searchrequest.normalizedphonenumberISnotNULL
GROUPBYuser.register_id,
appsearch.searchrequest.normalizedphonenumber
orderbyuser.register_id,
incoming_nr
LIMIT10000;
10.SELECTFROMinnerquerywithinnerGROUPBYANDouterorderby:
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
42
SELECTregister_id,
incoming_nr,
incoming_calls,
sms
FROM(
SELECTuser.register_id,
appsearch.searchrequest.normalizedphonenumberasincoming_nr,
SUM(if(appsearch.type=2,1,0))incoming_calls,
SUM(if(appsearch.type=5,1,0))sms
FROMdata.app_events
WHEREdt=20160125
ANDappsearch.searchrequest.searchcountrycode='SE'
ANDappsearch.searchrequest.normalizedphonenumberISnotNULL
GROUPBYuser.register_id,
appsearch.searchrequest.normalizedphonenumber)x
WHEREincoming_calls>0ORsms>0
orderbyregister_id
LIMIT10000;
11.JOINWITH2INNERQUERIESWITHAGGREGATESANDORDERBY:
SELECT
a.city,
COUNT(DISTINCTa.registerid)nr_IOS_profiles
FROM(
SELECT
LOWER(city)city,
registerid
FROMdata.user_profile_data
WHERELOWER(city)<>''
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
43
ANDdt=20160221
ANDregisterid<>0)a
INNERJOIN(
SELECT
id
FROMuserprimaryserviceprovider
WHERELOWER(os)LIKE'iphone%'
ANDdt=20160221)b
ONb.id=a.registerid
GROUPBYa.city
ORDERBYnr_IOS_profilesDESC
LIMIT10;
12.3joinswithaggregatesANDsortby:
SELECT
FROM_UNIXTIME(UNIX_TIMESTAMP()-86400,'yyyyMMdd')ASdt,
a.countrycode,
a.incoming_phonenr,
COALESCE(c.incoming_calls_24h,0L)incoming_calls_24h,
COALESCE(c.sms_24h,0L)sms_24h,
COALESCE(b.incoming_calls_3d,0L)incoming_calls_3d,
COALESCE(b.sms_3d,0L)sms_3d,
a.incoming_calls_7d,
a.sms_7d
FROM
(SELECT
countrycode,
incoming_phonenr,
incoming_calls_7d,
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
44
sms_7d
FROM(
SELECT
appsearch.searchrequest.searchcountrycodeascountrycode,
appsearch.searchrequest.normalizedphonenumberASincoming_phonenr,
SUM(if(appsearch.type=2,1,0))incoming_calls_7d,
SUM(if(appsearch.type=5,1,0))sms_7d
FROMdata.app_events
WHEREdt>=20160101ANDdt<=20160107
ANDupper(appsearch.searchrequest.searchcountrycode)IN('SE','US')
ANDappsearch.searchrequest.normalizedphonenumberISNOTNULL
GROUPBY
appsearch.searchrequest.searchcountrycode,
appsearch.searchrequest.normalizedphonenumber)x
WHEREincoming_calls_7d>0ORsms_7d>0
SORTBYcountrycode)a
LEFTJOIN
(SELECT
countrycode,
incoming_phonenr,
incoming_calls_3d,
sms_3d
FROM(
SELECT
appsearch.searchrequest.searchcountrycodeascountrycode,
appsearch.searchrequest.normalizedphonenumberASincoming_phonenr,
SUM(if(appsearch.type=2,1,0))incoming_calls_3d,
SUM(if(appsearch.type=5,1,0))sms_3d
FROMdata.app_events
WHEREdt>=20160101ANDdt<=20160103
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
45
ANDupper(appsearch.searchrequest.searchcountrycode)IN('SE','US')
ANDappsearch.searchrequest.normalizedphonenumberISNOTNULL
GROUPBY
appsearch.searchrequest.searchcountrycode,
appsearch.searchrequest.normalizedphonenumber)x
WHEREincoming_calls_3d>0ORsms_3d>0
SORTBYcountrycode)b
ONa.countrycode=b.countrycode
ANDa.incoming_phonenr=b.incoming_phonenr
LEFTJOIN
(SELECT
countrycode,
incoming_phonenr,
incoming_calls_24h,
sms_24h
FROM(
SELECT
appsearch.searchrequest.searchcountrycodeascountrycode,
appsearch.searchrequest.normalizedphonenumberASincoming_phonenr,
SUM(if(appsearch.type=2,1,0))incoming_calls_24h,
SUM(if(appsearch.type=5,1,0))sms_24h
FROMdata.app_events
WHEREdt=20160101
ANDupper(appsearch.searchrequest.searchcountrycode)IN('SE','US')
ANDappsearch.searchrequest.normalizedphonenumberISNOTNULL
GROUPBY
appsearch.searchrequest.searchcountrycode,
appsearch.searchrequest.normalizedphonenumber)x
WHEREincoming_calls_24h>0ORsms_24h>0
SORTBYcountrycode)c
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
46
ONb.countrycode=c.countrycode
ANDb.incoming_phonenr=c.incoming_phonenr
SORTBYcountrycode
;
13.Createtableoutof3joinsFROMlargetablewithmultipleaggregatesANDGROUPBY.
CREATETABLEsahir.nr_search_events_30days_v2AS
SELECT
a.dt,
a.country_code,
COALESCE(b.hits,0L)AStotal_result_hits,
COALESCE(a.hits,0L)AS1_result_hits,
COALESCE(c.hits,0L)AS0_result_hits
FROM
(SELECT
dt,
LOWER(user.country_code)country_code,
COUNT(*)hits
FROMdata.app_events
WHEREdtBETWEEN20160201AND20160301
ANDSIZE(search.sources)>0
ANDLOWER(search.sources[0])LIKE'%ugc%'
ANDsearch.result_size=1
GROUPBY
dt,
LOWER(user.country_code))a
LEFTJOIN
(SELECT
dt,
SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller
47
LOWER(user.country_code)country_code,
COUNT(*)hits
FROMdata.app_events
WHEREdtBETWEEN20160201AND20160301
ANDSIZE(search.sources)>0
ANDsearch.result_size>0
GROUPBY
dt,
LOWER(user.country_code))b
ONa.dt=b.dt
ANDLOWER(a.country_code)=LOWER(b.country_code)
LEFTJOIN
(SELECT
dt,
LOWER(user.country_code)country_code,
COUNT(*)hits
FROMdata.app_events
WHEREdtBETWEEN20160201AND20160301
ANDSIZE(search.sources)>0
ANDsearch.result_size=0
GROUPBY
dt,
LOWER(user.country_code))c
ONb.dt=c.dt
ANDLOWER(b.country_code)=LOWER(c.country_code)