BACHELOR THESIS -...

BACHELOR THESIS

Evaluating Presto as an SQL on HadoopSolution

A Case at Truecaller

Sahir Ahmed2016

Bachelor of ArtsSystems Science

Luleå University of TechnologyDepartment of Computer science, Electrical and Space engineering

SahirAhmed 2016-06-15D0032N–BachelorThesisEvaluatingPrestoasanSQLonHadoopsolution–acaseattruecaller

EvaluatingPrestoasanSQLonHadoopsolution

acaseattruecaller

SahirAhmed–VT2016


Abstract Truecallerisamobileapplicationwithover200millionuniqueusersworldwide.Everydaytruecallerstoresover1billionrowsofdatathattheyusetoanalyseforimprovingtheirproduct.ThedataisstoredinHadoop,whichisaframeworkforstoringandanalysinglargeamountsofdataonadistributedfilesystem.Inordertobeabletoanalysetheselargeamountsofdatatheanalyticsteamneedsanewsolutionformorelightweight,ad-hocanalysis.ThisthesisevaluatestheperformanceofthequeryenginePrestotoseeifitmeetstherequirementstohelpthedataanalyticsteamattruecallergainefficiency.Byusingadesign-sciencemethodology,Presto’sprosandconsarepresented.Prestoisrecommendedasasolutiontobeusedtogetherwiththetoolstodayforspecificlightweightusecasesforusersthatarefamiliarwiththedatasetsusedbytheanalyticsteam.Othersolutionsforfutureevaluationarealsorecommendedbeforetakingafinaldecision.

Keywords:Hadoop,BigData,Presto,Hive,SQLonHadoop


Table of contents Abstract.............................................................................................................................2

Tableofcontents...............................................................................................................3

Tableoffigures..................................................................................................................5

Tableoftables...................................................................................................................6

Glossary.............................................................................................................................7

1. Introduction................................................................................................................11.1Problemstatement................................................................................................................1

1.1.1Problemdescription...............................................................................................................11.1.2Researchquestions................................................................................................................11.1.3Purpose..................................................................................................................................1

1.2Motivationsforresearch........................................................................................................21.3Delimitations.........................................................................................................................2

2. Background.................................................................................................................32.1Truecaller...............................................................................................................................32.2BigData.................................................................................................................................3

2.2.1WhatisBigData?...................................................................................................................32.2.2DataAnalytics.........................................................................................................................52.2.3DataWarehouse.....................................................................................................................5

2.3Hadoop..................................................................................................................................72.3.1WhatisHadoop......................................................................................................................72.3.2TruecallerandHadoop...........................................................................................................72.3.3MapReduce............................................................................................................................82.3.4HDFS.....................................................................................................................................102.3.5YARN.....................................................................................................................................11

2.3.5.1Slider.............................................................................................................................................132.3.6SQLonHadoop.....................................................................................................................13

2.3.6.1Hive..............................................................................................................................................132.3.6.2Presto...........................................................................................................................................15

3. Literaturereview.......................................................................................................173.1Purposeofliteraturereview................................................................................................173.2Benchmarks.........................................................................................................................17

3.2.1Fast-data-hackathon............................................................................................................173.2.2RenminUniversity................................................................................................................183.2.3Commercialbenchmarks......................................................................................................19

3.2.3.1Pivotal(HAWQ)............................................................................................................................193.2.3.2Cloudera(Impala).........................................................................................................................20

3.3Hivenotmeantforlow-latencyquerying.............................................................................21

4. Researchmethodology..............................................................................................224.1Identifyproblems&Motivate..............................................................................................224.2DefineObjectivesforsolution..............................................................................................234.3Design&Development........................................................................................................23


4.4Demonstration.....................................................................................................................244.5Evaluation............................................................................................................................244.6Communication...................................................................................................................24

5. Result........................................................................................................................255.1Clusterdetails......................................................................................................................255.2Implementation...................................................................................................................255.3Usecasesandlatencyinseconds.........................................................................................255.4Resulttable..........................................................................................................................27

6. Discussion..................................................................................................................286.1Prestoattruecaller..............................................................................................................286.2Usecases.............................................................................................................................286.3Limitations...........................................................................................................................286.4Commercialalternatives......................................................................................................286.5Furtherresearch..................................................................................................................29

7. Conclusion.................................................................................................................30

8. References.................................................................................................................318.1Booksandarticles................................................................................................................318.2Websitesandblogposts......................................................................................................32

9. Appendices................................................................................................................339.1AppendixA–Prestoinstallation..........................................................................................339.2AppendixB–Usecasequeries.............................................................................................38


Table of figures Fig1.Exponentialgrowthofdata(Fishman,2014,July14).................................................4

Fig2.TheBigdataprocessandsubprocesses(Gandomi&Haider,2015)...........................5

Fig3.Starschema(Connolly&Begg,2005)........................................................................6

Fig4.SimplifiedMapReducedataflow(White,2010).........................................................7

Fig5.MapReduceflowwithonereducer(White,2010).....................................................9

Fig6.MapReducedataflowwithmultiplereducers(White,2010)...................................10

Fig7.HDFSArchitecture(Srinivasa&Mupalla,2015)........................................................11

Fig8.ApplicationsrunningonYARN(White,2010)...........................................................12

Fig9.YARNcomponents(ApacheHadoop,2016)..............................................................13

Fig10.HiveArchitecture(Thusooetal.,2009)..................................................................15

Fig11.PrestoArchitecture(Teradata,2016).....................................................................16

Fig12.fast-data-hackathonresults(TreasureDataBlog,2015,March20).........................18

Fig13.RenminUniversityresultsfor100-nodecluster(Chenetal.,2014).........................19

Fig14.Hawq’sspeedupratiovs.Impala(Solimanetal.,2014).........................................19

Fig15.Cloudera’sImpalabenchmark(Cloudera,2016,February11)................................20

Fig.16DesignScienceResearchMethodologyPeffers,Tuunanen,Rothenberger&Chatterjee,2007)..............................................................................................................22

Fig17.Queryexecutiontimeinseconds...........................................................................27


Table of tables Table1.Glossary................................................................................................................7

Table2.Whatisbigdata?(Cervone,2015).........................................................................4

Table3.RDBMScomparedtoMapReduce(White,2010)...................................................8

Table4.Requirements......................................................................................................23

Table5.Clusterdetails......................................................................................................25


Glossary Hadoop OpensourceJavaframeworkfor

distributeddatastorageHDFS HadoopdistributedfilesystemYARN (Yetanotherresourcenegotiator)

resourcemanagerforHadoopRDBMS Relationaldatabasemanagement

systemNode Commoditycomputer(hardware)used

byHDFSCluster Agroupofconnectednodesusedby

HDFSHive SQLdatawarehouseforHadoopMapReduce Programmingmodelforprocessingdata

onHadoopPresto SQLontopofHadoopsolutionSlider Applicationtorunnon-Yarn

applicationsonYARN

Table 1. Glossary


1

1. Introduction Thissectionintroducesthereadertotruecaller,thecompanywherethisstudyisconductedandgivessomebackgroundinformationaboutthecurrentsituationregardingthedatathatisstoredandanalysed.Followingthatinformationabackgroundregardingthedatawarehouseandtoolsusedfordataanalyticswithintheanalyticsteamoftruecallerisexplained.Hadoopisoneofthemaintoolsusedanddescribedonafairlyhigh-level.Truecallerisamobileappusedbyover200millionusersworldwide.Attruecallertoday,thedailyincomingdata(abouthalfabillionrows)isstoredinasystemdistributedfilesystemusingtheopensourceJavaframeworkHadoop.Thecollecteddataisusedbyallthedepartmentsofthecompanyforimprovingtheproductandtakingnewdecisions.Inordertoanalysethelargeamountsofdata,theData-AnalyticsteamislookingforanewlowlatencysolutionthatcanbeusedasacomplementtotheirstandardHadooptools,whichinsomecasesareconsideredtooslow.Thegoalofthisresearchistoidentify,testandevaluateanumberoftoolsthatfitstheteam’srequirementsandcouldpotentiallybeimplemented.

1.1 Problem statement “Runningad-hocqueriesonHiveistooslowfordatadiscovery”–BjörnBrinne,HeadofData-AnalyticsatTruecaller

1.1.1 Problem description Theproblemwithinthedata-analyticsorganisationattruecalleristhattheuseofHiveisnotfastenoughforad-hocqueryingandon-the-flydatadiscovery.TheusersruntheirqueriesinHiveandlongerquerieswithmultiplejoinsandaggregationscantakeuptoseveralhourstoexecute.Afterthequeryhasrun,theusermightnotbesatisfiedwiththeoutcomeandwillhavetoadjustandrerunthequery,sometimesinmultipleiterations.Thismakesthewholeprocessineffectiveandholdsbackthedatascientistsandotherusersfromusingtheircreativitytoaddmorevaluetotheirjobandcompanybecauseofpoorperformanceofthetools.Italsolimitsthemtojustfollowwhatisprioritizedatthattimeandleadstosomeworknotbeingdoneorpushedreallyfarbackinpriorityand.Thiscanalsomaketheanalyticsorganisationlessinnovative.

1.1.2 Research questions 1.HowmuchfasterthanHiveisPrestoontruecaller’sHadoopcluster?2.WhichtypeofusecasesisPrestosuitable/notsuitablefor?

1.1.3 Purpose Thepurposeofthisprojectistoidentify,testandimplementasolutiontosupportdata-discoveryandad-hocqueryingofdatainthetruecallerdatawarehouse.ThesolutionisaimedtocomplementHadoopandwillhopefullybeimplementedandusedbytheanalytics-teamattruecaller.Thereareseveralcommercialandopensourceproductsaresupposedtoprovidethatfunctionalitytotheusers.Thoseproductsoftenvaryinperformanceanddependontheenvironmentthattheyaresetupin.Therefortheaimandpurposeofthisresearchisonlymeanttosolvetheissuesanduse-casesattruecallerandisnotintendedtobeusedasageneralrecommendationorsolutionforothercompaniestorelyupon.However,theoutcomescanprovidethereaderswithanoverviewthatcan


2

becomparedtootherarticlesonthesubject.Themaingoalistoadvicetruecalleronwhichoftheidentifiedsolutionstoimplementanduseasananalyticaltool.

1.2 Motivations for research ThemainmotivationistotestandevaluatenewertechniqueswithinthegrowingBigDatafieldandtocomparethemwiththeindustrystandardtoolHive.Asecondarymotivationistoprovidesupportandhelptothedata-analyticsteamattruecaller,whichisundergoingrapidgrowth,andwishestofindamoresuitabletoolforfasterqueryingandanalysisoftheirdailyincomingdata.

1.3 Delimitations Theprojectwilltakeplacehalftimeunder4monthsandwillbebasedonpre-specifiedusecasesfortruecaller.Onesolutionistobetested.Firstonatestenvironmentlikee.g.avirtualonenodeclusterandasasecondstageintheproductionenvironment.Thesolutionisfreeopensourcetokeepthecostsdown.Thisprojectwillmostlyfocusonhowfastthedataispresentedandwillnotgoindepthonhowthetechniquesworkbehindthescenes,duetothesubjectofInformationSystemsScience.Thedatasetswillbepredefinedandtheaimsotofinallyusethedatawarehouseanddailyincomingdata.


3

2. Background

2.1 Truecaller TruecallerisaSwedishmobileapplicationcompanybasedinStockholm.Theirflagshipapplication;truecallerisdevelopedtoprovideuserswithinformationaboutwantedandunwantedcallsfromunknownnumbers.Bydoingthattruecallerhelpstheusersidentifyingspamcallsandmessages,givingthempossibilitiestoblockthosenumbersandalsosharethisinformationwiththerestoftheuserbase(Truecaller,2016).

Truecallerhastodayover200millionusersandissteadilygrowing.ThelargestusergroupislocatedinIndiawheretherearemorethan100millionusers.InIndiaalone,truecallerhelpsidentifyabout900millioncallsand120millionspamcallseverymonth.Thecompanystoresover2billionnamesandnumbersinitsdatawarehousewhereitislateranalysedforfurtheractions.Workingtogetherwithseveraltelecomprovidershelpstruecallertoreachouttonewusersindifferentcountries(Choudury,2015,October12).

InIndiawheretruecallerisverypopular,thenumberofunwanted/spamcallsisabigproblem.Thegovernmenthasspecialregulationsconcerningspamcalls(Rukmini,2014,December13)andthiscanbeonefactortothepopularityofthetruecallerapplication.

Truecallerstoresover1billiondailyeventsintheirdatawarehouse,wherethisdataislaterusedforanalysis.Theseanalysiscanbebasedonuserdata,applicationperformance,A/Btestingetc.andareusedforfurtherdevelopmentoftheapplication.

2.2 Big Data “Therewere5Exabyte’sofinformationcreatedbetweenthedawnofcivilizationthrough2003,butthatmuchinformationisnowcreatedeverytwodays.”-EricSchmidt,SoftwareEngineerandExecutiveChairmanofGoogle(Walker,2015)

2.2.1 What is Big Data? ThedefinitionofBigDataisoftendiscussedandisapopularandrecurringsubjectwithintheITworld.Somechoosetodefine“What”BigDataiswhileotherschoosetolookatwhatit“Does”(Gandomi&Haider,2015).Thedefinitionisthereforenotcrystalclearanduseof“thethreeV:s(Volume,VarietyandVelocity)”hasemergedandisbeingusedtodescribeBigData.VolumehastodowiththesizeandisoftenthefirstthingthatcomesinmindwhentalkingaboutBigDataseefig.2.WhentalkingaboutsizesinBigData,sizesovermultiplePetabytesoftencomestomind(Walker,2015)Keepinmindthatthesesizesincreaseallthetime.ItincreasesevenfasterthanMoore’sLawthatmeansitwoulddoubleinsizeeverytwoyears.Theincreaseofdataactuallydoublesinsizeaboutevery18months(NationalInstituteofStandardsandTechnology,2015).Varietyconcerningthevaryingtypesofdata,whichshowsthatthemajority(over90%)ofthedataisunstructuredduetothedifferenttypesofdatathatisuploadedallthetimethroughdifferentchannelssuchas,photos,videos,tweetsetc.Structureddataisstoredinrelationaldatabasesintables;thisisnotthecasefordatastoredinHadoop.Velocityreferstothespeedofwhichthedataisproduced,storedandanalysedseefig.1.Overthelatestyearswheresmartphoneshavebecomewidelyused,therateofdatavelocityhasexplodedandtheneedsforreal-timeanalyticsareinhigh-demand.Usingdatafromendusersfromasmartphonesvendorscananalysedatacollectedfromthehandhelddevices


4

suchasgeospatiallocation,phonenumbers,demographicsetc.(Gandomi&Haider,2015).Thesearethethree“main”V:sbutovertimeadditionalV:ssuchasVariability,Value,Validity,Veracityetc.havemadeitintodiscussionindefiningBigDataanddataanalytics(NationalInstituteofStandardsandTechnology,2015).

Fig 1. Exponential growth of data (Fishman, 2014, July 14).

Table 2. What is big data? (Cervone, 2015). HereisanexampleofoneofmanydefinitionsofBigData:“BigDataconsistsofextensivedatasets-primarilyinthecharacteristicsofvolume,variety,velocity,


5

and/orvariability-thatrequireascalablearchitectureforefficientstorage,manipulation,andanalysis.”(NationalInstituteofStandardsandTechnology,2015)

Sowhat“Does”BigDatado,whenisitusedandbywho?Wetakealookatitinnextparagraph.

2.2.2 Data Analytics BigDataisuselessifitisjuststoredandneverused.Themainreasonforstoringallthedataistoanalyseitandsomehowmakeuseofit.Withthehelpofdataanalysis,companiescanmakeimportantdecisionsfortheirprocesses(Walker,2015).Thewaysofanalysingdataisnotsodifferentfromthepast,usingstatisticalmodelsandformulas.Thedifferenceismostlyaboutthesizeofthedatasetsandthespeedofanalysinglargeramountsofdata.Also,withthehelpoftoolssuchasHadoop,companiescanfocusmoreoncausationbecauseoftheavailabilityofdatathatcanbeprovided.Theuseofcausationhelpsustodeterminewhysomethingishappeningwhichleadstomakingbetter(NationalInstituteofStandardsandTechnology,2015).AccordingtoLambrinisandJagadish(2012)(Walker,2015),theprocessforextractinginsightsofdataisdividedintofivestagesthatarepartofthetwosubprocesses,data-managementandanalyticsseefig.3.Thefirstthreestagesarewithinthemorebackendprocessdata-managementandcoversacquiringandstoringthedata,cleaningandpreparingitandthenpresenting/deliveringittotheanalysisprocesswherethedataisactuallyanalysed,interpretedandvisualized(Gandomi&Haider,2015).

Fig 2. The Big data process and sub processes (Gandomi & Haider, 2015).

2.2.3 Data Warehouse Adatawarehouseisadataset/databasedatacollectedfromoneormultipledatasourcesusedtoanalysehistoricaldatatohelporganisationswithdecision-making.In1990BillInmonintroducedthetermdatawarehouse,andin1993hestatedthatdatawarehouseisdefinedas:

- Subject-oriented:Meaningthatthedatafocusesonspecificsubjectsregardingthebusinesse.g.customers,productsetc.

- Integrated:Becausethedataoftencomesfrommultiplesourcesthedatahastobere-structuredtoholdaconsistentpatternofthedatawarehouse.


6

- Timevariant:Thedataisfetchedindifferenttimesanddatesithastobeshownsothatitcanmakesensefordifferenttimeintervalsandtimerelatedmatters.

- Non-volatile:Thedatagrowsincrementallyanddoesnotreplacetheolddatabutgrowssteadilyovertime(Connolly&Begg,2005).

DatawarehousesareoftenconnectedtoOLAP(OnlineAnalyticalProcessingTools)toolsforanalysis,miningandmachinelearning.TheprocesswhenadatawarehouseretrievesnewdatafromitsdatasourcesiscalledETL(extracttransformload).Datawarehousesareoftenmodelledusingastar-schema,whichdiffersfromanormalrelationaldatabasemodel(Jukić,Sharma,Nestorov&Jukić,2015).Astar-schemaisamultidimensionalmodelseefig.4usingfacttablesthatrepresentsabusinessobjecte.g.apurchase.Thefacttablesareconnectedtomultipledimensiontableswithspecificgrainsthattogethermakeupthefactsinthefacttablese.g.aproductdimension,timedimensionetc.Togethertheycanshowwhatproductwaspurchasedatwhattimeanddateetc.Datawarehousescanbemodelledwithdifferentschemastooe.g.snowflake-schema,OLAPcubeetc.(Connolly&Begg,2005).

Fig 3. Star schema (Connolly & Begg, 2005).


7

2.3 Hadoop 2.3.1 What is Hadoop HadoopwascreatedbyDougCuttingbetween2003and2006.CuttingwasworkingonanopensourcesearchenginecalledNutchwherehewasinspiredbyGoogle’stechnologiesGFS(GoogleFileSystem)andMapReducethattogethermadeupaplatformforprocessingadataefficientlyonalargescale(White,2010).FromthosetechnologiesCuttingstarteddevelopingHadoopthatquicklybecameaprioritizedprojectwithinApacheopensourcefoundation.YahoowhosoonhiredCuttingstronglysupportedtheprojectwithhelpandresources.ThenameHadoopcomesfromCuttingsson’syellowtoyelephantandisalsowhythelogoforHadooplooksthewayitdoes.By2008,HadoopwasbeingusedbyseverallargecompanieslikeAmazon,Facebooketc.(Turkington,2013).

HadoopisaJavabasedframeworkfordistributeddatacomputationandstoring.It’slinearscalabilityandcheapinfrastructuremakesiteasytomaintainandupgrade,hencemakesitthemainactorforBigDataprocessing.Hadoopismadeupoftwomaintechnologies,HDFSandMapReduce,whicharecompletelydifferentbutcomplementeachother.ByusingHadoop,largesetsofdatacanbecomputedinparallelonallthenodesofthecluster.Thiswayallcomputingpowerandresourcescanbeutilizedefficiently(White,2010).

2.3.2 Truecaller and Hadoop TruecallerstoresallitsdatainHadoop.HadoopprocessesthedatabyMapReducejobsrunninginparalleloveraclusterofdistributednodesusingHDFS(HadoopDistributedFileSystem).MapReduceisaprogrammingmodelthatsplitsthedataprocessesinaquerywheretheMapstagesortsandpairsthedatainkey-valueformatandthenthedataissummarizedandpresentedintheReducephaseseefig.5.

Fig 4. Simplified MapReduce dataflow (White, 2010).

ThemainadvantageofusingHadoopinsteadofarelationaldatabaseforstoringandanalysingBigDataisbecauseofcostandperformance.HadoopisrelativelycheapbecausethenodesarenormalcommoditycomputersthatinteractinHDFS.Itisalsolinearlyscalablewhichmeansthatwhenitisneededtoaddmorecapacityorspace,addingnewnodestotheclustercaneasilysolvetheissue(White,2010).WhenusingaRDMSthedatahastobestoredondiskandinmemory,whichcanbeproblematicwhenitcomestolargeamountsofdatae.g.Petabytes.Therefore,spreadingtheworkloadonmultiplenodeshelpssolvethisissue(Chenetal.,2014).However,therearesomeotherdifferenceswhereHadooplosesagainsttheRDBMSseetable.1.ThetypesofqueriesthataremainlysupposedtobeexecutedinHadoopareMapReducebatchjobs.Hadoopwasnotmeanttobeused


8

forinteractiveon-the-flyad-hocqueriesordatadiscovery.RDBMSarebetterforad-hocqueriesandprovideamuchmoreinteractiveuserinterfacethanHadoopwasintendedtodo.

Table 3. RDBMS compared to MapReduce (White, 2010).

InrecentyearsmanynewSQLonHadoopenginesthatletstheuserhaveanSQLinterfaceandlanguagetostructureandquerythedatawhileusingMapReduce,havebeendevelopedtosupportandprovidead-hocquerying.HiveisnowmoreorlessanindustrystandardtoolforSQLonHadoop(White,2010)(Chenetal.,2014).Hiveisusedasthemaindatawarehousetoolattruecaller.

2.3.3 MapReduce MapReduceisaprogrammingmodelusedforprocessingbigdatadistributedandinparallelacrosstheHadoopcluster.MapReduceisoneofthetwomainpartsofHadoopandcanbewritteninmanydifferentprogramminglanguages.Asnotedearlier,theMapReducemodelconsistsoftwodifferentphasesthe;

a. Map,whichtakestheinputdataand,b. Reduce,whichoutputsthefinalresult.

InMapstagetheinputdataisdividedintosplits,whichhaveonemaptaskforeachsplit.Themapfunctionthenrunsforeachrecordinthesplit,andYARN,whichisHadoop’sresourcemanagerschedulesthetasksanddividesthemamongthenodesinthecluster.IfajobfailsYARNseestoitthatitisre-runonanothermachine.HavingtherightsizeofsplitshelpstheperformanceoftheMapReducejob,becausewhenthesizeofthesplitsaretoobigthejobwillbedividedintofewermachinesandwilltakelongertoprocess,whileiftherearemanynodesandthesplitsaresmallthejobcanbedividedamongmanymachineswhichcanprocesstheminparallel(Maitrey&Jha,2015).Thisissomethingtheuserhastosetuprightfortheusedenvironment.Normally,asplitwillbe128MB,likethesizeofaHDFSblock,whichisthedefaultsize.Ablockandasplitisnotthesamething.Ablockisaharddivisionofthedata,whichispre-set.Becausearecordofdatadoesnotalwaysfitinoneblockorcanoverlaptoanotherblock,theMapReducejobusessplitskeeptrackofthedifferentrecordstoprocess.ThenumberofsplitsforaMapReducejobcanthereforenotbepredefinedbutisentirelydependentofthesizeofthedatatobeprocessedinthejob(White,2010).


9

Forbestresults,HadooptriestorunthemapphaseonthesamenodeaswheretheinputdataisstoredonHDFS.Thisway,networkbandwidthisnotwasted.IfthatnodeisbusyrunningothermaptasksHadoopchoosesanothernodeintheclusterwiththereplicateddata.Thenameforthatprocessiscalleddatalocalityoptimization.

Asmentioned,foreveryinputsplittherewillbeonemapperandthemapperoutputsakey-valuepairasaresult.Thesekey-valueoutputsarewrittentothelocaldiskonthenodeandnotonHDFS.Thisisbecausetheresultofthemaptaskisonlyfortemporaryuse,andoncethefinalresultofthereducetaskiscompleted,thatdatawillbedeletedfromthelocaldisk.Thereducephasedoesnothavethedatalocalityoptimizationbecauseareducerreceivesdatafrommultiplemapperswhichhavedatafromdifferentnodes.Afterthemapphasethekey-valuepairforeveryrecordisgroupedbythekeywithalistwiththevaluesbeforefilteredtotherightoutputinthereducephase(Turkington,2013).ThisistheprocedureforaMapReducejobwithonlyonereducerseefig.6.Ifdealingwithalargesetofdata,itisadvisabletousemorethanonereducertoavoidperformanceissuesandbottlenecks.

Thephasebetweenmapandreduceiscalledtheshufflephase.Theshufflephaseisveryimportantbecauseitiswherethekey-valuepairsaresortedandmergedtogether,creatingthekeywiththelistofvaluesthatareusedinthereducephasebeforethefinaloutput(White,2010).Itcomesintofocuswhenthejobsareusingmorethanonereducerseefig.7.

Fig 5. MapReduce flow with one reducer (White, 2010).


10

Whenusingmorethanonereducer,themapperspartitionsthedata,onepartitionforeveryreducer.Thisway,intheshufflephasewhendataissortedandmergedtherightkey-valueoutputgoestotherightreducer.Withoutthepartitioningtheshufflephasewouldnotknowwhichreducerthedatashouldbesentto,whichwouldresultinhavingsamekeysspreadoutondifferentreducerswhichwouldoutputaresultforeachreducer.Thiswouldresultinanoutputthatwillbeprocessedineffectively,andwouldquitelikelybewrong(Maitrey&Jha,2015).

Fig 6. MapReduce data flow with multiple reducers (White, 2010).

2.3.4 HDFS TheHadoopdistributedfilesystemorHDFSisthesecondandotherimportantpartofHadoop,togetherwithMapReduce.HDFSisthefilesystemusedonaHadoopclusterandisusedforstoringlargesetsofdataonaclusterofdistributednodes.OneofthekeyfeaturesofHDFSisthatitisrunoncommodityhardwarenodes,whichmeansthatisnormalpricedhardwarethatcanbeboughtinmultiplevendors,andnotspeciallydesignedforHadoop.HDFSisdesignedtokeeprunningifonenodefailsbecausethatissomethingthathappensfrequentlyandmustbeaccountedfor.

Thedataprocessingisstreamprocessedwhichmeansitiswrite-once,read-many-times.Thisconceptismeantthatthedatasetsusedarewrittenonceandthenusedforprocessingovertimeandtherefortheimportanceofreadingthedatasetsisbiggerthanlowlatencyfortheoutputresults.Thehighthroughputofthedatacomeswiththecostoflatency.ThelimitforthenumberoffilesstoredinHDFSisdependantofthememorysizeofthemasternode(s),alsoknownasthenamenode.ThenamenodestorestheMetadataofthefilesysteminitsmemory(location,typeetc.)andisusedtolocatethefilesonthedifferentworkernodes(White,2010).


11

Thefilesystemworksbyusingnamenodesanddatanodes(master/slave)seefig.8.InitiallythereusedtobeonenamenodeinHDFS,butinlaterversionsofHadoop,theuseofmultiplenamenodeshavebeenintroducedforstabilityandoff-loadpurposes.Todaynormally,thereareprimaryandsecondarynamenode(whichareinstandbymode).ThenamenodemanagesthefilesystemtreeandmetadataandkeepstrackofallthefilesanddirectoriesontheHDFS,whicharestoredontherestofthenodes,whicharethedatanodes.ThelimitationforstoringfilesinHDFSisthememoryonthenamenode,whichholdsthereferencetoallthefilesinthesystem.Aslongasthememoryonthenamenodeisnotfull,nodesanddiskspacecanbeaddedtothecluster,whichmakesitscalable.ThedatanodesaretheworkersintheHDFSsystemandallthedataonHDFSisstoredinblocksonthedatanodes.Assoonasachangeoccurstodataonadatanodethatisfoundinthenamespace,itisrecordedonthenamenode(Maitrey&Jha,2015).Theuseofstoringdatainblockshastheadvantagesthatfileslargerthanthedisksonthenodesdoesnotbecomeaproblem,sinceitcanbedistributedamongmultipleblocksinHDFS.Theblocksarereplicatedanddistributedontheclusterforbackupandfailureconcerns.Thedefaultreplicationfactoris3,whichmeansthatablockwillbereplicatedandspreadthreetimesonthecluster.ThismakesHDFSveryreliablewhenanodefailsorgoesdown(Apache,2016)(Srinivasa&Mupalla,2015).

Fig 7. HDFS Architecture (Srinivasa & Mupalla, 2015)

2.3.5 YARN ApacheYARNor(YetAnotherResourceNegotiator)isHadoop’sresourcemanagementsystemandwasincludedwithHadoop2.0.YARNisusedforbothresourcemanagementacrosstheclusterandalsofornodemanagement,managingtheresourcesforeverysinglenodebyusingcontainers.The


12

containersareusedforrunningapplicationsontopofYARNandhavetheirownallocatedresourcessuchasCPUandmemory.ApplicationscanbebuiltonYARNusingdifferentAPIs,andYARNisoftennotsomethingthattheenduserhastobeconcernedabout.ApplicationsthatuseYARNasanintermediatelayerbetweenthemandthestoragelayerarecalledYARNapplicationsseefig.9.ThiswayYARNmakesiteasiertorunapplicationsontheclusterandseestoitthattheresourcesareallocatedright(White,2010).

Fig 8. Applications running on YARN (White, 2010) YARNconsistsof:

a. Theresourcemanagerb. Nodemanager,andc. Theapplicationmaster.

Theresourcemanagerlooksaftertheclustersresourceswhilethenodemanagerhandlesthenodesandcontainers.Theapplicationmaster(oneperapplication)negotiatesresourcesfromtheresourcemanagersoandtogetherwiththenodemanagersithelpstheapplicationrunandcomputetasks.

Theresourcemanageritselfconsistsof:

a. Thescheduler,andb. TheApplicationManager.

Thescheduleristhecomponentthatschedulesandallocatesalltheresourcesneededforrunningapplicationsbasedonwhatresourcesareneededandwhichcanbefoundinthecontainers.ApplicationManagerensuresthattheapplicationgetstherightcontainersfromtheschedulersothattheapplicationcanrun.Whilerunningitisalsoresponsibleforrestartingwhenfailuresoccur,andtotrackandmonitorthestatusoftheapplicationseefig.10(ApacheHadoop,2016).


13

Fig 9. YARN components (Apache Hadoop, 2016).

2.3.5.1 Slider SliderisaYARNapplicationdevelopedtorunnon-YARNapplicationsonYARNandtosimplifythisprocess.Withslider,differentusers,usingdifferentinstancesoftheapplication,canrunanapplicationonYARN.Applicationscanalsobestopped,killedandrestartedetc.Anexampleofanon-yarnapplicationthatcanberunwithslideristheHadoopqueryenginePresto(ApacheSliderIncubator,2016)(White,2010).

2.3.6 SQL on Hadoop SQL(StructuredQueryLanguage)hasbeenusedforanalysingdatainrelationaldatabasesforalongtimeandisanindustrywidestandardlanguagewithinBusinessIntelligencetoo.InordertohelpuserswithoutsoftwareengineeringskillstouseHadooptoanalyseandqueryitsdata,differentSQLontopofHadoopsolutionshavebeencreated.Moreandmorecompetingsolutionsemergenowandthentotrytogivetheuserstheeasiestandfastestsolutiononthemarket(Turkington,2013).AccordingtoT.T.MaposaandM.Seth(2015)thefutureofSQLonHadoopisverybrightbecauseofthesimplicityandwideuserrangeofSQLandthepoweroftheHDFS,togetherformingaverysoughtaftersolution(Maposa&Sethi,2015).

2.3.6.1 Hive HiveisanopensourcedatawarehousingsolutionforHadoopandhasanSQLlikequeryinglanguageHiveQL.HivewasdevelopedatFacebookin2007andisaverybroadlyusedtoolthathasbecomeanindustrywidestandard.HivecompilestheHiveQLstatementsandintoMapReducejobsandreturnstheresultstotheuserinaSQLlikemanner.TheHiveQLlanguagehasmanyofthestandardSQL


14

functionsanduserscanaddcustomizedfunctions(UDFs)toaddmorefunctionality.Primitives,strings,arrays,mapsandstructsaremanyofthedatatypessupportedinthetables,whichisoneofmanyreasonsforHive’spopularity.Hivecanberunbothinteractivelylikeanormaldbmse.g.MySQLandthroughbatchjobsinthebackgroundorafteraschedule.ThestandardinterfacesarecommandlineandawebGUIcalledhue.

ThedatamodelofhiveisorganizedintableswithacorrespondingdirectoryinHDFS,partitionsthataredividedintosubdirectoriesinthetabledirectory,andbuckets,whicharethenextlevelofpartitioningperpartitionforfasterprocessingofdata.

ThearchitectureofHiveconsistsofseefig.11:

a. Theclientforuserinteractionb. TheHiveServer2(Thrift)serverthatinterpretsthedifferentexternallanguagesagainstHive.c. TheMetastorethatisthesystemcataloguethatstorestheMetadataaboutallthe

databases,tablesandpartitionscreatedbyinHive.d. TheDriverhandlestheHiveQLstatementduringcompilationandexecutionandeverystep

betweentoandfromtheclientgoesthroughtheDriver.e. TheCompilerthattranslatestheHiveQLstatementfromtheDriverintoaplanthatsimply

speakingconsistsoftheMapReducejobsthataresubmittedtothe(Hadoop)executionengine.(Thusooetal.,2009)(White,2010)(Turkington,2013).


15

Fig 10. Hive Architecture (Thusoo et al., 2009).

2.3.6.2 Presto PrestoisanSQLenginethatalsowasdevelopedbyFacebookanditsmainpurposeislow-latencyad-hocquerying.PrestousesANSISQLassyntax.ItcanrunonHadoopbutalsoonmanyotherdatabaseengines.WhenconnectedtoHadoop,PrestousestheHiveMetastoretofindthedataonHDFS.ThemaindifferencebetweenPrestoandHiveisthatPrestodoesnotprocessthequeriesusingMapReducebutinsteadallocatesallthetaskstotheworkers(nodes)memory,andschedulesnewtaskstotheworkersonceataskisfinished(Teradata,2016).ThismakesPrestomoresuitableforinteractivequerying,especiallywhentheworkersrunoutofmemorybecauseofthesizeofthedataprocessedinaquery.ThearchitectureisbuiltoftheclientthatsendsthequerytoacoordinatorthatparsestheSQLstatement,plansandschedulestheexecution.Theexecutionisthenpipelinedanddistributedtotheworkerslocatedclosestthedata.Theexecutionrunsmultiplestagessimultaneouslyanddataisstreamedbetweenallthestagesonlyusingmemoryforlow-latencyoutput(Presto,2016).


16

Fig 11. Presto Architecture (Teradata, 2016) .


17

3. Literature review

3.1 Purpose of literature review TherehavebeenmanySQLonHadoopbenchmarkscarriedout,totrytofigureoutwhichoneofthemanyprojectstogofor.Manyofthesebenchmarksaredonebyspecificvendorstryingtomarkettheirownproduct.Forthatreasontheyarenotalwaystrustworthywhenlookingtoimplementasolutionatanorganisation.Thevaryingresultsoftendependonthetestedinfrastructure,architectureandoptimizationsmadeforthespecificsolutions.Anotherfactorisalsowhenintimethetestswereconducted.Theopensourceprojectsareinconstantdevelopmenttoimproveandbecomemarketleaders,andnewversionsofthesolutionwillthereforebemoreoptimizedthanthepreviousones.Thiswillbeclearerafterreadingtheresultsfromthedifferentbenchmarkingattemptsthatwillbepresented.Mostofthetestedenginesinthebenchmarksbelowwillnotbeincludedinthebenchmarkingattruecallerbutwillshowthediversityofresults.ThereforIthinkitisimportanttoshowthedifferencesintheresultsandhowtheseresultscannotbetheonlysourceforchoosingandtrustingasolutionbyonlyreadingabenchmark.OnethingthatiscommoninthebenchmarksinthischapteristhatthattheyallbeatHiveinperformancebecausetheydon’tuseMapReducefordataprocessing,butmostlyprocessthedatainmemory.IhaveonlychosenPrestofromthemanyenginesthatarementionedinthissectionbecauseitmetalltherequirementsthattruecallerhad.

3.2 Benchmarks 3.2.1 Fast-data-hackathon Theallegro.techblochreportsresultsfromthefast-data-hackatonwhereagroupofuserscomparedvariousSQLonHadoopsolutionsona4nodetestclusterinJune2015.Theyused11differentqueriesintheirtests(AllegroTech,2015,June25).Thetestedsolutionswere:

- HiveonTez- Presto- Impala- Drill- Spark


18

Fig 12. fast-data-hackathon results (Treasure Data Blog, 2015, March 20).

IntheresultsaboveImpalaanddrillweretheclearwinnerswhenitcomestoexecutiontimebutaccordingtothegroupconductingtheteststheybothlackedsomefunctionalitythatcouldbefoundinothersolutionssuchasHiveandPresto.

3.2.2 Renmin University In2014attheRenminUniveristyinBeijing,ChinaagroupofresearcherstestedfivedifferentSQLonHadoopsolutionstocomparethemandbecauseofHivebeingthestandardtoolandnotbeingefficientenoughforinteractivequerying.Theyused11differentquerieswithdifferentcomplexityon25,50and100nodeclusters.Thesolutionstestedwere:

- Impala- Presto- Hive(toshowdifference)- Stinger- Shark


19

Fig 13. Renmin University results for 100-node cluster (Chen et al., 2014).

Theresultsshownotonlythedifferentresponsetimes,butalsothatsomequerieswerenotabletorunonallsolutionsexceptHive.Accordingtotheteststheresearchersconcludedthatnoneofthesolutionswerematureenoughtobeusedonadailybasis(Chenetal.,2014).

3.2.3 Commercial benchmarks

3.2.3.1 Pivotal (HAWQ) PivotalthatdevelopedtheSQLonHadoopsolutionHAWQbenchmarkedtheirproductagainstCloudera’sImpala,PrestoandStingerona8nodeclusterinJune2015.


20

Fig 14. Hawq’s speed up ratio vs. Impala (Soliman et al., 2014).

TheresultsinthegraphshowshowHAWQbeatsImpalainthemajorityofthequeriesruninthetestcases(Solimanetal.,2014)(Pivotal,2014,June25).ItisworthhighlightingthatImpalahasbeentheclearwinnerwhenitcomestospeedinallthegraphsabove.

3.2.3.2 Cloudera (Impala) InFebruary2016,ClouderareleasedtheirlatestImpalabenchmarkcomparingImpalatoHiveonTezandSparkona21nodecluster.

Fig 15. Cloudera’s Impala benchmark (Cloudera, 2016, February 11).


21

ThegraphaboveisoneofmanygraphsonCloudera’sblogshowinghowImpalabeatsitscompetitorsinallways(Cloudera,2016,February11).Itisworthnotingthatthisisanormalscenariocomingfromthevendorordeveloperstryingtoprovetheefficiencyoftheirproducts.

3.3 Hive not meant for low-latency querying BecausehiveisbasedonHDFSandHadoop,itismeantforbatchjobsfocusingonhighthroughputandscalabilityandcanbeusedforbothanalysisandETL.BecauseofHive’shighfaulttoleranceandwideusability,ithasbecomeandindustrywidestandardasadatawarehousetoolforHadoop(Turkington,2013).Italsothereforpreferredforbeingreliableandstableandotherlow-latencysolutionsareusedonthesideforad-hocquerying(Cloudera,2016).


22

4. Research methodology ThisthesiswillbethestartofaprocesstoexploreandtestSQLonHadoopqueryengines.ThescopeofthisthesiswillbetheevaluationofthequeryenginePresto.Prestowillbeevaluatedagainsthiveandthemainfocuswillbeonthelatencytimeoftheusecasequeries,Secondlyitwillalsoevaluatethetypeofusecasesitissuitableforandidentifyingandcomparingtheprosandconsagainstthebaselinesolutionhive.

Themethodologyusedinthisthesiswillbedesign-scienceandthedesign-scienceresearchprocessinterpretedbyPeffersetal.(2007)asthestepsinvolvedaresuitedforidentifying,implementingandevaluatingmychosensolution.Themainfocususingthismethodologywillbeontheevaluationofthesolutionratherthanbuildinganartefact,focuswillbeondeployingandconfiguringPrestothatisanalreadydevelopedopensourcequeryengine.Definingtherightusecaseswillalsobeabigpartofthedesign.

Thestepsindesign-scienceresearchprocessare:

1. Identifyproblem&Motivate.2. DefineObjectivesforsolution.3. Design&Developement.4. Demonstration.5. Evaluation.6. Communication.

Fig. 16 Design Science Research Methodology Peffers, Tuunanen, Rothenberger & Chatterjee, 2007).

4.1 Identify problems & Motivate Thefirststepinthedesign-scienceresearchprocessistoidentifyandmotivatetheproblems.Attruecallerthemainproblemisthelatencyandexecutiontimeofhivewhenrunningad-hocqueries.

Problems:

1. Hive/MapReduceistooslowfordatadiscovery/ad-hocqueries.


23

Asnotedearlier,theproblemwiththecurrentsetupattruecalleristhatthedatawarehouseisaccessedthroughHiveusingMapReduceonHadoop.Runningcomplexqueriestakesoftenunnecessarylongtimeandthisisbothbadfortheanalyticsteamandforthestakeholdersthatrequestad-hocanalysisonaregularbasis.MapReduceisbetterforbatchjobsthatcanbescheduledforstandardanalysisratherthanad-hocqueryinganddatadiscovery.HiveandMapReducetakestimebecauseoftheprocessingisdonebywritingtheoutputondisk.

4.2 Define Objectives for solution Theobjectivesforthesolutionwillinthiscasebetherequirementsforidentifyingthenewsolution.

Requirements:

Table 4. Requirements

Beforetryingtoidentifythedifferentsolutions,discussionswillbeheldwiththeendusersattheanalyticsteamtogatherrequirementsregardingthesolution.Therearemanyopensourcesolutionsthatcanbetested,andwiththerightrequirements,thelistcanbenarroweddown.

Heretheusecaseswillbedefinedbasedoneverydayad-hocanalysiscarriedoutbytheanalyticsteam.Theusecaseswillincrementallybecomemorecomplextocomparethetimeandresourcesusedbythesolutionswitheachother.AninitialbaselinewheretheusecasesareruninHivewouldbethesecondstepsothatthesolutionscanbecomparedwiththecurrentsituation.AlltheusecaseswillbetestedinsolutionssupportingSQL-syntaxandthequerieswillbethesameorvaryinnativesyntax.(Eg.HiveqldoesnotsupportallstandardSQLfunctionsandSQLcanalsovarydependingontheengine/dialect.)

4.3 Design & Development Inthisstepdesigninganddevelopingandartefactisthecommonpracticeindesign-science.Inthiscase,researchingandIdentifyingthesolutionwillbethepre-stepbeforedeployingthesolutiononfirstatest,andafterontheproductionenvironment(Hadoopcluster).

PrestowasthechosensolutionbecauseitmetalltherequirementsthatwasmentionedaboveandwaseasyenoughtodeployusingtheYARNapplicationversion.

Thefirststepwillbetotestthesolutiononasmallertestenvironmentbeforeimplementingittotheproductionenvironment.ForPresto,thetestenvironmentwillbeadockercontainer,whichisatype

1.Opensource2.SQLinterface3.SupportstoragefiletypeORC4.SupportsamedatatypesasHivee.g.structs,array,Map5.UDF/UDA(UserDefinedFunctions/Aggregations)6.Standardanalyticalfunctions7.Windowfunctions


24

ofvirtualenvironment.ThecontainerwillhaveaonenodeHadoopclusterinstalledandreadytoteston.Thetestsetupwasinstallingthesolution,importingadatasetandtestrunningsomerandomqueries.

4.4 Demonstration Thisstepwillbeforpreparingtheevaluationofthesolution.Inthiscasedefiningtheusecasesfortheevaluationisthemaingoalofthisstep.

Theimplementationofthesolutionwillnotbeheavilydocumented,eventhoughabigpartofthetimeinthisprojectwillbespentondeployingandconfiguringthesolutiontothecurrentenvironmentattruecaller.Thetestingintheproductionenvironmentwillbebasedontheusecasesthatwerepre-defined.ThisstepwillshowhowthesolutiondiffersfromHive,andwhattheprosandconsfortheparticularqueriesare.Theusecaseswillberun4times.2timesduringworkhourswhenthereisnormalloadontheclusterand2timesduringtheevening.Theaveragetimeinsecondswillbeusedtoshowtheresultofthelatencytime.

4.5 Evaluation Thisstepistoevaluatetheresultsofthesolution.

Theresultsoftheusecaseswillberecordedandvisualizedusingbarchartstoshowthedifferenceinlatencytimebetweenthedifferentsolutions.Theresultchapterofthethesiswillbemainlyfocusingonthisstepofthemethodology.Otherconclusionsfromtheusecaseswillalsobedrawnfromthebaselineandthenewsolutione.g.prosandconsandsuitableusecases.

4.6 Communication Thethesisitselfrepresentsthisstep.Togetherwiththethesis,theconclusionsandrecommendationsforfurthertestingattruecaller,apresentationatLTUscampuswillbeheldforsupervisorsandotherstudents.


25

5. Result

5.1 Cluster details TheHadoopclusterattruecallerconsistsof14nodeswithfollowingcomponents:

OS Linux(debian-kernel)3.2.0.4-amd64CPUmodel Intel(R)Xeon(R)[email protected] 32RAM 128GBHDD ~20TB

Table 5. Cluster details

5.2 Implementation Prestowasdeployedontruecaller’sHadoopclusterusingaversionthatallowsittorunasaYARNapplicationusingall14nodes.OnenodesharedasbothCoordinatorandWorkerandtherestasWorkers.Thewholesetupandconfigurationcanbefoundunderappendices7.1Prestoinstallation.

5.3 Use cases and latency in seconds 1.Select*withrestrictionandLimitationtype1.Tablesize14Billionrows.

Hive:~20sec

Presto:~5sec

Conclusion:

Presto4xfaster.

2.Select*withrestrictionandLimitationtype2.Tablesize14Billionrows.

Hive:~40sec

Presto:~7sec

Conclusion:

Presto~6xfaster.

3.Select*withrestrictionandLimitationtype3.Tablesize14Billionrows.

Hive:~200sec

Presto:~15sec

Conclusion:

Presto~13xfaster.

4.Selectcountwithrestriction.Tablesize14Billionrows.


26

Hive:~90sec

Presto:~23sec

Conclusion:

Presto~4xfaster.

5.Select*withrestriction,limitationandOrderBy.Tablesize14Billionrows.

Hive:~340sec

Presto:RunsoutofmemorybecauseofOrderBy,tooexpensive

Conclusion:

Notsuitableusecaseforpresto.

6.Selectwith2aggregatefunctions.Tablesize14Billionrows.

Hive:~600sec

Presto:~120sec

Conclusion:

Presto~5xfaster.

7.Selectwithaggregates,groupbyandlimitationtype1.Tablesize14Billionrows.

Hive:~120sec

Presto:~40sec

Conclusion:

Presto~3xfaster.

8.Selectwithaggregates,groupbyandlimitationtype2.Tablesize14Billionrows.

Hive:~120sec

Presto:~50sec

Conclusion:

Presto~2,5xfaster.

9.Selectwithmultipleaggregates,groupby,limitationsandJointable3times.Tablesize14Billionrows.

Hive:~6000sec

Presto:~7000sec

Conclusion:

Hive1,2xfaster.


27

5.4 Result table

Fig 17. Query execution time in seconds

20 5 40 7 200 15 90 23 340 600 120 120 40 120 50 600070000

50

100

150

200

250

300

350

400

450

500hive

presto

hive

presto

hive

presto

hive

presto

hive

presto

hive

presto

hive

presto

hive

presto

hive

presto

q1 q2 q3 q4 q5 q6 q7 q8 q9

Second

s


28

6. Discussion

6.1 Presto at truecaller LookingatthequerieswherePrestoperformedfasterthanHive,theaveragegainoflatencytimeisaround5timesfaster.Thisisabitlowerthansomeexamplebenchmarksthatcanbefoundonline.Howeveritisstillabiggainwhenneededtorunqueriesoverandoveragainduringwork.Prestoshowedclearlythatitismoresuitedforreturningfastinsightsonasmallerscaleandissuitedforad-hocanalysisforsomeonewhoisfamiliarwiththedatasetsandknowswhattoandwheretolookfor.Itislesssuitedformoreexpensivequerieswheretheworkerneedstostorewithlargeamountsofmetadata,andriskstorunoutofmemory.Intheconfigurationandtesting,allocating70GBofmaximumquerymemorywasnotenoughforquery5whichcontainedanorderbyoperation,andranoutofmemory.Prestoisalsonotsuitedformultiplejoinsbetweenlargetables,whereitperformedslowerthanHive.Intruecaller’scasePrestoisfineformanyofthecommonday-to-dayad-hocusecases.Attruecallermultipletoolsareusedwithintheanalyticsteamandknowingtheprosandconsofthetoolshelpstheuserstochoosewhichonetogofor.ForsomeETLjobs,Sparkisbeingusedandtherefortheteamwillnotsufferfromaddinganothertooltotheirtoolbox.Prestohassomeminorsyntaxdifferencesinbuttheyareeasytolearnandshouldnotbeseenasanobstacle.AnotherfindingfromthisresearchisthatPrestocanbeconfiguredandtunedonverydetailedlevel.Theresultspresentedinthegraphareafternumerousiterationsofconfigurationsandtuning.Thereisalsothestandardversion,whichrequiresmoreconfigurationsonanoperatingsystemlevel.TheversiontestedattruecallerwastheYARNapplicationversion.ThereforethereismoreroomfortuningandtestingPrestobeforechoosingthefinalsettingsfortruecaller’sHadoopclusterandenvironment.MyrecommendationwouldbeaftertuningandconfiguringPresto,tousePrestoforsomemorelightweightqueriestogetherwithHiveforthebiggerjobs(ETL,batchetc.).

6.2 Use cases TheusecaseschosenfortheevaluationarebasedonworkIdidfortruecalleronthesideofthethesiswork.Thequeriesusedareactualad-hocrequestsfortheanalyticsteam,whichItookupontolearnmoreaboutthedatathatwasanalysedonadailybasis.ThatwayIgotagoodpictureaboutwhattypesofquerieswereoftenrunandwillberuninanewsolutionifimplemented.

6.3 Limitations Thegoalfromthebeginningwastoevaluate2-3solutionsandcomparethemtoHivelikeIdidwithPresto.Duetosomeresourcelimitationsandthecomplexityoftheproject,Icouldnotproceedwithtestingmoresolutions.Thishadmostlytodowithmenotriskingcompromisingthedailyworkbytryingtoimplementnewsolutionsontheproductionclusterwithoutseniorsupervision.

6.4 Commercial alternatives Thereasontotestopensourcesolutionswasasearliermentionedtokeepthecostsdown.ThereareotherwisecommercialalternativesthatperformverywellandrunstablewithHadoop.AnexampleisExasolanin-memoryanalyticsdatabase.IhavetestedandworkedonExasolinotherprojectsandwouldrecommenditasanlow-latencysolutionforcompaniesthatcanconsiderpayingforsuchasolution.


29

6.5 Further research TuningandconfiguringPrestowithprofessionalhelpwouldbeafirststeptoseehowmuchfasterPrestocouldbecome.Itwouldalsobeinterestingtoseeifthereareanygainsregardingheavierqueriesandjobscomparedtothecurrentconfigurationsattruecaller.AnextstepwouldbetoseeifgainsinexecutiontimewouldbepossiblebytryingtotunetheparametersoftheORCfile.Thiswasanotherintentiontotestduringthisresearchbutcouldnotbedoneduetotimelimitations.

ApacheDrillandApacheTajoaretheothertwosolutionsthatweremeanttobetestedandevaluatedinthisthesiswork.Tajowasactuallyimplementedbutdidnotsupportsomeoftherequiredfiletypesanddatatypes.IhavebeeninclosecontactwiththedevelopersofTajoanditshouldmeetalloftruecaller’srequirementsinthenextrelease,whichwasinitiallyplannedforsometimeinApril2016,butwasdelayed.Afterthereleaseandifthereistimeandresources,lookingintobothofthesesolutionscouldbeofinterestfortruecaller.

Ifthereareadditionalresourcesandtime,IwouldrecommendtestingImpalabychangingfileformatsfromORCtoParquetanddorunthesametypeoftests.Becauseoftheresultsfrommostofthebenchmarksreadduringthisresearch,ImpalaisproventobeastrongcandidateforalightweightSQLonHadooptool.Theonlyproblemisthedifferenceinthefiletypes(ORCvs.Parquet)thattruecallerhaschosentostore.


30

7. Conclusion TherearemanySQLonHadoopsolutionsonthemarkettoday.PrestoisonoftheknownnamesofopensourceSQLonHadoopsolutionsandwaschosenasthetooltobetestedataftermeetingalltherequirements.SinceonlyPrestowastested,duetolimitations,itcannotbesaidthatitistheonlyand/orperfectsolutionfortruecaller’scase.Howeverithasansweredtheresearchquestionsofthisthesisbyprovingtobeonaverage5timesfasterthanHive.Prestowasprovenworkbestforlighterad-hocanalysisrunningonsmallerdatasetsorsampledatatakenformlargersetstoproveapoint.Prestodidnotworkwellwithlargerdatasetsandheavierqueriescontainingmultiplejoins/selfjoins.Prestocanbetunedandoptimizedtofittheparticularenvironmenttoperformatitsbest.Theuse-casesandalltestingweredoneattruecallerinStockholmandtheresultsarealldependingontheconfigurationandhardwareofthetruecaller’sHadoopcluster.Prestoisnowinusebytheanalyticsteamattruecallerforspecificad-hocusecasesandwillnotpermanentlyreplaceHiveoranyothertoolatthecompany.User’sthatknowthedatawillbeabletomaketherightdecisionforwhentoandnottousePresto.UsingPrestofortherightuse-caseswillsavetruecaller’sanalyticsteamtimeandwilladdefficiencytotheirwork,whichiswasthemainpurposeforthemtowanttotestanothersolution.BydeployingPrestoattruecaller,thisgoalhasbeenreachedfornow.


31

8. References

8.1 Books and articles • White,T.(2010).Hadoop:thedefinitiveguide.(2nded.)Farnham:O'Reilly.• Chen,Y.,Qin,X.,Bian,H.,Chen,J.,Dong,Z.,Du,X.,&Zhang,H.(2014).AStudyofSQL-on-

HadoopSystems.LectureNotesInComputerScience,(8807),154-166.• Walker,R.(2015).FromBigDatatoBigProfits:SuccesswithDataandAnalytics.Oxford

UniversityPress.• Gandomi,A.,&Haider,M.(2015).Beyondthehype:Bigdataconcepts,methods,and

analytics.InternationalJournalOfInformationManagement,35137-144.doi:10.1016/j.ijinfomgt.2014.10.007

• NationalInstituteofStandardsandTechnology.(2015)BigDataInteroperabilityFramework:Volume1,DefinitionsU.S.DepartmentofCommerce.

• Turkington,G.(2013).HadoopBeginner'sGuide.Birmingham:PacktPublishing.• Cervone,F.(2015).BIGDATAANDANALYTICS.OnlineSearcher,39(6),22-27.• Maitrey,S.,&Jha,C.(2015).MapReduce:SimplifiedDataAnalysisofBigData.Procedia

ComputerScience,57(3rdInternationalConferenceonRecentTrendsinComputing2015(ICRTC-2015),563-571.doi:10.1016/j.procs.2015.07.392

• Jukić,N.,Sharma,A.,Nestorov,S.,&Jukić,B.(2015).AugmentingDataWarehouseswithBigData.InformationSystemsManagement,32(3),200-209.doi:10.1080/10580530.2015.1044338

• Thusoo,A.,Sarma,J.,Jain,N.,Shao,Z.,Chakka,P.,Anthony,S.,&...Murthy,R.(2009).Hive-Awarehousingsolutionoveramap-reduceframework.ProceedingsOfTheVLDBEndowment,2(2),1626-1629.

• Connolly,T.M.&Begg,C.E.(2005).Databasesystems:apracticalapproachtodesign,implementationandmanagement.(4.,[rev.]ed.)Harlow:Addison-Wesley.

• Maposa,T.T.,Sethi,M.(2015).SQL-on-Hadoop:TheMostProbableFutureinBigDataAnalytics.AdvancesinComputerScienceandInformationTechnology(ACSIT),2(13),9-14.

• Srinivasa,K.G.,Mupalla,A.K.(2015).GuidetoHighPerformanceDistributedComputing,casestudieswithHadoop,ScaldingandSpark,Springer

• Soliman,M.,Petropoulos,M.,Waas,F.,Narayanan,S.,Krikellas,K.,&Baldwin,R.etal.(2014).Orca.ProceedingsOfThe2014ACMSIGMODInternationalConferenceOnManagementOfData-SIGMOD'14,337-348.http://dx.doi.org/10.1145/2588555.2595637

• Choudhury,K.(2015,October12).Truecaller'sIndiauserbasegrowsto100mn,BusinessStandard,NewDelhi.Retrievedfromhttp://www.business-standard.com/article/companies/truecaller-s-india-user-base-grows-to-100-mn-115101200037_1.html

• Rukmini,S.(2014,December13).Indialosingwaronspam,TheHindu.Retrievedfromhttp://www.thehindu.com/sci-tech/technology/10-lakh-complaints-about-spam-calls-in-the-last-three-years/article6689376.ece

• Peffers,K.,Tuunanen,T.,Rothenberger,M.A.,&Chatterjee,S.(2007).ADesignScienceResearchMethodologyforInformationSystemsResearch.JournalOfManagementInformationSystems,24(3),45-77.


32

8.2 Websites and blog posts • Truecaller.(2016).MAKINGYOURPHONESMARTER.Retrievedfrom

https://www.truecaller.com/about• Apache.(2016).HadoopWiki,Projectdescription.Retrievedfrom

http://wiki.apache.org/hadoop/ProjectDescription• ApacheHadoop.(2016).ApacheHadoopYARN.Retrievedfrom

http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html• ApacheSliderIncubator.(2016).ApacheSliderArchitecture.Retrievedfrom

https://slider.incubator.apache.org/design/architecture.html• Teradata.(2016).Presto,aFaster,MoreScalable,MoreFlexibleOpenSourceSQLonHadoop

Engine.Retrievedfromhttp://assets.teradata.com/resourceCenter/downloads/Datasheets/EB8901.pdf?processed=1

• Presto.(2016).Overview.Retrievedfromhttps://prestodb.io/overview.html• Cloudera.(2016).Whatishive?Retrievedfrom

https://archive.cloudera.com/cdh4/cdh/4/hive/• AllegroTech.(2015,June10).FastDataHackathon[Blogpost].Retrievedfrom

http://allegro.tech/2015/06/fast-data-hackathon.html• Pivotal.(2014,June25).PivotalHAWQBenchmarkDemonstratesUpTo21xFaster

PerformanceonHadoopQueriesThanSQL-likeSolutions[Blogpost].Retrievedfromhttps://blog.pivotal.io/big-data-pivotal/products/pivotal-hawq-benchmark-demonstrates-up-to-21x-faster-performance-on-hadoop-queries-than-sql-like-solutions

• Cloudera.(2016,February11).NewSQLBenchmarks:ApacheImpala(incubating)UniquelyDeliversAnalyticDatabasePerformance[Blogpost].Retrievedfromhttps://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/

• Fishman,J.(2014,July14).AppleSolarFrenzy:BenefitingShareholdersandthePlanetAlike[Blogpost].Retrievedfromhttp://solarstockideas.com/apple-solar-frenzy-benefiting-shareholders-planet-alike/


33

9. Appendices

9.1 Appendix A – Presto installation

Guide to install presto on hadoop cluster.

On docker container with cloudera one node cluster for test: Prerequisits:

§ HDFS§ YARN§ ZooKeeper§ Java§ Maven

Step 1. Install Slider to run presto application on yarn:

1. Create yarn user on containers hdfs with:

hdfs dfs -mkdir -p /user/yarn hdfs dfs -chown yarn:yarn /user/yarn

1. Downloadslider0.80.0foryarn(http://apache.mirrors.spacedump.net/incubator/slider/0.80.0-incubating/slider-assembly-0.80.0-incubating-all.tar.gz)

2. Untarfilewithcommand:tar -xvf slider-0.80.0-incubating-all.tar.gz

3. Opennewdirectoryandgotoslider-assembly/targetdirectoryandcopy0.80.0-incubating-all.ziptoanewfolderwhereyouwanttoinstallslider.

4. Unzipfileinfolder,thesliderfolderwillnowbecreated.5. Gotosliderfolder/confandopenfileslider-env.shsetenvvariables:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/ export HADOOP_CONF_DIR=/etc/hadoop/conf

6. Open file slider-client.xml change set values:

<property> <name>hadoop.registry.zk.quorum</name> <value>localhost:2181</value> </property> <property> <name>yarn.resourcemanager.address</name> <value>localhost:8032</value> </property> <property> <name>yarn.resourcemanager.scheduler.address</name> <value>localhost:8030</value> </property> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:8020</value> </property>

7. Test slider with command :

${slider-install-dir}/slider-0.80.0-incubating/bin/slider version


34

If you get output "Compiled against Hadoop 2.6.0", slider is installed right.

Step 2. Install presto:

1. Download presto-yarn package from github:(gitclonehttps://github.com/prestodb/presto-yarn.git)

2. Buildpackagewithmaven:mvn clean package

3. Go to folder created and folder presto-yarn-package/target and copy presto-yarn-package-1.0.0-SNAPSHOT-0.130.zip into slider folder.

4. Gobacktopresto-yarn-packagefolderandto/src/main/resourcesandcopyfilesappConfig.jsonandresources-singelnode.jsonintosliderfolder.

5. Insliderolderrenameresources-singlenode.jsontoresources.jsonandopenitandeditsettings:{"schema":"http://example.org/specification/v2.0.0","metadata":{},"global":{"yarn.vcores":"1"},"components":{"slider-appmaster":{},"COORDINATOR":{"yarn.role.priority":"1","yarn.component.instances":"1","yarn.memory":"1500"}}}

6. OpenappConfig.jsonandeditsettings:{"schema":"http://example.org/specification/v2.0.0","metadata":{},"global":{"site.global.app_user":"yarn","site.global.user_group":"hadoop","site.global.data_dir":"/var/lib/presto/data","site.global.config_dir":"/var/lib/presto/etc","site.global.app_name":"presto-server-0.130","site.global.app_pkg_plugin":"${AGENT_WORK_ROOT}/app/definition/package/plugins/","site.global.singlenode":"true","site.global.coordinator_host":"${COORDINATOR_HOST}","site.global.presto_query_max_memory":"50GB","site.global.presto_query_max_memory_per_node":"512MB","site.global.presto_server_port":"8080",


35

"site.global.catalog":"{'hive':['connector.name=hive-cdh5','hive.metastore.uri=thrift://${NN_HOST}:9083'],'tpch':['connector.name=tpch']}","site.global.jvm_args":"['-server','-Xmx1024M','-XX:+UseG1GC','-XX:G1HeapRegionSize=32M','-XX:+UseGCOverheadLimit','-XX:+ExplicitGCInvokesConcurrent','-XX:+HeapDumpOnOutOfMemoryError','-XX:OnOutOfMemoryError=kill-9%p']","application.def":".slider/package/PRESTO/presto-yarn-package-1.1-SNAPSHOT-0.130.zip","java_home":"/usr/lib/jvm/java-8-oracle/"},"components":{"slider-appmaster":{"jvm.heapsize":"128M"}}}Testdifferentsettingsforenvironmentnodememory1024MBwastoomuchforusandwechangedto:site.global.presto_query_max_memory_per_node":"512MB"

7. Insliderfolderchangeusertoyarn(suyarn)andruncommandtoinstallpresto:

bin/slider package --install --name PRESTO --package presto-yarn-package-*.zip

8. Run presto with command:

bin/slider create presto1 --template appConfig.json --resources resources.json

If exited with status 0 presto app shoul now be running.

9. Go to YARN resourcemanager web UI to check if presto is running (http://science2.truecaller.net:8088) : (See first row application presto1 running).

10. Create new folder and download presto client jar file (https://repository.sonatype.org/service/local/artifact/maven/content?r=central-proxy&g=com.facebook.presto&a=presto-cli&v=RELEASE)

11. Rename jar file to to only presto (mv *.jar presto). Run command to make it executable:

chmod +x presto

12. To test presto go to hive client and create a table with data. Now go back to folder with presto as yarn user (su yarn) and type command to run presto:

./presto --server localhost:8080 --catalog system --schema default (or which schema wanted to use)

type show tables; if table is showing, tryout a query.

On hdp3 cluster:


36

Step 1. Install Slider to run presto application on yarn:

1. Create yarn user on containers hdfs with:

hdfs dfs -mkdir -p /user/yarn hdfs dfs -chown yarn:yarn /user/yarn

1. Downloadslider0.80.0foryarn(http://apache.mirrors.spacedump.net/incubator/slider/0.80.0-incubating/slider-assembly-0.80.0-incubating-all.tar.gz)

2. Untarfilewithcommand:tar -xvf slider-0.80.0-incubating-all.tar.gz

3. Opennewdirectoryandgotoslider-assembly/targetdirectoryandcopy0.80.0-incubating-all.ziptoanewfolderwhereyouwanttoinstallslider.

4. Unzipfileinfolder,thesliderfolderwillnowbecreated.5. Gotosliderfolder/confandopenfileslider-env.shsetenvvariables:

export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-x64/ export HADOOP_CONF_DIR=/etc/hadoop/conf

6. Open file slider-client.xml change set values:

<property> <name>hadoop.registry.zk.quorum</name> <value>---------------- </property> <property> <name>yarn.resourcemanager.address</name> <value>-----</value> </property> <property> <name>yarn.resourcemanager.scheduler.address</name> <value> </value> </property> <property> <name>fs.defaultFS</name> <value>hdfs---8020</value> </property>

7. Test slider with command :

${slider-install-dir}/slider-0.80.0-incubating/bin/slider version

If you get output "Compiled against Hadoop 2.6.0", slider is installed right.

Step 2. Install presto:

1. Download presto-yarn package from github:(gitclonehttps://github.com/prestodb/presto-yarn.git)

2. Buildpackagewithmaven:mvn clean package

3. Go to folder created and folder presto-yarn-package/target and copy presto-yarn-package-1.0.0-SNAPSHOT-0.130.zip into slider folder.

4. Gobacktopresto-yarn-packagefolderandto/src/main/resourcesandcopyfilesappConfig.jsonandresources-singelnode.jsonintosliderfolder.

5. Insliderolderrenameresources-singlenode.jsontoresources.jsonandopenitandeditsettings:{"schema":"http://example.org/specification/v2.0.0","metadata":{},


37

"global":{"yarn.vcores":"8"},"components":{"slider-appmaster":{},"COORDINATOR":{"yarn.role.priority":"1","yarn.component.instances":"1","yarn.component.placement.policy":"1","yarn.memory":"4000"},"WORKER":{"yarn.role.priority":"2","yarn.component.instances":"13","yarn.component.placement.policy":"1","yarn.memory":"8000"}}

6. OpenappConfig.jsonandeditsettings:{"schema":"http://example.org/specification/v2.0.0","metadata":{},"global":{"site.global.app_user":"yarn","site.global.user_group":"hadoop","site.global.data_dir":"/var/lib/presto/data","site.global.config_dir":"/var/lib/presto/etc","site.global.app_name":"presto-server-0.142","site.global.app_pkg_plugin":"${AGENT_WORK_ROOT}/app/definition/package/plugins/","site.global.singlenode":"true","site.global.coordinator_host":"${COORDINATOR_HOST}","site.global.presto_query_max_memory":"50GB","site.global.presto_query_max_memory_per_node":"4GB","site.global.presto_server_port":"8080","site.global.catalog":"{'hive':['connector.name=hive-cdh5','hive.metastore.uri=thrift://hdp3.truecaller.net:9083'],'tpch':['connector.name=tpch']}","site.global.jvm_args":"['-server','-Xmx8192M','-XX:+UseG1GC','-XX:G1HeapRegionSize=32M','-XX:+UseGCOverheadLimit','-XX:+ExplicitGCInvokesConcurrent','-XX:+HeapDumpOnOutOfMemoryError','-XX:OnOutOfMemoryError=kill-9%p']","application.def":".slider/package/PRESTO/presto-yarn-package-1.1-SNAPSHOT-0.142.zip","java_home":"/usr/lib/jvm/jdk-8-oracle-x64/"},"components":{"slider-appmaster":{"jvm.heapsize":"128M"


38

}}}

7. Insliderfolderchangeusertoyarn(suyarn)andruncommandtoinstallpresto:

bin/slider package --install --name PRESTO --package presto-yarn-package-*.zip

8. Run presto with command:

bin/slider create presto1 --template appConfig.json --resources resources.json

If exited with status 0 presto app shoul now be running.

9. Go to YARN resourcemanager web UI to check if presto is running (See first row application presto1 running).

10. Create new folder and download presto client jar file (https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.142/presto-cli-0.142-executable.jar)

11. Rename jar file to to only presto (mv *.jar presto). Run command to make it executable:

chmod +x presto

12. To test presto go to hive client and create a table with data. Go to the resource manager web interface.find presto1 under running applications and find URL to coordinator node.

13. Now go back to folder with presto as yarn user (su yarn) and type command to run presto:

./presto --server h[URL to coordinator]:8080 --catalog hive --schema default (or which schema wanted to use)

typeshowtables;iftableisshowing,tryoutaquery

9.2 Appendix B – Use case queries

1.SELECT*withrestrictionANDLIMIT1:

SELECT*

FROMapp_events

WHEREUPPER(user.country_code)='IN'

ANDdt=20160221

ANDh=02


39

LIMIT10000;


SELECT*

FROMapp_events

WHEREdt=20160221

ANDappsearch.typeIN('1','2')

LIMIT10000;


SELECT*FROMapp_events

WHEREuser.register_id=2107422

ANDappsearch.type='2'

ANDdtbetween20160123AND20160125

LIMIT50;

4.SELECTCOUNT(*)withrestriction

SELECTCOUNT(*)FROMapp_events

WHEREupper(user.country_code)='IN'

ANDappsearch.type='2'

ANDdt=20160123;

5.SELECT*withrestrictionLIMITANDORDERBY:

SELECT*

FROMapp_events

WHEREdt=20160221


40

ANDuser.country_codeISNOTNULL


ORDERBYUPPER(user.country_code)

LIMIT10000;

6.SELECT*withrestrictionLIMITANDSORTBY:

SELECT*

FROMapp_events

WHEREdt=20160221

ANDh=02

ANDuser.country_codeISNOTNULL


SORTBYUPPER(user.country_code);

7.SimpleSELECTwithaggregates:

SELECT

SUM(appsearchv2.attempts[0].latency)sum_latency,

AVG(appsearchv2.attempts[0].latency)avg_latency

FROMdata.app_events

WHEREsize(appsearchv2.attempts)>0

ANDdtbetween20160130AND20160202

ANDupper(user.country_code)='IN'

ANDappsearchv2.attempts[0].endpointnotlike'%localhost%';

8.SELECTwithaggregatesANDGROUPBY:

SELECTuser.register_id,

appsearch.searchrequest.normalizedphonenumberasincoming_nr,


41

SUM(if(appsearch.type=2,1,0))incoming_calls,

SUM(if(appsearch.type=5,1,0))sms

FROMdata.app_events

WHEREdt=20160121

ANDappsearch.searchrequest.searchcountrycode='SE'

ANDappsearch.searchrequest.normalizedphonenumberISnotNULL

GROUPBYuser.register_id,

appsearch.searchrequest.normalizedphonenumber

LIMIT10000;

9.SELECTwithaggregateswithgroupANDorderby:





FROMdata.app_events

WHEREdt=20160121




appsearch.searchrequest.normalizedphonenumber

orderbyuser.register_id,

incoming_nr

LIMIT10000;

10.SELECTFROMinnerquerywithinnerGROUPBYANDouterorderby:


42

SELECTregister_id,

incoming_nr,

incoming_calls,

sms

FROM(





FROMdata.app_events

WHEREdt=20160125




appsearch.searchrequest.normalizedphonenumber)x

WHEREincoming_calls>0ORsms>0

orderbyregister_id

LIMIT10000;

11.JOINWITH2INNERQUERIESWITHAGGREGATESANDORDERBY:

SELECT

a.city,

COUNT(DISTINCTa.registerid)nr_IOS_profiles

FROM(

SELECT

LOWER(city)city,

registerid

FROMdata.user_profile_data

WHERELOWER(city)<>''


43

ANDdt=20160221

ANDregisterid<>0)a

INNERJOIN(

SELECT

id

FROMuserprimaryserviceprovider

WHERELOWER(os)LIKE'iphone%'

ANDdt=20160221)b

ONb.id=a.registerid

GROUPBYa.city

ORDERBYnr_IOS_profilesDESC

LIMIT10;

12.3joinswithaggregatesANDsortby:

SELECT

FROM_UNIXTIME(UNIX_TIMESTAMP()-86400,'yyyyMMdd')ASdt,

a.countrycode,

a.incoming_phonenr,

COALESCE(c.incoming_calls_24h,0L)incoming_calls_24h,

COALESCE(c.sms_24h,0L)sms_24h,

COALESCE(b.incoming_calls_3d,0L)incoming_calls_3d,

COALESCE(b.sms_3d,0L)sms_3d,

a.incoming_calls_7d,

a.sms_7d

FROM

(SELECT

countrycode,

incoming_phonenr,

incoming_calls_7d,


44

sms_7d

FROM(

SELECT

appsearch.searchrequest.searchcountrycodeascountrycode,

appsearch.searchrequest.normalizedphonenumberASincoming_phonenr,

SUM(if(appsearch.type=2,1,0))incoming_calls_7d,

SUM(if(appsearch.type=5,1,0))sms_7d

FROMdata.app_events

WHEREdt>=20160101ANDdt<=20160107

ANDupper(appsearch.searchrequest.searchcountrycode)IN('SE','US')

ANDappsearch.searchrequest.normalizedphonenumberISNOTNULL

GROUPBY

appsearch.searchrequest.searchcountrycode,


WHEREincoming_calls_7d>0ORsms_7d>0

SORTBYcountrycode)a

LEFTJOIN

(SELECT

countrycode,

incoming_phonenr,

incoming_calls_3d,

sms_3d

FROM(

SELECT



SUM(if(appsearch.type=2,1,0))incoming_calls_3d,

SUM(if(appsearch.type=5,1,0))sms_3d

FROMdata.app_events

WHEREdt>=20160101ANDdt<=20160103


45



GROUPBY



WHEREincoming_calls_3d>0ORsms_3d>0

SORTBYcountrycode)b

ONa.countrycode=b.countrycode

ANDa.incoming_phonenr=b.incoming_phonenr

LEFTJOIN

(SELECT

countrycode,

incoming_phonenr,

incoming_calls_24h,

sms_24h

FROM(

SELECT



SUM(if(appsearch.type=2,1,0))incoming_calls_24h,

SUM(if(appsearch.type=5,1,0))sms_24h

FROMdata.app_events

WHEREdt=20160101



GROUPBY



WHEREincoming_calls_24h>0ORsms_24h>0

SORTBYcountrycode)c


46

ONb.countrycode=c.countrycode

ANDb.incoming_phonenr=c.incoming_phonenr

SORTBYcountrycode

;

13.Createtableoutof3joinsFROMlargetablewithmultipleaggregatesANDGROUPBY.

CREATETABLEsahir.nr_search_events_30days_v2AS

SELECT

a.dt,

a.country_code,

COALESCE(b.hits,0L)AStotal_result_hits,

COALESCE(a.hits,0L)AS1_result_hits,

COALESCE(c.hits,0L)AS0_result_hits

FROM

(SELECT

dt,

LOWER(user.country_code)country_code,

COUNT(*)hits

FROMdata.app_events

WHEREdtBETWEEN20160201AND20160301

ANDSIZE(search.sources)>0

ANDLOWER(search.sources[0])LIKE'%ugc%'

ANDsearch.result_size=1

GROUPBY

dt,

LOWER(user.country_code))a

LEFTJOIN

(SELECT

dt,


47


COUNT(*)hits

FROMdata.app_events



ANDsearch.result_size>0

GROUPBY

dt,

LOWER(user.country_code))b

ONa.dt=b.dt

ANDLOWER(a.country_code)=LOWER(b.country_code)

LEFTJOIN

(SELECT

dt,


COUNT(*)hits

FROMdata.app_events



ANDsearch.result_size=0

GROUPBY

dt,

LOWER(user.country_code))c

ONb.dt=c.dt

ANDLOWER(b.country_code)=LOWER(c.country_code)

BACHELOR THESIS -...

Documents

Transcript of BACHELOR THESIS -...