Big Data Essentials
Copyright © 2016 by Anil K. Maheshwari, Ph.D.
By purchasing this book, you agree not to copy or distribute the book by any means, mechanical or electronic. No part of this book may be copied or transmitted without written permission.
Other books by the same author:
Data Analytics Made Accessible, the #1 Bestseller in Data Mining
Moksha: Liberation Through Transcendence
Preface
Big Data is a new, and inclusive, natural phenomenon. It is as messy as nature itself. It requires a new kind of Consciousness to fathom its scale and scope, and its many opportunities and challenges. Understanding the essentials of Big Data requires suspending many conventional expectations and assumptions about data… such as completeness, clarity, consistency, and conciseness. Fathoming and taming the multi-layered Big Data is a dream that is slowly becoming a reality. It is a rapidly evolving field that is growing exponentially in value and capabilities.
There is a growing number of books being written on Big Data. They fall mostly into two categories. The first kind focuses on business aspects, and discusses the strategic internal shifts required for reaping the business benefits from the many opportunities offered by Big Data. The second kind focuses on particular technology platforms, such as Hadoop or Spark. This book aims to bring together the business context and the technologies in a seamless way.
This book was written to meet the needs of an introductory Big Data course. It is meant for students, as well as executives, who wish to take advantage of emerging opportunities in Big Data. It provides an intuition of the wholeness of the field in simple language, free from jargon and code. All the essential Big Data technology tools and platforms, such as Hadoop, MapReduce, Spark, and NoSQL, are discussed. Most of the relevant programming details have been moved to the Appendices to ensure readability. The short chapters make it easy to quickly understand the key concepts. A complete case study of developing a Big Data application is included.
Thanks to Maharishi Mahesh Yogi for creating a wonderful university whose consciousness-based environment made writing this evolutionary book possible. Thanks to many current and former students for contributing to this book. Dheeraj Pandey assisted with the Web log analyzer application and its details. Suraj Thapalia assisted with the Hadoop installation guide. Enkhbileg Tseeleesuren helped write the Spark tutorial. Thanks to my family for supporting me in this process. My daughters Ankita and Nupur reviewed the book and made helpful comments. My father Mr. R. L. Maheshwari and brother Dr. Sunil Maheshwari also read the book and enthusiastically approved it. My colleague Dr. Edi Shivaji too reviewed the book.
May the Big Data Force be with you!
Dr. Anil Maheshwari
August 2016, Fairfield, IA
Contents
Preface
Chapter 1 – Wholeness of Big Data
Introduction
Understanding Big Data
CASELET: IBM Watson: A Big Data system
Capturing Big Data
Volume of Data
Velocity of Data
Variety of Data
Veracity of Data
Benefitting from Big Data
Management of Big Data
Organizing Big Data
Analyzing Big Data
Technology Challenges for Big Data
Storing Huge Volumes
Ingesting streams at an extremely fast pace
Handling a variety of forms and functions of data
Processing data at huge speeds
Conclusion and Summary
Organization of the rest of the book
Review Questions
Liberty Stores Case Exercise: Step B1
Section 1
Chapter 2 – Big Data Applications
Introduction
CASELET: Big Data Gets the Flu
Big Data Sources
People to People Communications
Social Media
People to Machine Communications
Web access
Machine to Machine (M2M) Communications
RFID tags
Sensors
Big Data Applications
Monitoring and Tracking Applications
Analysis and Insight Applications
New Product Development
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B2
Chapter 3 – Big Data Architecture
Introduction
CASELET: Google Query Architecture
Standard Big Data architecture
Big Data Architecture examples
IBM Watson
Netflix
eBay
VMWare
The Weather Company
Ticketmaster
PayPal
CERN
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B3
Section 2
Chapter 4 – Distributed Computing using Hadoop
Introduction
Hadoop Framework
HDFS Design Goals
Master-Slave Architecture
Block system
Ensuring Data Integrity
Installing HDFS
Reading and Writing Local Files into HDFS
Reading and Writing Data Streams into HDFS
Sequence Files
YARN
Conclusion
Review Questions
Chapter 5 – Parallel Processing with MapReduce
Introduction
MapReduce Overview
MapReduce programming
MapReduce Data Types and Formats
Writing MapReduce Programs
Testing MapReduce Programs
MapReduce Jobs Execution
How MapReduce Works
Managing Failures
Shuffle and Sort
Progress and Status Updates
Hadoop Streaming
Conclusion
Review Questions
Chapter 6 – NoSQL Databases
Introduction
RDBMS vs NoSQL
Types of NoSQL Databases
Architecture of NoSQL
CAP theorem
Popular NoSQL Databases
HBase
Architecture Overview
Reading and Writing Data
Cassandra
Architecture Overview
Reading and Writing Data
Hive Language
Hive Language Capabilities
Pig Language
Conclusion
Review Questions
Chapter 7 – Stream Processing with Spark
Introduction
Spark Architecture
Resilient Distributed Datasets (RDD)
Directed Acyclic Graph (DAG)
Spark Ecosystem
Spark for big data processing
MLlib
Spark GraphX
SparkR
Spark SQL
Spark Streaming
Spark applications
Spark vs Hadoop
Conclusion
Review Questions
Chapter 8 – Ingesting Data
Wholeness
Messaging Systems
Point to Point Messaging System
Publish-Subscribe Messaging System
Apache Kafka
Use Cases
Kafka Architecture
Producers
Consumers
Broker
Topic
Summary of Key Attributes
Distribution
Guarantees
Client Libraries
Apache ZooKeeper
Kafka Producer example in Java
Conclusion
Review Questions
References
Chapter 9 – Cloud Computing Primer
Introduction
Cloud Computing Characteristics
In-house storage
Cloud storage
Cloud Computing: Evolution of Virtualized Architecture
Cloud Service Models
Cloud Computing Myths
Cloud Computing: Getting Started
Conclusion
Review Questions
Section 3
Chapter 10 – Web Log Analyzer application case study
Introduction
Client-Server Architecture
Web Log analyzer
Requirements
Solution Architecture
Benefits of this solution
Technology stack
Apache Spark
Spark Deployment
Components of Spark
HDFS
MongoDB
Apache Flume
Overall Application logic
Technical Plan for the Application
Scala Spark code for log analysis
Sample Log data
Sample Input Data
Sample Output of Web Log Analysis
Conclusion and Findings
Review Questions
Chapter 11 – Data Mining Primer
Gathering and selecting data
Data cleansing and preparation
Outputs of Data Mining
Evaluating Data Mining Results
Data Mining Techniques
Mining Big Data
From Causation to Correlation
From Sampling to the Whole
From Dataset to Datastream
Data Mining Best Practices
Conclusion
Review Questions
Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
Creating a Cluster server on AWS, Installing Hadoop from Cloudera
Step 1: Creating Amazon EC2 Servers
Step 2: Connecting to the server and installing the required Cloudera distribution of Hadoop
Step 3: Word Count using MapReduce
Appendix 2: Spark Installation and Tutorial
Step 1: Verifying Java Installation
Step 2: Verifying Scala installation
Step 3: Downloading Scala
Step 4: Installing Scala
Step 5: Downloading Spark
Step 6: Installing Spark
Step 7: Verifying the Spark Installation
Step 8: Application: Word Count in Scala
Additional Resources
About the Author
Chapter 1 – Wholeness of Big Data
Introduction
Big Data is an all-inclusive term that refers to extremely large, very fast, diverse, and complex data that cannot be managed with traditional data management tools. Ideally, Big Data would harness all kinds of data, and deliver the right information, to the right person, in the right quantity, at the right time, to help make the right decision. Big Data can be managed by developing infinitely scalable, totally flexible, and evolutionary data architectures, coupled with the use of extremely cost-effective computing components. The infinite potential knowledge embedded within this cosmic computer would help connect everything to the Unified Field of all the laws of nature.
This book will provide a complete overview of Big Data for the executive and the data specialist. This chapter will cover the key challenges and benefits of Big Data, and the essential tools and technologies now available for organizing and manipulating Big Data.
Understanding Big Data
Big Data can be examined on two levels. On a fundamental level, it is data that can be analyzed and utilized for the benefit of the business. On another level, it is a special kind of data that poses unique challenges. This is the level that this book will focus on.
Figure 1.1: Big Data Context
At the level of business, data generated by business operations can be analyzed to generate insights that can help the business make better decisions. This makes the business grow bigger, and generate even more data, and the cycle continues. This is represented by the blue cycle on the top-right of Figure 1.1. This aspect is discussed in Chapter 11, a primer on Data Analytics.
On another level, Big Data is different from traditional data in every way: space, time, and function. The quantity of Big Data is 1,000 times more than that of traditional data. The speed of data generation and transmission is 1,000 times faster. The forms and functions of Big Data are much more diverse: from numbers to text, pictures, audio, videos, activity logs, machine data, and more. There are also many more sources of data, from individuals to organizations to governments, using a range of devices from mobile phones to computers to industrial machines. Not all data will be of equal quality and value. This is represented by the red cycle on the bottom-left of Figure 1.1. This aspect of Big Data, and its new technologies, is the main focus of this book.
Big Data is mostly unstructured data. Every type of data is structured differently, and will have to be dealt with differently. There are huge opportunities for technology providers to innovate and manage the entire life cycle of Big Data… to generate, gather, store, organize, analyze, and visualize this data.
CASELET: IBM Watson: A Big Data system
IBM created the Watson system as a way of pushing the boundaries of Artificial Intelligence and natural language understanding technologies. Watson beat the world champion human players of Jeopardy (a quiz-style TV show) in February 2011. Watson reads up on data about everything on the web, including the entire Wikipedia. It digests and absorbs the data based on simple generic rules such as: books have authors; stories have heroes; and drugs treat ailments. A Jeopardy clue, received in the form of a cryptic phrase, is broken down into many possible sub-clues of the correct answer. Each sub-clue is examined for the likelihood of its answer being the correct answer for the main problem. Watson calculates the confidence level of each possible answer. If the confidence level of an answer crosses a threshold level, Watson decides to offer that answer to the clue. It manages to do all this in a mere 3 seconds.
Watson is now being applied to diagnosing diseases, especially cancer. Watson can read all the new research published in the medical journals to update its knowledge base. It is being used to diagnose the probability of various diseases, by applying factors such as the patient's current symptoms, health history, genetic history, and medication records, to recommend a particular diagnosis. (Source: Smartest machines on Earth: youtube.com/watch?v=TCOhyaw5bwg)
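The confidence-threshold idea described above can be sketched in a few lines. This is an illustrative toy, not IBM's actual algorithm; the candidate answers, scores, and threshold are all hypothetical.

```python
# Toy sketch of threshold-based answering (NOT Watson's real logic):
# pick the highest-confidence candidate, and answer only when that
# confidence clears a threshold; otherwise stay silent.

def best_answer(candidates, threshold=0.5):
    """Return the highest-confidence candidate answer, or None if no
    candidate's confidence reaches the threshold."""
    if not candidates:
        return None
    answer, confidence = max(candidates.items(), key=lambda kv: kv[1])
    return answer if confidence >= threshold else None

# Hypothetical confidence scores for one Jeopardy-style clue
scores = {"Isaac Newton": 0.82, "Galileo": 0.11, "Kepler": 0.07}
print(best_answer(scores))                 # confident: offers "Isaac Newton"
print(best_answer({"A": 0.3, "B": 0.2}))   # too uncertain: prints None
```

The design point is that a system scoring many candidates should prefer silence over a low-confidence guess, which is exactly the behavior the caselet describes.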
Figure 1.2: IBM Watson playing Jeopardy
Q1: What kinds of Big Data knowledge, technologies, and skills are required to build a system like Watson? What kind of resources are needed?
Q2: Will doctors be able to compete with Watson in diagnosing diseases and prescribing medications? Who else could benefit from a system like Watson?
Capturing Big Data
If data were simply growing too large, or only moving too fast, or only becoming too diverse, it would be relatively easy. However, when the four Vs (Volume, Velocity, Variety, and Veracity) arrive together in an interactive manner, they create a perfect storm. While the Volume and Velocity of data drive the major technological concerns and the costs of managing Big Data, these two Vs are themselves being driven by the third V, the Variety of forms, functions, and sources of data.
Volume of Data
The quantity of data has been relentlessly doubling every 12-18 months. Traditional data is measured in Gigabytes (GB) and Terabytes (TB), but Big Data is measured in Petabytes (PB) and Exabytes (1 Exabyte = 1 million TB).
This data is so huge that it is almost a miracle that one can find any specific thing in it, in a reasonable period of time. Searching the world-wide web was the first true Big Data application. Google perfected the art of this application, and developed many of the path-breaking technologies we see today for managing Big Data.
The primary reason for the growth of data is the dramatic reduction in the cost of storing data. The costs of storing data have decreased by 30-40% every year. Therefore, there is an incentive to record everything that can be observed. This is called the 'datafication' of the world. The costs of computation and communication have also been coming down similarly. Another reason for the growth of data is the increase in the number of forms and functions of data. More about this in the Variety section.
Velocity of Data
If traditional data is like a lake, Big Data is like a fast-flowing river. Big Data is being generated by billions of devices, and communicated at the speed of the internet. Ingesting all this data is like drinking from a fire hose. One does not have control over how fast the data will come. A huge, unpredictable data stream is the new metaphor for thinking about Big Data.
The primary reason for the increased velocity of data is the increase in internet speed. Internet speeds available to homes and offices are now increasing from 10 MB/sec to 1 GB/sec (100 times faster). More people are getting access to high-speed internet around the world. Another important reason is the increased variety of sources that can generate and communicate data from anywhere, at any time. More on that in the Variety section.
Variety of Data
Big Data is inclusive of all forms of data, for all kinds of functions, from all sources and devices. If traditional data, such as invoices and ledgers, were like a small store, Big Data is the biggest imaginable shopping mall that offers unlimited variety. There are three major kinds of variety.
1. The first aspect of variety is the form of data. Data types range in order of simplicity and size from numbers to text, graph, map, audio, video, and others. There could be composite data that includes many elements in a single file. For example, text documents have text, graphs, and pictures embedded in them. Videos can have charts and songs embedded in them. Audio and video have different and more complex storage formats than numbers and text. Numbers and text can be more easily analyzed than an audio or video file. How should composite entities be stored and analyzed?
2. The second aspect is the variety of function of data. There are human chats and conversation data, songs and movies for entertainment, business transaction records, machine operations performance data, new product design data, old data for backup, and so on. Human communication data would be processed very differently from operational performance data, with totally different objectives. A variety of applications are needed to compare pictures in order to recognize people's faces; compare voices to identify the speaker; and compare handwriting to identify the writer.
3. The third aspect of variety is the source of data. Mobile phones and tablet devices enable a wide range of applications, or apps, to access and generate data from anywhere, at any time. Web access logs are another new and huge source of diagnostic data. ERP systems generate massive amounts of structured business transactional information. Sensors on machines, and RFID tags on assets, generate incessant and repetitive data. Broadly speaking, there are three types of sources of data: human-to-human communications; human-to-machine communications; and machine-to-machine communications. The sources of data, and the applications arising from that data, will be discussed in the next chapter.
Figure 1.3: Sources of Big Data (Source: Hortonworks.com)
Veracity of Data
Veracity relates to the believability and quality of data. Big Data is messy. There is a lot of misinformation and disinformation. The reasons for poor quality of data can range from human and technical error, to malicious intent.
1. The source of information may not be authoritative. For example, not all websites are equally trustworthy. Information from whitehouse.gov or from nytimes.com is more likely to be authentic and complete. Wikipedia is useful, but not all its pages are equally reliable. The communicator may have an agenda or a point of view.
2. The data may not be received correctly because of human or technical failure. Sensors and machines for gathering and communicating data may malfunction, and may record and transmit incorrect data. Urgency may require the transmission of the best data available at a point in time. Such data makes reconciliation with later, accurate, records more problematic.
3. The data provided and received may, however, also be intentionally wrong, for competitive or security reasons.
Data needs to be sifted and organized by quality factors for it to be put to any great use.
Benefitting from Big Data
Data usually belongs to the organization that generates it. There is other data, such as social media data, that is freely accessible under an open general license. Organizations can use this data to learn about their consumers, improve their service delivery, and design new products to delight their customers and gain a competitive advantage. Data is also like a new natural resource. It is being used to design new digital products, such as on-demand entertainment and learning.
Organizations may choose to gather and store this data for later analysis, or to sell it to other organizations who might benefit from it. They may also legitimately choose to discard parts of their data for privacy or legal reasons. However, organizations cannot afford to ignore Big Data. Organizations that do not learn to engage with Big Data could find themselves left far behind their competition, landing in the dustbin of history. Innovative small and new organizations can use Big Data to quickly scale up and beat larger and more mature organizations.
Big Data applications exist in all industries and aspects of life. There are three major types of Big Data applications: Monitoring and Tracking, Analysis and Insight, and new digital product development.
Monitoring and Tracking Applications: Consumer goods producers use monitoring and tracking applications to understand the sentiments and needs of their customers. Industrial organizations use Big Data to track inventory in massive interlinked global supply chains. Factory owners use it to monitor machine performance and do preventive maintenance. Utility companies use it to predict energy consumption, and manage demand and supply. Information Technology companies use it to track website performance and improve its usefulness. Financial organizations use it to project trends better and make more effective and profitable bets.
Analysis and Insight: Political organizations use Big Data to micro-target voters and win elections. Police use Big Data to predict and prevent crime. Hospitals use it to better diagnose diseases and prescribe medicines. Ad agencies use it to design more targeted marketing campaigns quickly. Fashion designers use it to track trends and create more innovative products.
Figure 1.4: The first Big Data President
New Product Development: Incoming data could be used to design new products such as reality TV entertainment. Stock market feeds could be a digital product. This area needs much more development.
Management of Big Data
Many organizations have started initiatives around the use of Big Data. However, most organizations do not necessarily have a grip on it. Here are some emerging insights into making better use of Big Data.
1. Across all industries, the business case for Big Data is strongly focused on addressing customer-centric objectives. The first focus of deploying Big Data initiatives is to protect and enhance customer relationships and customer experience.
2. Solve a real pain-point. Big Data should be deployed for specific business objectives, so that management is not overwhelmed by the sheer size of it all.
3. Organizations are beginning their pilot implementations by using existing and newly accessible internal sources of data. It is better to begin with data under one's control, and where one has a superior understanding of the data.
4. Put humans and data together to get the most insight. Combining data-based analysis with human intuition and perspectives is better than going just one way.
5. Advanced analytical capabilities are required, but lacking, for organizations to get the most value from Big Data. There is a growing awareness of building or hiring those skills and capabilities.
6. Use more diverse data, not just more data. This would provide a broader perspective into reality, and better quality insights.
7. The faster you analyze the data, the greater its predictive value. The value of data depreciates with time. If the data is not processed in five minutes, then the immediate advantage is lost.
8. Don't throw away data if no immediate use can be seen for it. Data has value beyond what you initially anticipate. Data can add perspective to other data later on, in a multiplicative manner.
9. Maintain one copy of your data, not multiple. This would help avoid confusion and increase efficiency.
10. Plan for exponential growth. Data is expected to continue to grow at exponential rates. Storage costs continue to fall, data generation continues to grow, and data-based applications continue to grow in capability and functionality.
11. A scalable and extensible information management foundation is a prerequisite for Big Data advancement. Big Data builds upon a resilient, secure, efficient, flexible, and real-time information processing environment.
12. Big Data is transforming business, just like IT did. Big Data is a new phase representing a digital world. Business and society are not immune to its strong impacts.
Organizing Big Data
Good organization depends upon the purpose of the organization.
Given huge quantities of data, it would be desirable to organize the data to speed up the search for any specific, desired item in the entire data. The cost of storing and processing the data, too, would be a major driver for the choice of an organizing pattern.
Given the fast speed of data, it would be desirable to create a scalable number of ingest points. It would also be desirable to create at least a thin veneer of control over the data, by maintaining counts and averages over time, unique values received, and so on.
Given the variety in form factors, data needs to be stored and analyzed differently. Videos need to be stored separately and used for serving in a streaming mode. Text data may be combined, cleaned, and visualized for themes and sentiments.
Given the different quality levels of data, the various data sources may need to be ranked and prioritized before serving them to the audience. For example, the quality of a web page may be computed through a PageRank mechanism.
Analyzing Big Data
Big Data can be analyzed in two ways, called analyzing Big Data in motion and analyzing Big Data at rest. The first way is to process the incoming stream of data in real time for quick and effective statistics about the data. The other way is to store and structure the data, and apply standard analytical techniques on batches of data to generate insights. These could then be visualized using real-time dashboards. Big Data can be utilized to visualize a flowing or a static situation. The nature of processing this huge, diverse, and largely unstructured data is limited only by one's imagination.
Figure 1.5: Big Data Architecture
A million points of data can be plotted in a graph to offer a view of the density of the data. However, plotting a million points on a graph may produce a blurred image, which may hide, rather than highlight, the distinctions. In such a case, binning the data would help, or selecting the top few frequent categories may deliver greater insights. Streaming data can also be visualized by simple counts and averages over time. For example, below is a dynamically updated chart that shows up-to-date statistics of visitor traffic to my blog site, anilmah.com. The bar shows the number of page views, and the inner darker bar shows the number of unique visitors. The dashboard could also show the view by day, week, or year.
Figure 1.6: Real-time dashboard of website performance for the author's blog
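The binning idea above can be sketched with the standard library alone: instead of plotting a million individual points, count how many fall into each bin and plot the bin counts. The data here is randomly generated for illustration:

```python
# Binning sketch: summarize a million points as per-bin counts.
import random

random.seed(42)
# A million hypothetical measurements centered around 50
points = [random.gauss(50, 15) for _ in range(1_000_000)]

def bin_counts(values, bin_width=10):
    """Map each value to the left edge of its bin and count per bin."""
    counts = {}
    for v in values:
        edge = int(v // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return counts

counts = bin_counts(points)
# A handful of bin counts now summarize the density of a million points
print(len(counts), "bins for", sum(counts.values()), "points")
```

The densest bins cluster around the mean of the data, so the histogram reveals the shape that a blur of a million plotted points would hide.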
Text data could be combined, filtered, cleaned, thematically analyzed, and visualized in a word cloud. Here is a word cloud from a recent stream of tweets (i.e., Twitter messages) from US Presidential candidates Hillary Clinton and Donald Trump. Larger words imply a greater frequency of occurrence in the tweets. This can help in understanding the major topics of discussion for the two.
Figure 1.7: A word cloud of Hillary Clinton's and Donald Trump's tweets
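A word cloud starts from a word-frequency table. A minimal sketch of that first step, using only the standard library; the tweets and the stop-word list are made up for illustration:

```python
# Count word frequencies across a set of texts, dropping common
# stop words, as the input to a word-cloud renderer.
from collections import Counter

tweets = [
    "jobs jobs jobs for america",
    "healthcare and jobs for every family",
    "america first",
]

def word_frequencies(texts, stopwords=frozenset({"and", "for", "the", "every"})):
    words = (w for t in texts for w in t.lower().split())
    return Counter(w for w in words if w not in stopwords)

freq = word_frequencies(tweets)
print(freq.most_common(2))  # [('jobs', 4), ('america', 2)]
```

A renderer then just scales each word's font size by its count, which is why frequent words appear larger in the cloud.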
Technology Challenges for Big Data
There are four major technological challenges in managing Big Data, and matching layers of technologies to meet them.
Storing Huge Volumes
The first challenge relates to storing huge quantities of data. No machine can be big enough to store the relentlessly growing quantity of data. Therefore, data needs to be stored on a large number of smaller, inexpensive machines. However, with a large number of machines, there is the inevitable challenge of machine failure. Each of these commodity machines will fail at some point or another. The failure of a machine could entail a loss of the data stored on it.
The first layer of Big Data technology helps store huge volumes of data, while avoiding the risk of data loss. It distributes data across a large cluster of inexpensive commodity machines, and ensures that every piece of data is stored on multiple machines, to guarantee that at least one copy is always available. Hadoop is the most well-known clustering technology for Big Data. Its data storage pattern is called the Hadoop Distributed File System (HDFS). This system is built on the patterns of Google's file system, designed to store billions of pages and sort them to answer user search queries.
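The replication idea can be illustrated with a toy simulation. This is a sketch of the principle only, not HDFS's actual placement policy; the node names and replication factor of three are illustrative (three happens to be HDFS's common default):

```python
# Toy model of replicated block storage: each block lives on several
# distinct machines, so losing any one machine loses no data.
import random

def place_blocks(blocks, machines, replicas=3):
    """Assign each block to `replicas` distinct machines
    (a stand-in for the master node's placement decision)."""
    return {b: random.sample(machines, replicas) for b in blocks}

def survives_failure(placement, failed_machine):
    """True if every block still has at least one live copy."""
    return all(any(m != failed_machine for m in nodes)
               for nodes in placement.values())

machines = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(["blk_0", "blk_1", "blk_2"], machines)

# With 3 copies per block, no single machine failure can lose data
print(all(survives_failure(placement, m) for m in machines))  # True
```

The trade-off is storage overhead: three copies triple the raw storage needed, which is why this design leans on cheap commodity disks.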
Ingesting streams at an extremely fast pace
The second challenge relates to the Velocity of data, i.e. handling torrential streams of data. Some streams may be too large to store, but must still be ingested and monitored. The solution lies in creating special ingesting systems that can open an unlimited number of channels for receiving data. These queuing systems can hold data, from which consumer applications can request and process data at their own pace.
Big Data technology manages this velocity problem using a special stream-processing engine, where all incoming data is fed into a central queuing system. From there, a fork-shaped system sends data in both the batch processing and the stream processing directions. The stream processing engine can do its work while the batch processing does its work. Apache Spark is the most popular system for streaming applications.
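The fork-shaped pattern can be sketched in miniature: every incoming record goes both to a batch store (kept for later analysis) and to a stream processor (which maintains running statistics immediately). This is a toy illustration of the pattern, not Spark's API; the sensor readings are hypothetical:

```python
# Toy fork: each record feeds both the batch path and the stream path.

batch_store = []                             # batch side: keep everything
stream_stats = {"count": 0, "total": 0.0}    # stream side: running aggregates

def ingest(value):
    batch_store.append(value)                # batch direction
    stream_stats["count"] += 1               # stream direction
    stream_stats["total"] += value

for reading in [3.0, 5.0, 4.0, 8.0]:         # a small incoming stream
    ingest(reading)

print(stream_stats["total"] / stream_stats["count"])  # running average: 5.0
print(len(batch_store))                               # records kept: 4
```

The stream side answers "what is happening right now" with tiny constant-size state, while the batch side preserves the full data for deeper analysis later.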
Handling a variety of forms and functions of data
The third challenge relates to the structuring and access of all the varieties of data that comprise Big Data. Storing them in traditional flat or relational file structures would be too wasteful and slow. The third layer of Big Data technology solves this problem by storing the data in non-relational systems that relax many of the stringent conditions of the relational model. These are called NoSQL (Not Only SQL) databases.
HBase and Cassandra are two of the better-known NoSQL database systems. HBase, for example, stores each data element separately, along with its key identifying information. This is called a key-value pair format. Cassandra stores data in a column-family format. There are many other variants of NoSQL databases. NoSQL languages, such as Pig and Hive, are used to access this data.
Processing data at huge speeds
The fourth challenge relates to moving large amounts of data from storage to the processor, as this would consume enormous network capacity and choke the network. The alternative, innovative mode is to move the processor to the data.
The second layer of Big Data technology avoids this choking of the network. It distributes the task logic throughout the cluster of machines where the data is stored. Those machines work, in parallel, on the data assigned to them. A follow-up process consolidates the outputs of all the small tasks and delivers the final results. MapReduce, also invented by Google, is the best-known technology for parallel processing of distributed Big Data.
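The map, shuffle, and reduce phases can be sketched in pure Python for the classic word-count job. This mirrors the logical phases only; in a real cluster the map and reduce tasks run in parallel on the machines holding the data:

```python
# Pure-Python sketch of the three logical phases of MapReduce.
from collections import defaultdict

def map_phase(lines):
    """Each 'map' task emits a (word, 1) pair for every word it sees."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    """The 'shuffle and sort' step groups all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Each 'reduce' task sums the counts for its words."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "data moves fast"]
result = reduce_phase(shuffle_phase(map_phase(lines)))
print(result["big"], result["data"])  # 2 2
```

Because each map task touches only its own lines and each reduce task only its own keys, the work parallelizes naturally across the cluster, which is the whole point of moving the processing to the data.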
Table 1.1: Technological challenges and solutions for Big Data

Challenge | Description | Solution | Technology
Volume | Avoid the risk of data loss from machine failure in clusters of commodity machines | Replicate segments of data on multiple machines; a master node keeps track of segment locations | HDFS
Volume & Velocity | Avoid choking of network bandwidth by moving large volumes of data | Move processing logic to where the data is stored; manage using parallel processing algorithms | MapReduce
Variety | Efficient storage of large and small data objects | Columnar databases using a key-value pair format | HBase, Cassandra
Velocity | Monitoring streams too large to store | Fork-shaped architecture to process data as a stream and as batches | Spark
Once these major technological challenges are met, all traditional analytical and presentation tools can be applied to Big Data. There are many additional supportive technologies that make the task of managing Big Data easier. For example, a resource manager (such as YARN) can help monitor the resource usage and load balancing of the machines in the cluster.
Conclusion and Summary
Big Data is a major phenomenon that impacts everyone, and is an opportunity to create new ways of working. Big Data is extremely large, complex, fast, and not always clean. It is data that comes from many sources, such as people, web, and machine communications. It needs to be gathered, organized, and processed in a cost-effective way that manages the volume, velocity, variety, and veracity of Big Data. Hadoop and Spark systems are popular technological platforms for this purpose. Here is a list of the many differences between traditional and Big Data.
Table 1.2: Comparing Big Data with Traditional Data

Feature | Traditional Data | Big Data
Representative structure | Lake / pool | Flowing stream / river
Primary purpose | Manage business activities | Communicate, monitor
Source of data | Business transactions, documents | Social media, web access logs, machine-generated
Volume of data | Gigabytes, Terabytes | Petabytes, Exabytes
Velocity of data | Ingest level is controlled | Real-time, unpredictable ingest
Variety of data | Alphanumeric | Audio, video, graphs, text
Veracity of data | Clean, more trustworthy | Varies depending on source
Structure of data | Well-structured | Semi- or unstructured
Physical storage of data | In a storage area network | Distributed clusters of commodity computers
Database organization | Relational databases | NoSQL databases
Data access | SQL | NoSQL languages such as Pig
Data manipulation | Conventional data processing | Parallel processing
Data visualization | Variety of tools | Dynamic dashboards with simple measures
Database tools | Commercial systems | Open source: Hadoop, Spark
Total cost of system | Medium to high | High
Organization of the rest of the book
This book will cover applications, architectures, and the essential Big Data technologies. The rest of the book is organized as follows.
Section 1 will discuss sources, applications, and architectural topics. Chapter 2 will discuss a few compelling business applications of Big Data, based on an understanding of the different sources and formats of data. Chapter 3 will cover some examples of architectures used by many Big Data applications.
Section 2 will discuss the six major technology elements identified in the Big Data Ecosystem (Figure 1.5). Chapter 4 will discuss Hadoop and how its Distributed File System (HDFS) works. Chapter 5 will discuss MapReduce and how this parallel processing algorithm works. Chapter 6 will discuss NoSQL databases, to learn how to structure the data into databases for fast access. The Pig and Hive languages, for data access, will be included. Chapter 7 will cover streaming data, and the systems for ingesting and processing this data. This chapter will cover Spark, an integrated, in-memory processing toolset to manage Big Data. Chapter 8 will cover data ingest systems, with Apache Kafka. Chapter 9 will be a primer on Cloud Computing technologies, used for renting storage and computers at third-party locations.
Section 3 will include primers and tutorials. Chapter 10 will present a case study on the web log analyzer, an application that ingests a log of a large number of web request entries every day and creates summary and exception reports. Chapter 11 will be a primer on data analytics technologies for analyzing data. A full treatment can be found in my book, Data Analytics Made Accessible. Appendix 1 will be a tutorial on installing a Hadoop cluster on the Amazon EC2 cloud. Appendix 2 will be a tutorial on installing and using Spark.
Review Questions
Q1: What is Big Data? Why should anyone care?
Q2: Describe the 4V model of Big Data.
Q3: What are the major technological challenges in managing Big Data?
Q4: What are the technologies available to manage Big Data?
Q5: What kinds of analyses can be done on Big Data?
Q6: Watch the Cloudera CEO present the evolution of Hadoop at https://www.youtube.com/watch?v=S9xnYBVqLws. Why did people not pay attention to Hadoop and MapReduce when they were introduced? What implications does that have for emerging technologies?
Liberty Stores Case Exercise: Step B1
Liberty Stores Inc. is a specialized global retail chain that sells organic food, organic clothing, wellness products, and education products to enlightened LOHAS (Lifestyles of the Healthy and Sustainable) citizens worldwide. The company is 20 years old, and is growing rapidly. It now operates in 5 continents, 50 countries, and 150 cities, and has 500 stores. It sells 20,000 products and has 10,000 employees. The company has revenues of over $5 billion and a profit of about 5% of its revenue. The company pays special attention to the conditions under which its products are grown and produced. It donates about one-fifth (20%) of its pre-tax profits to global and local charitable causes.
Q1: Create a comprehensive Big Data strategy for the CEO of the company.
Q2: How can Big Data systems such as IBM Watson help this company?
Section 1
This section covers three important high-level topics.
Chapter 2 will cover big data sources, and many applications in many industries.
Chapter 3 will cover architectures for managing big data.
Chapter 2 – Big Data Applications
Introduction
If a traditional software application is a lovely cat, then a Big Data application is a powerful tiger. An ideal Big Data application will take advantage of all the richness of data and produce relevant information to make the organization responsive and successful. Big Data applications can align the organization with the totality of natural laws, the source of all success.
Companies like the consumer goods giant Procter & Gamble have inserted Big Data into all aspects of their planning and operations. The industrial giant Volkswagen asks all its business units to identify some realistic initiative using Big Data to grow their unit's sales. The entertainment giant Netflix processes 400 billion user actions every day, making it one of the biggest users of Big Data.
Figure 2.1: A Big Data application is a powerful tiger (Source: Flickr.com)
CASELET:BigDataGetstheFluGoogleFluTrendswasanenormouslysuccessfulinfluenzaforecastingservice,pioneeredbyGoogle.ItemployedBigData,suchasthestreamofsearchtermsusedinitsubiquitousInternetsearchservice.TheprogramaimedtobetterpredictfluoutbreaksusingdataandinformationfromtheU.S.CentersforDiseaseControlandPrevention(CDC).Whatwasmostamazingwasthatthisapplicationwasabletopredicttheonsetofflu,almosttwoweeksbeforeCDCsawitcoming.From2004tillabout2012itwasabletosuccessfullypredictthetimingandgeographicallocationofthearrivalofthefluseasonaroundtheworld.
Figure2‑0‑2:GoogleFlutrends
However,itfailedspectacularlytopredictthe2013fluoutbreak.DatausedtopredictEbola’sspreadin2014-15yieldedwildlyinaccurateresults,andcreatedamajorpanic.Newspapersacrosstheglobespreadthisapplication’sworst-casescenariosfortheEbolaoutbreakof2014.
Google Flu Trends failed for two reasons: Big Data hubris, and algorithm dynamics. (a) The sheer quantity of data does not mean that one can ignore foundational issues of measurement, construct validity and reliability, and dependencies among data. (b) Google Flu Trends predictions were based on a commercial search algorithm that frequently changes, based on Google's business goals. This uncertainty skewed the data in ways even Google engineers did not understand, skewing the accuracy of predictions. Perhaps the biggest lesson is that there is far less information in the data typically available in the early stages of an outbreak than is needed to parameterize the test models.
Q1: What lessons would you learn from the death of a prominent and highly successful Big Data application?
Q2: What other Big Data applications can be inspired by the success of this application?
Big Data Sources
Big Data is inclusive of all data about all activities everywhere. It can, thus, potentially transform our perspective on life and the universe. It brings new insights in real-time, and can make life happier and make the world more productive. Big Data can, however, also bring perils, in terms of violation of privacy, and social and economic disruption.
There are three major categories of data sources: human communications, human-machine communications, and machine-machine communications.
People to People Communications
People and corporations increasingly communicate over electronic networks. Distance and time have been annihilated. Everyone communicates through phone and email. News travels instantly. Influential networks have expanded. The content of communication has become richer and multimedia. High-resolution cameras in mobile phones enable people to take pictures and videos, and instantly share them with friends and family. All these communications are stored in the facilities of many intermediaries, such as telecom and internet service providers. Social media is a new, but particularly transformative, type of human-human communications.
Social Media
Social media platforms such as Facebook, Twitter, LinkedIn, YouTube, Flickr, Tumblr, Skype, Snapchat, and others have become an increasingly intimate part of modern life. These are among the hundreds of social media platforms that people use, and they generate huge streams of text, pictures, videos, logs, and other multimedia data.
People share messages and pictures through social media such as Facebook and YouTube. They share photo albums through Flickr. They communicate in short asynchronous messages with each other on Twitter. They make friends on Facebook, and follow others on Twitter. They do video conferencing using Skype, and leaders deliver messages that sometimes go viral through social media. All these data streams are part of Big Data, and can be monitored and analyzed to understand many phenomena, such as patterns of communication, as well as the gist of the conversations. These media have been used for a wide variety of purposes with stunning effects.
Figure 2-0-3: Sampling of major social media
People to Machine Communications
Sensors and the web are two of the kinds of machines that people communicate with. Personal assistants such as Siri and Cortana are the latest in man-machine communications, as they try to understand human requests in natural language, and fulfil them. Wearable devices such as Fitbit and smart watches are smart devices that read, store, and analyze people's personal data, such as blood pressure and weight, food and exercise data, and sleep patterns. The world-wide web is like a knowledge machine that people interact with to get answers to their queries.
Web access
The world-wide web has integrated itself into all parts of human and machine activity. The usage of the tens of billions of pages by billions of web users generates huge amounts of enormously valuable clickstream data. Every time a web page is requested, a log entry is generated at the provider end. The web page provider tracks the identity of the requesting device and user, and the time and spatial location of each request. On the requester side, there are certain small pieces of computer code and data, called cookies, which track the web pages received, the date/time of access, and some identifying information about the user. All the web access logs and cookie records can provide web usage records that can be analyzed for discovering opportunities for marketing purposes.
A web log analyzer is an application required to monitor streaming web access logs in real-time to check on website health and to flag errors. A detailed case study of the practical development of this application is shown in Chapter 8.
Machine to Machine (M2M) Communications
M2M communications is also sometimes called the Internet of Things (IoT). A trillion devices are connected to the internet, and they communicate with each other or some master machines. All this data can be accessed and harnessed by the makers and owners of those machines.
Machines and equipment have many kinds of sensors to measure certain environmental parameters, which can be broadcast to communicate their status. RFID tags and sensors embedded in machines help generate the data. Containers on ships are tagged with RFID tags that convey their location to all those who can listen. Similarly, when pallets of goods are moved in warehouses or large retail stores, those pallets contain electromagnetic (RFID) tags that convey their location. Cars carry an RFID transponder to identify themselves to automated toll booths and pay the tolls. Robots in a factory, and internet-connected refrigerators in a house, continually broadcast a 'heartbeat' that they are functioning normally. Surveillance videos using commodity cameras are another major source of machine-generated data.
Automobiles contain sensors that record and communicate operational data. A modern car can generate many megabytes of data every day, and there are more than 1 billion motor vehicles on the road. Thus the automotive industry itself generates huge amounts of data. Self-driving cars will only add to the quantity of data generated.
RFID tags
An RFID tag is a radio transmitter with a little antenna that can respond to, and communicate essential information to, special readers through a Radio Frequency (RF) channel. A few years ago, major retailers such as Walmart decided to invest in RFID technology to take the retail industry to a new level. They forced their suppliers to invest in RFID tags on the supplied products. Today, almost all retailers and manufacturers have implemented RFID-tag-based solutions.
Figure 2-0-4: A small passive RFID tag
Here is how an RFID tag works. When a passive RFID tag comes in the vicinity of an RF reader and is 'tickled', the tag responds by broadcasting a fixed identifying code. An active RFID tag has its own battery and storage, and can store and communicate a lot more information. Every reading of a message from an RFID tag by an RF reader creates a log entry. Thus there is a steady stream of data from every reader as it records information about all the RFID tags in its area of influence. The records may be logged regularly, and thus there will be many more records than are necessary to track the location and movement of an item. All the duplicate and redundant records are removed, to produce clean, consolidated data about the location and status of items.
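The consolidation step described above can be sketched in a few lines. This is a hypothetical, single-machine illustration (the class, record, and field names are invented for this example): keep only the most recent reading per tag, and discard the redundant log entries.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: consolidate raw RFID reader logs by keeping only
// the latest sighting of each tag, discarding duplicate/redundant records.
public class RfidConsolidator {

    // One reader log entry: the tag's fixed code, reader location, timestamp
    public record Reading(String tagId, String location, long timestamp) {}

    // Returns the most recent reading per tag id
    public static Map<String, Reading> consolidate(List<Reading> log) {
        Map<String, Reading> latest = new HashMap<>();
        for (Reading r : log) {
            Reading prev = latest.get(r.tagId());
            if (prev == null || r.timestamp() > prev.timestamp()) {
                latest.put(r.tagId(), r);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Reading> log = List.of(
            new Reading("TAG-1", "dock", 100),
            new Reading("TAG-1", "dock", 101),     // duplicate sighting
            new Reading("TAG-1", "aisle-4", 250)); // the item has moved
        // The consolidated view reports only the item's latest location
        System.out.println(consolidate(log).get("TAG-1").location()); // aisle-4
    }
}
```

In a real deployment the same logic would run continuously over the stream of reader logs, rather than over an in-memory list.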
Sensors
A sensor is a small device that can observe and record physical or chemical parameters. Sensors are everywhere. A photo sensor in an elevator or train door can sense if someone is moving, and thus keep the door from closing. A CCTV camera can record a video for surveillance purposes. A GPS device can record its geographical location every moment.
Figure 2-0-5: An embedded sensor
Temperature sensors in a car can measure the temperature of the engine, the tires, and more. The thermostats in buildings and refrigerators, too, have temperature sensors. A pressure sensor can measure the pressure inside an industrial boiler.
Big Data Applications
Monitoring and Tracking Applications
Public Health Monitoring
The US government is encouraging all healthcare stakeholders to establish a national platform for interoperability and data sharing standards. This would enable secondary use of health data, which would advance Big Data analytics and personalized holistic precision medicine. This would be a broad-based platform, like the Google Flu Trends case.
Consumer Sentiment Monitoring
Social media has become more powerful than advertising. Many consumer goods companies have moved a bulk of their marketing budgets from traditional advertising media into social media. They have set up Big Data listening platforms, where social media data streams (including tweets, Facebook posts, and blog posts) are filtered and analyzed for certain keywords or sentiments, by certain demographics and regions. Actionable information from this analysis is delivered to marketing professionals for appropriate action, especially when the product is new to the market.
Figure 2-0-6: Architecture for a Listening Platform (source: Intelligenthq.com)
Asset tracking
The US Department of Defense is encouraging the industry to devise a tiny RFID chip that could prevent the counterfeiting of electronic parts that end up in avionics or circuit boards for other devices. Airplanes are one of the heaviest users of sensors, which track every aspect of the performance of every part of the plane. The data can be displayed on the dashboard, as well as stored for later detailed analysis. Working with communicating devices, these sensors can produce a torrent of data.
Theft by visitors, shoppers, and even employees is a major source of loss of revenue for retailers. All valuable items in the store can be assigned RFID tags, and the gates of the store can be equipped with RF readers. This helps secure the products, and reduce leakage (theft) from the store.
Supply chain monitoring
All containers on ships communicate their status and location using RFID tags. Thus, retailers and their suppliers can gain real-time visibility into the inventory throughout the global supply chain. Retailers can know exactly where the items are in the warehouse, and so can bring them into the store at the right time. This is particularly relevant for seasonal items that need to be sold on time, or else they will be sold at a discount. With item-level RFID tags, retailers also gain full visibility of each item and can serve their customers better.
Electricity Consumption Tracking
Electric utilities can track the status of generating and transmission systems, and also measure and predict the consumption of electricity. Sophisticated sensors can help monitor voltage, current, frequency, temperature, and other vital operating characteristics of huge and expensive electric distribution infrastructure. Smart meters can measure the consumption of electricity at regular intervals of one hour or less. This data is analyzed to make real-time decisions to maximize power capacity utilization and total revenue generation.
Preventive Machine Maintenance
All machines, including cars and computers, will fail sometime, because one or more of their components will fail. Any precious equipment can be equipped with sensors. The continuous stream of data from the sensors can be monitored and analyzed to forecast the status of key components, and thus monitor the overall machine's health. Preventive maintenance can be scheduled to reduce the cost of downtime.
Analysis and Insight Applications
Big Data can be structured and analyzed using data mining techniques to produce insights and patterns that can be used to make business better.
Predictive Policing
The Los Angeles Police Department (LAPD) invented the concept of predictive policing. The LAPD worked with UC Berkeley researchers to analyze its large database of 13 million crimes recorded over 80 years, and predicted the likelihood of crimes of certain types, at certain times, and in certain locations. They identified hotspots of crime, where crimes had occurred and where crime was likely to happen in the future. Crime patterns were mathematically modeled after a simple insight borrowed from the metaphor of earthquakes and their aftershocks. In essence, it said that once a crime occurred in a location, it represented a certain disturbance in harmony, and would thus lead to a greater likelihood of a similar crime occurring in the local vicinity in the near future. The model showed, for each police beat, the specific neighborhood blocks and specific time slots where crime was likely to occur.
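The earthquake-aftershock metaphor corresponds to what statisticians call a self-exciting point process. A minimal illustrative form of its conditional intensity (the expected rate of crimes at time $t$ in a given block; this is a sketch of the general idea, not the LAPD model's exact specification) is:

\[
\lambda(t) \;=\; \mu \;+\; \sum_{t_i < t} \kappa \, e^{-\omega (t - t_i)}
\]

Here $\mu$ is the background crime rate for the block, the sum runs over the times $t_i$ of past crimes, and each past crime temporarily raises the predicted rate by $\kappa$, decaying exponentially at rate $\omega$, just as aftershocks cluster in time and space after an earthquake.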
Figure 2-0-7: LAPD officer on predictive policing (Source: nbclosangeles.com)
By aligning the police cars' patrol schedules with the model's predictions, the LAPD was able to reduce crime by 12% to 26% for different categories of crime. Recently, the San Francisco Police Department released its own crime data for over 2 years, so data analysts could model that data and help prevent future crimes.
Winning Political Elections
The US President, Barack Obama, was the first major political candidate to use Big Data in a significant way, in the 2008 elections. He is the first Big Data president. His campaign gathered data about millions of people, including supporters. They invented the "Donate Now" button for use in emails to obtain campaign contributions from millions of supporters. They created personal profiles of millions of supporters, and of what they had done and could do for the campaign. Data was used to determine undecided voters who could be converted to their side. They provided phone numbers of these undecided voters to the supporters to call, and then recorded the outcomes of those calls all over the web, using interactive applications. Obama himself used his Twitter account to communicate his messages directly with his millions of followers.
After the elections, Obama converted the list of supporters into an advocacy machine that would provide the grassroots support for the President's initiatives. Since then, almost all campaigns use Big Data. Senator Bernie Sanders used the same Big Data playbook to build an effective national political machine powered entirely by small donors. The analyst Nate Silver created sophisticated predictive models using inputs from many political polls and surveys to successfully predict winners of the US elections. Nate was, however, unsuccessful in predicting Donald Trump's rise, and that shows the limits of Big Data.
Personal Health
Correct diagnosis is the sine qua non of effective treatment. Medical knowledge and technology are growing by leaps and bounds. IBM Watson is a Big Data analytics engine that ingests and metabolizes all the medical information in the world, and then applies it intelligently to an individual situation. Watson can provide a detailed and accurate medical diagnosis using current symptoms, patient history, medication history, environmental trends, and other parameters. Similar products might be offered as an app to licensed doctors, and even individuals, to improve productivity and accuracy in healthcare.
New Product Development
These applications are totally new concepts that did not exist earlier.
Flexible auto insurance
An auto insurance company can use the GPS data from cars to calculate the risk of accidents based on travel patterns. The automobile companies can use the car sensor data to track the performance of a car. Safer drivers can be rewarded, and the errant drivers can be penalized.
Figure 2-0-8: GPS-based tracking of vehicles
Location-based retail promotion
A retailer, or a third-party advertiser, can target customers with specific promotions and coupons based on location data obtained through GPS, the time of day, the presence of stores nearby, and mapping it to the consumer preference data available from social media databases. Ads and offers can be delivered through mobile apps, SMS, and email. These are examples of location-based mobile applications.
Recommendation service
Ecommerce has been a fast-growing industry over the last couple of decades. A variety of products are sold and shared over the internet. Web users' browsing and purchase history on ecommerce sites is utilized to learn about their preferences and needs, and to advertise relevant product and pricing offers in real-time. Amazon uses a personalized recommendation engine to suggest new additional products to consumers based on the affinities of various products. Netflix also uses a recommendation engine to suggest entertainment options to its users.
Conclusion
Big Data has applicability across all industries. There are three major types of data sources of Big Data. They are people-people communications, people-machine communications, and machine-machine communications. Each type has many sources of data. There are three types of applications. They are the monitoring type, the analysis type, and new product development. This chapter presented a few business applications of each of those three types.
Review Questions
Q1: What are the major sources of Big Data? Describe a source of each type.
Q2: What are the three major types of Big Data applications? Describe two applications of each type.
Q3: Would it be ethical to arrest someone based on a Big Data model's prediction that that person is likely to commit a crime?
Q4: An auto insurance company learned about the movements of a person based on the GPS installed in the vehicle. Would it be ethical to use that as a surveillance tool?
Q5: Research and describe a Big Data application that has a proven return on investment for an organization.
Liberty Stores Case Exercise: Step B2
The Board of Directors asked the company to take concrete and effective steps to become a data-driven company. The company wants to understand its customers better. It wants to improve the happiness levels of its customers and employees. It wants to innovate on new products that its customers would like. It wants to relate its charitable activities to the interests of its customers.
Q1: What kind of data sources should the company capture for this?
Q2: What kind of Big Data applications would you suggest for this company?
Chapter 3 - Big Data Architecture
Introduction
Big Data application architecture is the configuration of tools and modules to accomplish the whole task. An ideal architecture would be resilient, secure, cost-effective, and adaptive to new needs and environments. This is achieved by beginning with proven architectures, and creatively and progressively restructuring them with new elements as additional needs and problems arise. Big Data architectures ultimately align with the architecture of the Universe, the source of all invincibility.
CASELET: Google Query Architecture
Google invented the first Big Data architecture. Their goal was to gather all the information on the web, organize it, and search it for specific queries from millions of users. An additional goal was to find a way to monetize this service by serving relevant and prioritized online advertisements on behalf of clients.
Google developed web crawling agents which would follow all the links in the web and make a copy of all the content on all the web pages they visited.
Google invented cost-effective, resilient, and fast ways to store and process all that exponentially growing data. It developed a scale-out architecture in which it could linearly increase its storage capacity by inserting additional computers into its computing network. The data files were distributed over the large number of machines in the cluster. This distributed file system was called the Google File System, and was the precursor to HDFS.
Google would sort or index the data thus gathered so it could be searched efficiently. They invented the key-value-pair NoSQL database architecture to store a variety of data objects. They developed the storage system to avoid updates in the same place. Thus the data was written once, and read multiple times.
Figure 3-0-1: Google Query Architecture
Google developed the MapReduce parallel processing architecture whereby large data sets could be processed by thousands of computers in parallel, with each computer processing a chunk of data, to produce quick results for the overall job.
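The MapReduce idea described above can be illustrated with a toy, single-machine word count. This is a hypothetical sketch in plain Java, not Hadoop's actual API: the map step emits a word for each token in each chunk, and the reduce step groups by word and sums the counts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy illustration of the MapReduce pattern. In a real cluster, the map
// step runs on many machines, each over its own chunk of the data, and
// the shuffle/reduce steps combine the partial results.
public class MiniMapReduce {

    public static Map<String, Integer> wordCount(List<String> chunks) {
        return chunks.stream()
            .flatMap(chunk -> Arrays.stream(chunk.split("\\s+"))) // map: emit words
            .collect(Collectors.groupingBy(word -> word,          // shuffle: group by key
                     Collectors.summingInt(word -> 1)));          // reduce: sum the counts
    }

    public static void main(String[] args) {
        // Two "chunks" of input, as if stored on two different machines
        Map<String, Integer> counts = wordCount(List.of("big data", "big ideas"));
        System.out.println(counts.get("big")); // 2
    }
}
```

The key design point is that the map and reduce steps are independent per key, which is what lets thousands of machines work in parallel without coordinating with each other.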
The Hadoop ecosystem of data management tools, like the Hadoop Distributed File System (HDFS), the columnar database system HBase, the querying tool Hive, and more, emerged from Google's inventions. Storm is a streaming data technology that produces instant results. The Lambda Architecture is a Y-shaped architecture that branches the incoming data stream out for batch as well as stream processing.
Q1: Why should Google publish its File System and its MapReduce parallel programming system, and release them as open-source systems?
Q2: What else can be done with Google's repository of all the web's data?
Standard Big Data architecture
Here is the generic Big Data architecture introduced in Chapter 1. There are many sources of data. All data is funneled in through an ingest system. The data is forked into two sides: a stream processing system and a batch processing system. The outcomes of this processing can be sent into NoSQL databases for later retrieval, or sent directly for consumption by many applications and devices.
Figure 3-0-2: Big Data Application Architecture
A Big Data solution typically comprises these logical layers. Each layer can be represented by one or more available technologies.
Big Data sources: The sources of data for an application depend upon what data is required to perform the kind of analyses you need. The various sources of Big Data were described in Chapter 2. The data will vary in origin, size, speed, form, and function, as described by the 4 Vs in Chapter 1. Data sources can be internal or external to the organization. The scope of access to the data available could be limited. The level of structure could be high or low. The speed of data and its quantity will also be high or low depending upon the data source.
Data ingest layer: This layer is responsible for acquiring data from the data sources. The data comes in through a scalable set of input points that can acquire data at various speeds and in various quantities. The data is sent to a batch processing system, a stream processing system, or directly to a storage file system (such as HDFS). Compliance regulations and governance policies impact what data can be stored and for how long.
Batch processing layer: This analysis layer receives data from the ingest point, or from the file system, or from the NoSQL databases. Data is processed using parallel programming techniques (such as MapReduce) to produce the desired results. This batch processing layer thus needs to understand the data sources and data types, the algorithms that would work on that data, and the format of the desired outcomes. The output of this layer could be sent for instant reporting, or stored in a NoSQL database for an on-demand report for the client.
Stream processing layer: This layer receives data directly from the ingest point. Data is processed using parallel programming techniques to produce the desired results in real time. This layer thus needs to understand the data sources and data types extremely well, and the super-light algorithms that would work on that data to produce the desired results. The outcome of this layer too could be stored in the NoSQL databases.
Data organizing layer: This layer receives data from both the batch and stream processing layers. Its objective is to organize the data for easy access. It is represented by NoSQL databases. SQL-like languages like Hive and Pig can be used to easily access data and generate reports.
Data consumption layer: This layer consumes the output provided by the analysis layers, directly or through the organizing layer. The outcome could be standard reports, data analytics, dashboards and other visualization applications, or recommendation engines, on mobile and other devices.
Infrastructure layer: At the bottom there is a layer that manages the raw resources of storage, compute, and communication. This is increasingly provided through a cloud computing paradigm.
Distributed file system layer: This layer includes the Hadoop Distributed File System (HDFS). It also includes supporting applications, such as YARN (Yet Another Resource Negotiator), that enable efficient access to data storage and its transfer.
Big Data architecture examples
Every major organization and application has a unique optimized infrastructure to suit its specific needs. Below are some architecture examples from some very prominent users and designers of Big Data applications.
IBM Watson
IBM Watson uses Spark to manage incoming data streams. It also uses Spark's machine learning library (MLlib) to analyze data and predict diseases.
Netflix
This is one of the largest providers of online video entertainment. They handle 400 billion online events per day. As a cutting-edge user of Big Data technologies, they are constantly innovating their mix of technologies to deliver the best performance. Kafka is the common messaging system for all incoming requests. They host the entire infrastructure on Amazon Web Services (AWS). The databases are AWS's S3, as well as Cassandra and HBase. Spark is used for stream processing.
(Source: Netflix)
eBay
eBay is the second-largest ecommerce company in the world. It delivers 800 million listings from 25 million sellers to 160 million buyers. To manage this huge stream of activity, eBay uses a stack of Hadoop, Spark, Kafka, and other elements. They think that Kafka is the best new thing for processing data streams.
VMware
Here is VMware's view of a Big Data architecture. It is similar to, but more detailed than, our main Big Data architecture diagram.
The Weather Company
The Weather Company serves weather data globally through websites and mobile apps. It uses a streaming architecture based on Apache Spark.
Ticketmaster
This is the world's largest company that sells event tickets. Their goal is to make tickets available for purchase by real fans, and to prevent bad actors from manipulating the system to increase the price of the tickets in the secondary markets.
LinkedIn
The goal of this professional networking company is to maintain an efficient system for processing the streaming data and make the link options available in real-time.
PayPal
This payments-facilitation company needs to understand and acquire customers, and process a large number of payment transactions.
CERN
This premier high-energy physics research lab computes petabytes of data, using in-memory stream processing to process data from millions of sensors and devices.
Conclusion
Big Data applications are architected to do stream as well as batch processing. Data is ingested and fed into streaming and batch processing. Most tools used for Big Data processing are open-source tools served through the Apache community, and through some key distributors of those technologies.
Review Questions
Q1: Describe the Big Data processing architecture.
Q2: What are Google's contributions to Big Data processing?
Q3: What are some of the hottest technologies visible in Big Data processing?
Liberty Stores Case Exercise: Step B3
The company wants to build a scalable and futuristic platform for its Big Data.
Q1: What kind of Big Data processing architecture would you suggest for this company?
Section 2
This section covers the important Big Data technologies defined in the Big Data architecture specified in Chapter 3.
Chapter 4 will cover Hadoop and its Distributed File System (HDFS).
Chapter 5 will cover the parallel processing algorithm, MapReduce.
Chapter 6 will cover NoSQL databases such as HBase and Cassandra. It will also cover the Pig and Hive languages used for accessing those databases.
Chapter 7 will cover Spark, a fast and integrated streaming data management platform.
Chapter 8 will cover data ingest systems, using Apache Kafka.
Chapter 9 will cover the Cloud Computing model.
Chapter 4: Distributed Computing using Hadoop
Introduction
A distributed system is a clever way of storing huge quantities of data, securely and cost-effectively, for speed and ease of retrieval and processing, using a networked collection of commodity machines. The ideal distributed file system would store infinite amounts of data while making the complexity completely transparent to the user, and enable easy access to the right data instantly. This would be achieved by storing fragments of data at different locations, and internally managing the lower-level tasks of storing and replicating data across the network. The distributed system ultimately leads to the creation of the unbounded cosmic computer that is aligned with the Unified Field of all the laws of nature.
Hadoop Framework
The Apache Hadoop distributed computing framework is composed of the following modules:
1. Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
2. Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
3. YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications.
4. MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
This chapter will cover Hadoop Common, HDFS, and YARN. The next chapter will cover MapReduce.
HDFS Design Goals
The Hadoop Distributed File System (HDFS) is a distributed and scalable file system. It is designed for applications that deal with large data sizes. It is also designed to deal with mostly immutable files, i.e. write data once, but read it many times.
HDFS has the following major design goals:
1. Hardware failure management – it will happen, and one must plan for it.
2. Huge volume – create capacity for a large number of huge files, with fast read/write throughput.
3. High speed – create a mechanism to provide low-latency access to streaming applications.
4. High variety – maintain simple data coherence, by writing data once but reading it many times.
5. Open-source – maintain easy accessibility of data using any hardware, software, and database platform.
6. Network efficiency – minimize network bandwidth requirements, by minimizing data movement.
Master-Slave Architecture
Hadoop is an architecture for organizing computers in a master-slave relationship that helps achieve great scalability in processing. An HDFS cluster has two types of nodes operating in a master-worker pattern: a single master node (called the NameNode), and a large number of slave worker nodes (called DataNodes). A small Hadoop cluster includes a single master and multiple worker nodes. A large Hadoop cluster would consist of a master and thousands of small ordinary machines as worker nodes.
Figure 4-0-1: Master-Slave Architecture
The master node manages the overall file system and its namespace, and controls the access to files by clients. The master node is aware of the DataNodes: i.e. what blocks of which file are stored on which DataNode. It also controls the processing plan for all applications running on the data on the cluster. There is only one master node. Unfortunately, that makes it a single point of failure. Therefore, whenever possible, the master node has a hot backup, just in case the master node dies unexpectedly. The master node uses a transaction log to persistently record every change that occurs to file system metadata.
The worker nodes store the data blocks in their storage space, as directed by the master node. Each worker node typically contains many disks to maximize storage capacity and access speed. Each worker node has its own local file system. A worker node has no awareness of the distributed file structure. It simply stores each block of data as directed, as if each block were a separate file. The DataNodes store and serve up blocks of data over the network using a block protocol, under the direction of the NameNode.
Figure 4-0-2: Hadoop Architecture (Source: Hadoop.apache.org)
The NameNode stores all relevant information about all the DataNodes, and the files stored in those DataNodes. The NameNode will contain:
- For every DataNode: its name, rack, capacity, and health.
- For every file: its name, replicas, type, size, timestamp, location, health, etc.
If a DataNode fails, there is no serious problem. The data on the failed DataNode will be accessed from its replicas on other DataNodes. The failed DataNode can be automatically recreated on another machine, by writing all those file blocks off from the other healthy replicas. Each DataNode sends a heartbeat message to the NameNode periodically. Without this message, the DataNode is assumed to be dead. The DataNode replication effort would automatically kick in to replace the dead DataNode.
The file system has a set of features and capabilities to completely hide the splintering and scattering of data, and enable the user to deal with the data at a high, logical level.
The NameNode tries to ensure that files are evenly spread across the DataNodes in the cluster. That balances the storage and computing load, and also limits the extent of loss from the failure of a node. The NameNode also tries to optimize the networking load. When retrieving data or ordering the processing, the NameNode tries to pick fragments from multiple nodes to balance the processing load and speed up the total processing effort. The NameNode also tries to store fragments of a file on the same node for speed of reading and writing. Processing is done on the node where the file fragment is stored.
Any piece of data is typically stored on three nodes: two on the same rack, and one on a different rack. DataNodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
Block system
HDFS stores large files (typically gigabytes to terabytes) by storing segments (called blocks) of the file across multiple machines. A block of data is the fundamental storage unit in HDFS. Data files are described, read, and written in block-sized granularity. All storage capacity and file sizes are measured in blocks. A block ranges from 16-128 MB in size, with a default block size of 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
Every data file takes up a number of blocks depending upon its size. Thus a 100 MB file will occupy two blocks (100 MB divided by 64 MB, rounded up), with some room to spare. Every storage disk can accommodate a number of blocks depending upon the size of the disk. Thus a 1 Terabyte disk will hold 16,384 blocks (1 TB divided by 64 MB).
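The block arithmetic above can be checked in a couple of lines. This is plain Java, assuming the 64 MB default block size; the class and method names are invented for this illustration.

```java
// Sketch of the HDFS block arithmetic, assuming the default 64 MB block size.
public class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB in bytes

    // Number of blocks a file occupies: file size / block size, rounded up
    static long blocksFor(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long tb = 1024L * 1024 * mb;
        System.out.println(blocksFor(100 * mb)); // a 100 MB file occupies 2 blocks
        System.out.println(tb / BLOCK_SIZE);     // a 1 TB disk holds 16384 blocks
    }
}
```

Note that the last block of a file is usually only partially full, which is the "room to spare" mentioned above.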
Every file is organized as a consecutively numbered sequence of blocks. A file's blocks are stored physically close to each other for ease of access, as far as possible. The file's block size and replication factor are configurable by the application that writes the file on HDFS.
Ensuring Data Integrity
Hadoop ensures that no data will be lost or corrupted, during storage or processing. The files are written only once, and never updated in place. They can be read many times. Only one client can write or append to a file at a time. No concurrent updates are allowed.
If data is indeed lost or corrupted, or if a part of the disk gets corrupted, a new healthy replica for the lost block will be automatically recreated by copying from the replicas on other DataNodes. At least one of the replicas is stored on a DataNode on a different rack. This guards against the failure of a rack of nodes, or the networking router on it.
A checksum algorithm is applied to all data written to HDFS. A process of serialization is used to turn files into a byte stream for transmission over a network, or for writing to persistent storage. Hadoop has additional security built in, using Kerberos verification.
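The checksum idea can be illustrated in plain Java. HDFS uses its own internal checksum mechanism; the CRC32 class here is just a stand-in for illustration: compute a checksum when a block is written, recompute it when the block is read, and treat any mismatch as corruption.

```java
import java.util.zip.CRC32;

// Illustration of checksum-based corruption detection (CRC32 as a stand-in
// for HDFS's internal block checksums).
public class ChecksumDemo {

    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "a block of file data".getBytes();
        long storedChecksum = checksum(block); // computed at write time

        // Later, at read time: recompute and compare.
        // A mismatch means the block is corrupt, and a healthy replica
        // should be read from another DataNode instead.
        System.out.println(checksum(block) == storedChecksum); // true

        block[0] ^= 1; // simulate a single flipped bit on disk
        System.out.println(checksum(block) == storedChecksum); // false
    }
}
```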
Installing HDFS
It is possible to run Hadoop on an in-house cluster of machines, or on the cloud, inexpensively. As an example, The New York Times used 100 Amazon Elastic Compute Cloud (EC2) instances (DataNodes) and a Hadoop application to process 4 TB of raw image TIFF data, stored in Amazon Simple Storage Service (S3), into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240 (not including bandwidth). See Chapter 9 for a primer on Cloud Computing. See Appendix 1 for a step-by-step tutorial on installing Hadoop on Amazon EC2.
Hadoop is written in Java, and requires a working Java installation. Installing Hadoop takes a lot of resources. For example, all information about fragments of files needs to be in NameNode memory. A thumb rule is that Hadoop needs approximately 1 GB of memory to manage 1 million file fragments. Many easy mechanisms exist to install the entire Hadoop stack. Using a GUI such as the Cloudera Resources Manager to install a Cloudera Hadoop stack is easy. This stack includes HDFS and many other related components, such as HBase, Pig, YARN, and more. Installing it on a cluster on a cloud services provider like AWS is easier than installing it in-house. If installing from the command line, download Hadoop from one of the Apache mirror sites.
Most access to files in Hadoop is provided through the Java abstract class org.apache.hadoop.fs.FileSystem. HDFS can be mounted directly with a Filesystem in Userspace (FUSE) virtual file system on Linux and some other Unix systems. File access can be achieved through the native Java application programming interface (API). Another API, called Thrift, helps to generate a client in the language of the user's choosing (such as C++, Java, or Python). When the Hadoop command is invoked with a class name as the first argument, it launches a Java virtual machine (JVM) to run the class, along with the relevant Hadoop libraries (and their dependencies) on the classpath.
HDFS has a UNIX-like command-line interface (CLI). Use the ssh shell to communicate with Hadoop. HDFS has a UNIX-like permissions model for files and directories, with three progressively increasing levels of permissions: read (r), write (w), and execute (x). Create an hd user, and communicate using the ssh shell on the local machine.
% hadoop fs -help    ## get detailed help on every command
Reading and Writing Local Files into HDFS
There are two different ways to transfer data: from the local file system, or from an input/output stream. Copying a file from the local file system to HDFS can be done by:
% hadoop fs -copyFromLocal path/filename
Reading and Writing Data Streams into HDFS
Reading a file from HDFS by using a java.net.URL object to open a stream requires a short script, as below:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process the stream here
} finally {
    IOUtils.closeStream(in);
}
A simple method to create a new file is as follows:
public FSDataOutputStream create(Path p) throws IOException
Data can be appended to an existing file using the append() method:
public FSDataOutputStream append(Path p) throws IOException
A directory can be created by a simple method:
public boolean mkdirs(Path p) throws IOException
List the contents of a directory using:
public FileStatus[] listStatus(Path p) throws IOException
public FileStatus[] listStatus(Path p, PathFilter filter) throws IOException
SequenceFiles
The incoming data files can range from very small to extremely large, and with different structures. Big Data files are therefore organized quite differently to handle the diversity of file sizes and types. Large files are stored as HDFS files, with file fragments distributed across the cluster. However, smaller files should be bunched together into a single segment for efficient storage.
SequenceFiles are a specialized data structure within Hadoop to handle smaller files with smaller record sizes. A SequenceFile uses a persistent data structure for data available in key-value pair format, which helps efficiently store smaller objects. HDFS and MapReduce are designed to work with large files, so packing small files into a SequenceFile container makes storing and processing the smaller files more efficient for HDFS and MapReduce.
Sequence files are row-oriented file formats, which means that the values for each row are stored contiguously in the file. This format is appropriate when a large number of columns of a single row are needed for processing at the same time. There are easy commands to create, read, and write SequenceFile structures. Sorting and merging SequenceFiles is native to the MapReduce system. A MapFile is essentially a sorted SequenceFile with an index to permit lookups by key.
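To see why packing helps, here is a toy, Hadoop-free sketch of the SequenceFile idea: many small key-value records are appended into one container, which can then be stored as a single large HDFS file. The length-prefixed layout is invented for illustration; it is not the real SequenceFile binary format.

```python
import io
import struct

def pack(records):
    # Append each (key, value) record as length-prefixed bytes into one container.
    buf = io.BytesIO()
    for key, value in records:
        for field in (key.encode(), value.encode()):
            buf.write(struct.pack(">I", len(field)))  # 4-byte big-endian length
            buf.write(field)
    return buf.getvalue()

def unpack(blob):
    # Read the container back into (key, value) pairs.
    buf, records = io.BytesIO(blob), []
    while True:
        header = buf.read(4)
        if not header:
            return records
        key = buf.read(struct.unpack(">I", header)[0]).decode()
        vlen = struct.unpack(">I", buf.read(4))[0]
        records.append((key, buf.read(vlen).decode()))

# Thousands of tiny files become one container that HDFS handles efficiently.
small_files = [("log-001.txt", "GET /index"), ("log-002.txt", "POST /cart")]
container = pack(small_files)
assert unpack(container) == small_files
```

The real SequenceFile format adds sync markers, compression, and metadata; the point here is only that one container beats thousands of tiny files.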
YARN
YARN (Yet Another Resource Negotiator) is the architectural center of Hadoop. It is often characterized as a large-scale, distributed operating system for big data applications. YARN manages resources and monitors workloads, in a secure multi-tenant environment, while ensuring high availability across multiple Hadoop clusters. YARN also brings great flexibility as a common platform to run multiple tools and applications, such as interactive SQL (e.g. Hive), real-time streaming (e.g. Spark), and batch processing (MapReduce), on data stored in a single HDFS storage platform. It brings clusters more scalability to expand beyond 1000 nodes, and it also improves cluster utilization through dynamic allocation of cluster resources to various applications.
Figure 4‑0‑3: Hadoop Distributed Architecture including YARN
The ResourceManager in YARN has two main components: the Scheduler and the ApplicationsManager.
The YARN Scheduler allocates resources to the various requesting applications. It does so based on an abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk storage, network, etc. Each machine also has a NodeManager that manages all the Containers on that machine, and reports status on resources and Containers to the YARN Scheduler.
The YARN ApplicationsManager accepts new job submissions from the client. It then requests a first resource Container for the application-specific ApplicationMaster program, and monitors the health and execution of the application. Once running, the ApplicationMaster directly negotiates additional resource containers from the Scheduler as needed.
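The Scheduler's container accounting can be mimicked in a few lines. This is a toy model, not YARN's actual scheduling policy (which adds queues, fairness, and data locality): NodeManagers report free capacity, and the scheduler grants container requests against it.

```python
# Toy model: each node reports its free memory (GB); the scheduler
# places a requested container on the first node with enough capacity.
nodes = {"node1": 8, "node2": 4}  # capacity as reported by NodeManagers

def allocate(container_mem_gb):
    for name, free in nodes.items():
        if free >= container_mem_gb:
            nodes[name] = free - container_mem_gb
            return name   # container granted on this node
    return None           # no capacity: the request must wait

assert allocate(6) == "node1"   # node1 now has 2 GB free
assert allocate(3) == "node2"   # node1 too small; node2 now has 1 GB free
assert allocate(5) is None      # no node can host a 5 GB container
```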
Conclusion
Hadoop is the major technology for managing big data. HDFS securely stores data on large clusters of commodity machines. A master machine controls the storage and processing activities of the worker machines. A NameNode controls the namespace and storage information for the file system on the DataNodes. A master JobTracker controls the processing of tasks at the DataNodes. YARN is the resource manager that manages all resources dynamically and efficiently across all applications on the cluster. The Hadoop file system and other parts of the Hadoop stack are distributed by many vendors, and can be easily installed on cloud computing infrastructure. A Hadoop installation tutorial is in Appendix A.
Review Questions
Q1: How does Hadoop differ from a traditional file system?
Q2: What are the design goals for HDFS?
Q3: How does HDFS ensure security and integrity of data?
Q4: How does a master node differ from a worker node?
Chapter 5 – Parallel Processing with MapReduce
Introduction
A parallel processing system is a clever way to process huge amounts of data in a short period of time, by enlisting the services of many computing devices to work on parts of the job simultaneously. The ideal parallel processing system will work across any computational problem, using any number of computing devices, across any size of data sets, with ease and high programmer productivity. This is achieved by framing the problem in a way that it can be broken down into many parts, such that each part can be processed independently of the other parts; the intermediate results from processing the parts can then be combined to produce a final solution. Infinite parallel processing is the essence of the infinite dynamism of the laws of nature.
MapReduce Overview
MapReduce is a parallel programming framework for speeding up large-scale data processing for certain types of tasks. It does so with minimal movement of data on distributed file systems such as HDFS clusters, to achieve near-real-time results. There are two major prerequisites for MapReduce programming: (a) the application must lend itself to parallel programming; (b) the data can be expressed in key-value pairs.
MapReduce processing is similar to the UNIX sequence (also called pipe) structure, e.g. the UNIX command:
grep -oE "[A-Za-z]+" myfile.txt | sort | uniq -c
will produce a word count of the text document called myfile.txt. There are three commands in this sequence, and they work as follows: (a) grep reads the text file and creates an intermediate stream with one word on each line; (b) the sort command sorts that intermediate stream, and produces an alphabetically sorted list of words; (c) the uniq -c command works on that sorted list to produce the number of occurrences of each word, and displays the results to the user in a "word, frequency" pair format.
For example, suppose myfile.txt contains the following text:
We are going to a picnic near our house. Many of our friends are coming. You are welcome to join us. We will have fun.
The outputs of Grep, Sort, and WordCount will be as shown below.
Grep Sort WordCount
We a a 1
are are are 3
going are coming 1
to are friends 1
a coming fun 1
picnic friends going 1
near fun have 1
our going house 1
house have join 1
Many house many 1
of join near 1
our many of 1
friends near our 2
are of picnic 1
coming our to 2
You our us 1
are picnic we 2
welcome to welcome 1
to to will 1
join us you 1
us We
we we
will welcome
have will
fun you
If the file is very large, then it will take the computer a long time to process it. Parallel processing can help here.
MapReduce speeds up the computation by having different computers read and process small chunks of the file in parallel. Thus if a file can be broken down into 100 small chunks, each chunk can be processed at a separate computer in parallel. The total time taken to process the file could be 1/100 of the time taken otherwise. However, now the results of the computation on the small chunks reside in 100 different places. This large number of partial results needs to be combined to produce a composite result. The outputs from the various chunks are combined by another program called the Reduce program.
The Map step distributes the full job into smaller tasks that can be done on separate computers, each using only a part of the data set. The result of the Map step is considered intermediate results. The Reduce step reads the intermediate results, and combines all of them to produce the final result. The programmer needs to specify the functional logic for both the map and reduce steps. The sorting, between the Map and Reduce steps, does not need to be specified; it is automatically taken care of by the MapReduce system as a standard service provided to every job. The sorting of the data requires a field to sort on. Thus the intermediate results need to have some kind of a key field, and a set of associated non-key attribute(s) for that key.
Figure 5‑0‑1: MapReduce Architecture
In practice, to manage the variety of data structures stored in the file system, data is stored as one key and one non-key attribute. Thus the data is represented as a key-value pair. The intermediate results and the final results will also all be in key-value format. Thus a key requirement for the use of the MapReduce parallel processing system is that both the input data and the output data must be represented in key-value format.
The Map step reads data in key-value pair format. The programmer decides the characteristics of the key and value fields. The Map step produces results in key-value pair format. However, the keys produced by the Map step, i.e. the intermediate results, need not be the same keys as in the input data. So those can be called key2-value2 pairs.
The Reduce step reads the key2-value2 pairs, the intermediate results produced by the Map step. The Reduce step will produce an output using the same keys that it read; only the values associated with those keys change as a result of processing. Thus its output can be labeled as key2-value3 format.
Suppose the text in myfile.txt can be split into 4 approximately equal segments. It could be done with each sentence as a separate piece of text. The four segments will look as follows:
Segment 1: We are going to a picnic near our house.
Segment 2: Many of our friends are coming.
Segment 3: You are welcome to join us.
Segment 4: We will have fun.
Thus the input to the 4 processors in the Map step will be in key-value pair format. The first column is the key, which is the entire sentence in this case. The second column is the value, which in this application is the frequency of the sentence.
We are going to a picnic near our house.	1
Many of our friends are coming.	1
You are welcome to join us.	1
We will have fun.	1
This task can be done in parallel by four processors. Each of these segments will be the task for a different processor. Thus each task will produce a file of words, each with a count of 1. There will be four intermediate files, in <key, value> pair format, shown below.
Key2 Value2 Key2 Value2 Key2 Value2 Key2 Value2
we 1 many 1 you 1 we 1
are 1 of 1 are 1 will 1
going 1 our 1 welcome 1 have 1
to 1 friends 1 to 1 fun 1
a 1 are 1 join 1
picnic 1 coming 1 us 1
near 1
our 1
house 1
The sort process inherent within MapReduce will sort each of the intermediate files, and produce the following sorted key-value pairs:
Key2	Value2	Key2	Value2	Key2	Value2	Key2	Value2
a 1 are 1 are 1 fun 1
are 1 coming 1 join 1 have 1
going 1 friends 1 to 1 we 1
house 1 many 1 us 1 will 1
near 1 of 1 welcome 1
our 1 our 1 you 1
picnic 1
to 1
we 1
The Reduce function will read the sorted intermediate files, and combine the counts for all the unique words, to produce the following output. The keys remain the same as in the intermediate results. However, the values change as the counts from each of the intermediate files are added up for each key. For example, the count for the word 'are' goes up to 3.
Key2 Value3
a 1
are 3
coming 1
friends 1
fun 1
going 1
have 1
house 1
join 1
many 1
near 1
of 1
our 2
picnic 1
to 2
us 1
we 2
welcome 1
will 1
you 1
This output will be identical to that produced by the UNIX sequence earlier.
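The whole worked example can be reproduced in a short standalone Python simulation of the three phases (ordinary single-machine code that mimics the MapReduce flow; it is not Hadoop itself):

```python
from itertools import groupby

segments = [
    "We are going to a picnic near our house.",
    "Many of our friends are coming.",
    "You are welcome to join us.",
    "We will have fun.",
]

# Map: each segment independently emits (word, 1) pairs.
def map_segment(text):
    return [(word.strip(".").lower(), 1) for word in text.split()]

intermediate = [pair for seg in segments for pair in map_segment(seg)]

# Shuffle/sort: the framework sorts the intermediate pairs by key.
intermediate.sort(key=lambda kv: kv[0])

# Reduce: sum the values for each unique key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(intermediate, key=lambda kv: kv[0])}

assert counts["are"] == 3 and counts["our"] == 2 and counts["we"] == 2
assert counts["to"] == 2 and counts["picnic"] == 1
```

Because each map_segment call touches only its own segment, the four calls could run on four separate machines; only the sort and the reduce need to see all the intermediate pairs.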
MapReduce Programming
A data processing problem needs to be transformed into the MapReduce model. The first step is to visualize the processing plan as a map and a reduce step. When the processing gets more complex, this complexity is generally manifested in having more MapReduce jobs, or more complex map and reduce jobs. Having more but simpler MapReduce jobs leads to more easily maintainable mapper and reducer programs.
MapReduce Data Types and Formats
MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
In general, the map input key and value types (K1 and V1) are different from the map output types (K2 and V2). However, the reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3). Since Mapper and Reducer are separate classes, their type parameters have different scopes.
Hadoop can process many different types of data formats, from flat text files to databases. An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record — a key-value pair — in turn. Splits and records are logical: they may map to a full file, a part of a file, or a collection of files. In a database context, a split might correspond to a range of rows from a table, and a record to a row in that range.
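Splits and records can be illustrated with a hypothetical line-based splitter (real Hadoop splits are byte ranges aligned with HDFS blocks, so this is a simplification):

```python
def make_splits(lines, lines_per_split):
    # Each split is a chunk of the input that one map task will process.
    return [lines[i:i + lines_per_split]
            for i in range(0, len(lines), lines_per_split)]

def run_map(split):
    # The map task processes each record of its split in turn;
    # here a record is the (offset, line) key-value pair.
    return [(offset, line.upper()) for offset, line in enumerate(split)]

lines = ["first record", "second record", "third record"]
splits = make_splits(lines, 2)
assert len(splits) == 2                          # two map tasks
assert run_map(splits[1]) == [(0, "THIRD RECORD")]
```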
Writing MapReduce Programs
Start by writing pseudocode for the map and reduce functions. The program code for both the map and the reduce functions can then be written in Java or other languages. In Java, the map function is represented by the generic Mapper class. It uses four parameters: input key, input value, output key, and output value. This class has an abstract map() method, which receives the input key and input value, and would normally produce an output key and output value. For more complex problems, it is better to use a higher-level language than MapReduce, such as Pig, Hive, Cascading, Crunch, or Spark.
A mapper commonly performs input format parsing, projection (selecting the relevant fields), and filtering (selecting the records of interest). The reducer typically combines (adds or averages) those values.
Figure 5‑0‑2: MapReduce Program Flow
Here is the step-by-step logic. Imagine that we want to do a word count of all unique words in a text.
1. The big document is split into many segments. The map step is run on each segment of data. The output will be a set of key-value pairs. In this case, the key will be a word in the document.
2. The system will gather the key-value pair outputs from all the mappers, and will sort them by key. The sorted list itself may then be split into a few segments.
3. A Reducer task will read the sorted list and produce a combined list of word counts.
Here is the classic pseudocode for word count:
map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
reduce(String key, Iterator values):
    // key: a word; values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Testing MapReduce Programs
Mapper programs running on a cluster can be complicated to debug. The time-honored way of debugging programs is via print statements. However, with the programs eventually running on tens or thousands of nodes, it is best to debug the programs in stages. Therefore, first run the program using small sample data sets to ensure that it is working correctly. Then expand the unit tests to cover larger data sets, and run it on a cluster. Ensure that the mapper or reducer can handle the inputs correctly. Running against the full data set is likely to expose some more issues, which should be fixed by altering your mapper or reducer to handle the new cases. After the program is working, it may be tuned to make the entire MapReduce job run faster.
It may be desirable to split the logic into many simple mappers and chain them into a single mapper using a facility (the ChainMapper library class) built into Hadoop. It can run a chain of mappers, followed by a reducer and another chain of mappers, in a single MapReduce job.
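The ChainMapper idea can be sketched in Python as plain function composition (a conceptual model; the real ChainMapper is a Hadoop Java class configured on the job): several simple mappers, each doing one of the parsing, projection, and filtering roles mentioned above, are chained so the output of one feeds the next within a single task.

```python
# Each simple mapper does one small job; chaining keeps each one testable.
def parse_mapper(records):
    return [line.split(",") for line in records]           # input format parsing

def project_mapper(records):
    return [(fields[0], fields[2]) for fields in records]  # keep relevant fields

def filter_mapper(records):
    return [(k, v) for k, v in records if v != "error"]    # drop unwanted rows

def chain(*mappers):
    def chained(records):
        for mapper in mappers:   # output of one mapper feeds the next
            records = mapper(records)
        return records
    return chained

mapper = chain(parse_mapper, project_mapper, filter_mapper)
rows = ["u1,2016,ok", "u2,2016,error", "u3,2016,ok"]
assert mapper(rows) == [("u1", "ok"), ("u3", "ok")]
```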
MapReduce Jobs Execution
A MapReduce job is specified by the Map program and the Reduce program, along with the data sets associated with that job. There is another master program that resides and runs endlessly on the NameNode. It is called the JobTracker, and it tracks the progress of MapReduce jobs from beginning to completion. Hadoop divides the job into two kinds of tasks: map tasks and reduce tasks. Hadoop moves the Map and Reduce computation logic to each DataNode that is hosting a part of the data. The communication between the nodes is accomplished using YARN, Hadoop's native resource manager.
The master machine (NameNode) is completely aware of the data stored on each of the worker machines (DataNodes). It schedules the map or reduce jobs to task trackers with full awareness of the data location. For example: if node A contains data (x, y, z) and node B contains data (a, b, c), the JobTracker schedules node B to perform map or reduce tasks on (a, b, c), and node A would be scheduled to perform map or reduce tasks on (x, y, z). This reduces the data traffic and prevents choking of the network.
Each DataNode has a program called the TaskTracker, which monitors the execution of every task assigned to it by the NameNode. When a task is completed, the TaskTracker sends a completion message to the JobTracker program on the NameNode.
The jobs and tasks work in a master-slave mode.
Figure 5‑0‑3: Hierarchical Monitoring Architecture
When there is more than one job in a MapReduce workflow, it is necessary that they be executed in the right order. For a linear chain of jobs this might be easy. For a more complex directed acyclic graph (DAG) of jobs, there are libraries that can help orchestrate your workflow. Or one can use Apache Oozie, a system for running workflows of dependent jobs.
Oozie consists of two main parts: a workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on), and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster.
The data set for a MapReduce job is divided into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. The tasks are scheduled using YARN and run on nodes in the cluster. YARN ensures that if a task fails or is inordinately delayed, it will be automatically scheduled to run on a different node. The outputs of the map tasks are fed as input to the reduce tasks. That logic is also propagated to the node(s) that will do the reduce tasks. To save on bandwidth, Hadoop allows the use of a combiner function on the map output. The combiner function's output then forms the input to the reduce function.
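The bandwidth saving from a combiner can be seen in a small sketch: running a reduce-like aggregation locally on one map task's output shrinks what must cross the network (illustrative Python, using the word counts of the running example):

```python
from collections import defaultdict

# Output of one map task before it leaves the node.
map_output = [("we", 1), ("are", 1), ("are", 1), ("we", 1), ("are", 1)]

# Combiner: reduce-like aggregation run locally on the map task's output.
totals = defaultdict(int)
for key, value in map_output:
    totals[key] += value
combined = sorted(totals.items())

# Five pairs would have crossed the network; after combining, only two do.
assert combined == [("are", 3), ("we", 2)]
assert len(combined) < len(map_output)
```

A combiner is only safe for operations such as sums, counts, or maxima, where aggregating partial results and then aggregating again gives the same final answer.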
How MapReduce Works
A MapReduce job can be executed with a single method call: submit() on a Job object. When the resource manager receives a call to its submitApplication() method, it hands off the request to the YARN scheduler. The scheduler allocates a container, and the resource manager then launches the application master's process. The application master for MapReduce jobs is a Java application whose main class is MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress. It retrieves the input splits computed in the client from the shared file system. It then creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job). Tasks are given IDs at this point. The application master must decide how to run the tasks that make up the MapReduce job. It requests containers for all the map and reduce tasks in the job from the resource manager. Once a task has been assigned resources for a container on a particular node by the resource manager's scheduler, the application master starts the container by contacting the node manager. The task is executed by a Java application whose main class is YarnChild.
Managing Failures
There can be failures at the level of the entire job or of particular tasks. The application master itself could even fail.
Task failure usually happens when the user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error to its parent application master, where it is logged in the error logs. The application master will then reschedule execution of the task on another data node.
The entire job, i.e. the MapReduce application master running on YARN, can also fail. In that case, it is started again, subject to a maximum number of attempts, which is a user-set configuration parameter.
If a node manager fails by crashing or running very slowly, it will stop sending heartbeats to the resource manager (or send them very infrequently). The resource manager will then remove it from its pool of nodes on which to schedule containers. Any task or application master running on the failed node manager will be recovered using error logs, and started on other nodes.
The YARN resource manager can also fail, and that has more severe consequences for the entire cluster. Therefore, typically, there will be a hot standby for YARN. If the active resource manager fails, then the standby can take over without a significant interruption to the client. The new resource manager can read the application information from the state store, and then restart the applications that were running on the cluster.
Shuffle and Sort
MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort — and transfers the map outputs to the reducers as inputs — is known as the shuffle.
When the map function starts producing output, it is not directly written to disk. The system takes advantage of buffering writes in memory and doing some presorting, for efficiency reasons. Each map task has a circular memory buffer that it writes the output to. Before the buffer is written to disk, a background thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key. If there is a combiner function, it is run on the output of the sort, so that there is less data to transfer to the reducer.
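That partitioning step can be modeled directly: each key is assigned to a reducer by hashing, so every pair with a given key lands in the same partition. This sketch behaves analogously to Hadoop's default hash partitioning, but it is not Hadoop's Java code:

```python
def partition(key, num_reducers):
    # All pairs with the same key map to the same reducer partition.
    # (Python salts str hashes per process, but the mapping is stable
    # within a run, which is all this illustration needs.)
    return hash(key) % num_reducers

NUM_REDUCERS = 4
pairs = [("are", 1), ("we", 1), ("are", 1), ("fun", 1)]

partitions = {}
for key, value in pairs:
    partitions.setdefault(partition(key, NUM_REDUCERS), []).append((key, value))

# Both ("are", 1) pairs ended up in the same partition,
# so one reducer will see every count for "are".
are_partition = partition("are", NUM_REDUCERS)
assert partitions[are_partition].count(("are", 1)) == 2
```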
The reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts reading their outputs as soon as each completes. When all the map outputs have been read, the reduce task merges the map outputs, maintaining their sort ordering. The reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output file system, such as HDFS.
Progress and Status Updates
MapReduce jobs are long-running batch jobs, taking a long time to run. It is important for the user to get feedback on how the job is progressing. A job and each of its tasks have a status value (e.g., running, successfully completed, failed), along with the progress of maps and reduces and the values of the job's counters. These values are constantly communicated back to the client. When the application master receives a notification that the last task for a job is complete, it changes the status for the job to "successful." Job statistics and counters are communicated to the user.
Hadoop comes with a native web-based GUI for tracking MapReduce jobs. It displays useful information about a job's progress, such as how many tasks have been completed and which ones are still being executed. Once the job is completed, one can view the job statistics and logs.
Hadoop Streaming
Hadoop Streaming uses standard Unix streams as the interface between Hadoop and the user program. Streaming is ideal for text processing. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format — a tab-separated key-value pair — passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output.
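A Streaming mapper and reducer are simply programs that read lines and write tab-delimited key-value lines. The pair below follows that contract; they are written as functions over lists of lines so they can be run without a cluster, whereas real Streaming scripts would read sys.stdin and print to stdout.

```python
def streaming_mapper(lines):
    # Emit one tab-delimited "word<TAB>1" line per word.
    out = []
    for line in lines:
        for word in line.split():
            out.append(f"{word}\t1")
    return out

def streaming_reducer(sorted_lines):
    # Input lines arrive sorted by key; sum consecutive counts per key.
    out, current, total = [], None, 0
    for line in sorted_lines:
        key, count = line.split("\t")
        if key != current:
            if current is not None:
                out.append(f"{current}\t{total}")
            current, total = key, 0
        total += int(count)
    if current is not None:
        out.append(f"{current}\t{total}")
    return out

mapped = streaming_mapper(["we are", "are we", "are"])
reduced = streaming_reducer(sorted(mapped))   # framework sorts between stages
assert reduced == ["are\t3", "we\t2"]
```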
Conclusion
MapReduce is the first popular parallel programming framework for Big Data. It works well for applications where the data can be large, divisible into separate sets, and represented in <key, value> pair format. The application logic is divided into two parts: a Map program and a Reduce program. Each of these programs can be run in parallel by several machines.
Review Questions
Q1: What is MapReduce? What are its benefits?
Q2: What is the key-value pair format? How is it different from other data structures? What are its benefits and limitations?
Chapter 6 – NoSQL Databases
A NoSQL database is a clever way to cost-effectively organize large amounts of heterogeneous data for efficient access and updates. The ideal NoSQL database is completely aligned with the nature of the problems being solved, and is super fast at that task. This is achieved by releasing and relaxing many of the integrity and redundancy constraints of storing data in relational databases, and storing data in many innovative formats aligned with business need. The diverse NoSQL databases will ultimately collectively evolve into a holistic set of efficient and elegant data structures at the heart of a cosmic computer of infinite organizing capacity.
Introduction
Relational database management systems (RDBMS) are a powerful database technology used by almost all enterprises. Relational databases are structured and optimized to ensure accuracy and consistency of data, while also eliminating any redundancy of data. These databases are stored on the largest and most reliable of computers, to ensure that the data is always available at a granular level and at high speed.
Big data, however, is a much larger and more unpredictable stream of data. Relational databases are inadequate for this task, and would also be very expensive for such large data volumes. Managing the costs and speed of handling such large and heterogeneous data streams requires relaxing many of the strict rules and requirements of relational data. Depending upon which constraint(s) are relaxed, a different kind of database structure will emerge. These are called NoSQL databases, to differentiate them from relational databases that use Structured Query Language (SQL) as the primary means to manipulate data.
NoSQL databases are next-generation databases that are non-relational in their design. The name NoSQL is meant to differentiate them from antiquated, 'pre-relational' databases. Today, almost every organization that needs to gather customer feedback and sentiments to improve their business will use a NoSQL database. NoSQL is useful when an enterprise needs to access, analyze, and utilize massive amounts of either structured or unstructured data, or data that is stored remotely in any virtual server across the globe.
The constraints of a relational database are relaxed in many ways. For example, relational databases require that any data element can be randomly accessed and its value updated in that same physical location. However, the simple physics of storage says that it is simpler and faster to read or write sequential blocks of data on a disk. Therefore, NoSQL database files are written once and almost never updated in place. If a new version of a part of the data becomes available, it is stored elsewhere by the system. The system has the intelligence to link the updated data to the old data.
Pig and Hive are two key and popular languages in the Hadoop ecosystem that work well on NoSQL databases. Pig originated at Yahoo, while Hive originated at Facebook. Both Pig and Hive can use the same data as input, and can achieve similar results with queries. Both Pig Latin and Hive commands eventually compile down to Map and Reduce jobs. They have a similar goal: to ease the complexity of writing complex Java MapReduce programs. Most MapReduce jobs can be implemented easily in Hive or Pig.
For analytical needs, Hive is preferable over Pig: Hive leads to ease and productivity with its SQL-like design and user interface. For controlled processing, Pig's scripting design is preferable, as Pig offers greater control over data flows. Java MapReduce, with its more advanced APIs, can be used to accomplish things when something special is needed, such as interacting with a third-party tool, or handling some special data characteristics.
RDBMS vs NoSQL
RDBMS and NoSQL databases are different in many ways. First, NoSQL databases do not support relational schemas or the SQL language. The term NoSQL stands mostly for "Not only SQL". Second, their transaction processing capabilities are fast but weak, and they do not support the ACID (Atomicity, Consistency, Isolation, Durability) properties associated with transaction processing using relational databases. Instead, they are approximately accurate at any point in time, and will be eventually consistent. Third, these databases are distributed and horizontally scalable, to manage web-scale databases using Hadoop clusters of storage. Thus they work well with the write-once, read-many storage mechanism of Hadoop clusters.
Feature	RDBMS	NoSQL
Applications	Mostly centralized applications (e.g. ERP)	Mostly designed for decentralized applications (e.g. Web, mobile, sensors)
Availability	Moderate to high	Continuous availability to receive and serve data
Velocity	Moderate velocity of data	High velocity of data (devices, sensors, social media, etc.); low latency of access
Data Volume	Moderate size; archived after a certain period	Huge volume of data, stored mostly for a long time or forever; linearly scalable DB
Data Sources	Data arrives from one or a few, mostly predictable, sources	Data arrives from multiple locations and is of unpredictable nature
Data Type	Data are mostly structured	Structured or unstructured data
Data Access	Primary concern is reading the data	Concern is both read and write
Technology	Standardized relational schemas; SQL language	Many designs with many implementations of data structures and access languages
Cost	Expensive; commercial	Low; open-source software
Types of NoSQL Databases
The variety of big data means that file sizes and types will vary enormously. There are specialized databases to suit different purposes.
1. Document Databases: Storing a 10 GB video movie file as a single object could be speeded up by sequentially storing the data in contiguous blocks of physical storage. An index could store the identifying information about the movie, and the address of the starting block. The rest of the storage details could be handled by the system. This storage format would be called a document store format. The index would contain the name of the movie, and the value is the entire video file, characterized by the first block of storage. Document databases are generally useful for content management systems, blogging platforms, web analytics, real-time analytics, and e-commerce applications. We would avoid using document databases for systems that need complex transactions spanning multiple operations, or queries against varying aggregate structures.
2. Key-Value Pair Databases: There could be a collection of many data elements, such as a collection of text messages, which could also fit into a single physical block of storage. Each text message is a unique object. This data would need to be queried often. That collection of messages could also be stored in a key-value pair format, by combining the identifier of the message and the content of the message. Key-value databases are useful for storing session information, user profiles, preferences, and shopping cart data. Key-value databases don't work so well when we need to query by non-key fields, or on multiple key fields at the same time.
3. Graph Databases: Geographic map data is stored as a set of relationships, or links, between points. Graph databases are very well suited to problem spaces where we have connected data, such as social networks, spatial data, routing information, and recommendation engines.
4. Columnar Databases: Some kinds of databases are needed to speed up oft-sought queries on very large data sets. Suppose there is an extremely large data warehouse of web log access data, which is rolled up by the number of web accesses by the hour. This needs to be queried, or summarized, often, involving only some of the data fields from the database. Such a query could be speeded up by creating a database structure that includes only the relevant columns of the data set, along with the key identifying information. This is called a columnar database format, and it is useful for content management systems, blogging platforms, maintaining counters, expiring usage, and heavy write volume such as log aggregation. Column-family databases work well for systems where the query patterns have stabilized.
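The key-value pair format described above (type 2) can be pictured as little more than a dictionary keyed by a single identifier: lookups by key are fast, but the store never looks inside the value. A minimal in-memory sketch:

```python
# Minimal key-value store: session data looked up only by session ID.
store = {}

def put(key, value):
    store[key] = value        # the value is opaque to the store

def get(key):
    return store.get(key)     # only key-based lookup is supported

put("session:42", {"user": "ankita", "cart": ["book", "pen"]})
assert get("session:42")["cart"] == ["book", "pen"]

# Finding all sessions whose cart contains "pen" would require scanning
# every value: exactly the non-key query that key-value stores do poorly.
```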
The choice of NoSQL database depends on the system requirements. There are at least 200 implementations of NoSQL databases of these four types. Visit nosql-database.org for more.
Despite the name, a NoSQL database does not necessarily prohibit structured query language (like MySQL). While some of the NoSQL systems are entirely non-relational, others just avoid some selected functionality of RDBMS, such as fixed table schemas and join operations. In NoSQL systems, instead of using tables, the data can be organized in key/value pair format, and then SQL can be used.
The first popular NoSQL database was HBase, which is a part of the Hadoop family. The most popular NoSQL database used today is Apache Cassandra, which was developed and owned by Facebook until it was released as open source in 2008. Other NoSQL database systems are SimpleDB, Google's BigTable, MemcacheDB, Oracle NoSQL, Voldemort, etc.
Architecture of NoSQL
Figure 6‑0‑1: NoSQL Databases Architecture
One of the key concepts underlying NoSQL databases is that database management has moved to a two-layer architecture, separating the concerns of data modeling and data storage. The data storage layer focuses on the task of high-performance, scalable data storage for the task at hand. The data management layer supports a variety of database formats, and allows for low-level access to that data through specialized languages that are more appropriate for the job, rather than being constrained by the standard SQL format.
NoSQL databases map the data into key/value pairs and save the data in the storage unit. There is no storage of data in a centralized tabular form, so the database is highly scalable. The data could be of different forms, and coming from different sources, and it can all be stored in similar key/value pair formats.
There is a variety of NoSQL architectures. Some popular NoSQL databases like MongoDB are designed in a master/slave model, like many RDBMS. But other popular NoSQL databases like Cassandra are designed in a master-less fashion, where all the nodes in the cluster are the same. So it is the architecture of the NoSQL database system that determines which benefits of a distributed and scalable system emerge, such as continuous availability, distributed access, high speed, and so on.
NoSQL databases provide developers lots of options to choose from and fine-tune the system to their specific requirements. Understanding how the data is going to be consumed by the system requires answering questions such as: Is it read-heavy vs. write-heavy? Is there a need to query data with random query parameters? Will the system be able to handle inconsistent data?
CAP Theorem
Data is expected to be accurate and available. In a distributed environment, accuracy depends upon the consistency of data. A system is considered Consistent if all replicas of the data contain the same value. The system is considered Available if the data is available at all points in time. It is also desirable for the data to be consistent and available even when a network failure renders the database partitioned into two or more islands. A system is considered Partition Tolerant if processing can continue in both partitions in the case of a network failure. In practice it is hard to achieve all three.
The choice between Consistency and Availability remains the unavoidable reality for distributed data stores. The CAP theorem states that in any distributed system one can choose only two out of the three (Consistency, Availability, and Partition Tolerance). The third will be determined by those choices.
NoSQL databases can be tuned to suit one's preference for high consistency or high availability. For a typical NoSQL database, there are essentially three parameters:
- N = replication factor, i.e. the number of replicas created for each piece of data
- R = minimum number of nodes that should respond to a read request for it to be considered successful
- W = minimum number of nodes that should respond to a write request before it is considered successful
Setting the values of R and W very high (R=N and W=N) will make the system more consistent. However, it will be slow to respond, and thus Availability will be low. At the other end, setting R and W very low (such as R=1 and W=1) would make the cluster highly available, as even a single successful read (or write) would let the cluster report success. However, consistency of data on the cluster will be low, since many nodes may not yet have received the latest copy of the data.
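The interplay of N, R, and W can be sketched in a few lines. This is a hypothetical simulation of the quorum rule, not any real NoSQL client API: a read quorum and a write quorum are guaranteed to overlap in at least one up-to-date replica exactly when R + W > N.

```python
# Sketch: how N, R, W govern consistency in a replicated store.
# Hypothetical simulation -- not any real NoSQL database's API.

def is_strongly_consistent(n, r, w):
    """A read quorum and a write quorum must overlap in at least
    one replica, which holds exactly when R + W > N."""
    return r + w > n

# R = W = N: every replica must answer; consistent but least available.
assert is_strongly_consistent(n=3, r=3, w=3)

# R = W = 1: fastest and most available, but a read may miss
# the replica that took the latest write.
assert not is_strongly_consistent(n=3, r=1, w=1)

# A common middle ground: majority quorums (R = W = 2 with N = 3).
assert is_strongly_consistent(n=3, r=2, w=2)
```

Majority quorums are a popular compromise because they keep both reads and writes available as long as a majority of replicas is reachable.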
If a network gets partitioned because of a network failure, then one has to trade off availability against consistency. NoSQL database users often choose availability and partition tolerance over strong consistency. They argue that short periods of application misbehavior are less problematic than short periods of unavailability.
Consistency is more expensive, in terms of throughput or latency, than Availability. However, HDFS chooses consistency: three failed DataNodes can potentially render a file's blocks completely unavailable.
Popular NoSQL Databases
We cover two of the more popular offerings.
HBase
Apache HBase is a column-oriented, non-relational, distributed database system that runs on top of HDFS. An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a Primary Key; all access to HBase tables is done using the Primary Key. An HBase column represents an attribute of an object. For example, if a table is storing diagnostic logs from web servers, each row will be a log record. Each column in that table will represent an attribute such as the date/time of the record, or the server name. HBase permits many attributes to be grouped together into a column family, so that all elements of a column family are stored together as essentially a composite attribute.
Columnar databases differ from relational databases in how the data is stored. In a relational database, all the columns/attributes of a given row are stored together. With HBase you must predefine the table schema and specify the column families. All rows of a column family will be stored sequentially. However, HBase is very flexible in that new columns can be added to families at any time, making the schema flexible and therefore able to adapt to changing application requirements.
Architecture Overview
HBase is built on a master-slave concept. In HBase, a master node manages the cluster, while the worker nodes (called region servers) store portions of the tables and perform the work on the data. HBase is designed after Google Bigtable, and offers similar capabilities on top of Hadoop and HDFS. It does consistent reads and writes. It does automatic and configurable sharding of tables; a shard is a segment of the database.
Figure 6-2: HBase Architecture
Physically, HBase is composed of three types of servers in a master-slave type of architecture:
(a) The NameNode maintains metadata information for all the physical data blocks that comprise the files.
(b) Region servers serve data for reads and writes.
(c) The Hadoop DataNode stores the data that the Region Server is managing.
HBase tables are divided horizontally by row-key range into "Regions." A region contains all rows in the table between the region's start key and end key. Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper, which is part of the Hadoop ecosystem, maintains a live cluster state. There is automatic failover support between Region Servers. All HBase data is stored in HDFS files. Region Servers are collocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the Region Servers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.
Each Region Server creates an ephemeral node. The HMaster monitors these nodes to discover available region servers, and it also monitors these nodes for server failures.
A master is responsible for coordinating the region servers, including assigning regions on startup, re-assigning regions for recovery or load balancing, and monitoring the region servers' health. It is also the interface for creating, deleting, and updating tables.
Reading and Writing Data
There is a special HBase Catalog table called the META table, which holds the location of the regions in the cluster. ZooKeeper stores the location of the META table.
This is what happens the first time a client reads or writes to HBase:
1. The client gets from ZooKeeper the Region Server that hosts the META table.
2. The client queries the META server to get the Region Server corresponding to the row key it wants to access. The client caches this information along with the META table location.
3. It gets the row from the corresponding Region Server.
For future reads, the client uses the cache to retrieve the META location and previously read row keys. Over time, it does not need to query the META table, unless there is a miss because a region has moved; then it will re-query and update the cache.
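The lookup described above can be sketched in a few lines of Python. The region boundaries, server names, and cache here are all made up for illustration; this is not the HBase client API, just the idea of resolving a row key to a region by its key range and caching the answer:

```python
from bisect import bisect_right

# Sketch of client-side region lookup: regions are contiguous
# row-key ranges, and the client caches which server holds each key.
# All names below are illustrative, not real HBase structures.

regions = [("", "server-a"), ("m", "server-b"), ("t", "server-c")]
start_keys = [start for start, _ in regions]
cache = {}

def locate(row_key):
    """Find the region server for a row key, consulting the cache
    first and falling back to a (simulated) META lookup."""
    if row_key in cache:
        return cache[row_key]
    # META lookup: the region whose start key is the greatest one <= row_key.
    idx = bisect_right(start_keys, row_key) - 1
    server = regions[idx][1]
    cache[row_key] = server   # future reads skip the META query
    return server

assert locate("apple") == "server-a"
assert locate("orange") == "server-b"
assert locate("zebra") == "server-c"
```

On a cache miss caused by a moved region, a real client would simply invalidate the cached entry and repeat the META lookup.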
Cassandra
Apache Cassandra is a massively scalable, open-source, non-relational database that offers continuous uptime, simplicity, and easy data distribution across multiple data centers and the cloud. Cassandra was originally developed at Facebook and was open-sourced in 2008. It provides many benefits over traditional relational databases for modern online applications, such as a scalable architecture, continuous availability, high data protection, multiple data replications across data centers, data compression, a SQL-like language, and so on.
Architecture Overview
Cassandra's architecture gives it the ability to scale and provide continuous availability. Rather than using a master-slave architecture, it has a master-less "ring" design that is easy to set up and maintain. In Cassandra, all nodes play an equal role, and all nodes communicate with one another via a distributed and highly scalable protocol called gossip.
Thus, the Cassandra scalable architecture provides the capacity for handling large volumes of data, and large numbers of concurrent users or operations, across multiple data centers, just as easily as a normal operation in a relational database. To enhance its capacity, one simply needs to add new nodes to an existing cluster, without taking down the system and redesigning it from scratch.
The Cassandra architecture also means that, unlike master-slave systems, it has no single point of failure, and thus is capable of offering continuous availability and uptime.
Reading and Writing Data
Data to be written to a Cassandra node is first recorded in an on-disk commit log, and then it is written to a memory-based structure called a "memtable." When a memtable's size exceeds a certain set threshold, the data is written to a file on disk called an "SSTable." In this way the write operation is fully sequential in nature, with many input/output operations occurring at the same time, rather than occurring one at a time over a long period.
For a read operation, Cassandra consults an in-memory data structure called a "Bloom filter" that indicates the probability of an SSTable having the required data. The Bloom filter can very quickly tell whether a file might have the needed data or not. If it returns true, then Cassandra looks in another layer of in-memory caches, and then fetches the compressed data on disk. If the answer is false, Cassandra doesn't bother reading that SSTable and looks in another file to fetch the required data.
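The key property of a Bloom filter is that "false" means the key is definitely absent, while "true" only means it may be present. Here is a minimal Bloom filter sketch in Python to illustrate that read-path idea; it is illustrative only, not Cassandra's actual implementation:

```python
import hashlib

# Minimal Bloom filter sketch: "definitely not present" vs
# "possibly present." Illustrative only, not Cassandra's code.

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from hashes of the key.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False => key is definitely absent (skip the SSTable on disk);
        # True  => key *may* be present (go check the file).
        return all(self.bits[pos] for pos in self._positions(key))

sstable_keys = BloomFilter()
sstable_keys.add("user:42")
assert sstable_keys.might_contain("user:42")  # added keys always return True
```

Because the filter lives in memory and never returns a false negative, Cassandra can safely skip disk reads for most SSTables that do not hold the requested key.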
Write syntax:
TTransport tr = new TSocket(HOST, PORT);
TFramedTransport tf = new TFramedTransport(tr);
TProtocol protocol = new TBinaryProtocol(tf);
Cassandra.Client client = new Cassandra.Client(protocol);
tf.open();
client.insert(userIDKey, cp,
    new Column("column-name".getBytes(UTF8), "column-data".getBytes(), clock), CL);
Read syntax:
Column col = client.get(userIDKey, colPathName, CL).getColumn();
LOG.debug("Column name: " + new String(col.name, UTF8));
LOG.debug("Column value: " + new String(col.value, UTF8));
Hive Language
Hive is a declarative SQL-like language for queries. Hive was designed to appeal to a community comfortable with SQL. It is used mainly by data analysts on the server side, for designing reports. It has its own metadata section, which can be defined ahead of time, before data is loaded. Hive supports map and reduce transform scripts in the language of the user's choice, which can be embedded within SQL clauses. It is widely used in Facebook by analysts comfortable with SQL, as well as by data miners programming in Python. Hive is best used for traditional data warehousing tasks; it is not designed for online transaction processing.
Hive is best suited for structured data. Hive can be used to query data stored in HBase, which is a key-value store. Hive's SQL-like structure makes transformation of data to and from an RDBMS easier. Supporting SQL syntax also makes it easy to integrate with existing BI tools. Hive needs the data to be first imported (or loaded), and after that it can be worked upon. In the case of streaming data, one would have to keep filling buckets (or files), and then Hive can be used to process each filled bucket, while using other buckets to keep storing the newly arriving data.
Hive data is organized into tables, which are mapped to locations in HDFS. This mapping is stored in the metadata. All HQL queries are converted to MapReduce jobs. A table can have one or more partition keys. There are the usual SQL data types, plus Arrays, Maps, and Structs to represent more complex types of data. There are also user-defined functions for mapping and aggregating data.
Figure 6-3: Hive Architecture
Hive Language Capabilities
Hive's SQL provides almost all basic SQL operations. These operations work on tables and/or partitions. These operations include: SELECT, FROM, WHERE, JOIN, GROUP BY, and
ORDER BY. Hive also allows the results to be stored in another table, or in an HDFS file.
The statement to create a page_view table would look like:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Here is a script for creating a staging table through which data can be loaded:
CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
The table created above can be stored in HDFS as a TextFile or as a SequenceFile. The staged data file is first copied into HDFS, and then an INSERT query on this table will look like:
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
Pig Language
Pig is a high-level procedural language. It is used mainly for programming. It helps create a step-by-step flow of data processing. It operates mostly on the client side of the cluster. Pig Latin follows a procedural programming model and is more natural for building a data pipeline, such as an ETL job. It gives full control over how the data flows through the pipeline and when to checkpoint the data in the pipeline, and it supports DAGs in the pipeline (such as splits), giving more control over optimization. Pig works well with unstructured data. For complex operations, such as analyzing matrices or searching for patterns in unstructured data, Pig gives greater control and options.
Pig allows one to load data and user code at any point in the pipeline. This can be important for ingesting streaming data from satellites or instruments. Pig also uses lazy evaluation. Pig is faster in data import but slower in actual execution than an RDBMS-friendly language like Hive. Pig is well suited to parallelization, and so it is better suited for very large datasets where throughput (the amount of data processed) is more important than latency (speed of response).
Pig is SQL-like, but differs to a great extent. It does not have a dedicated metadata section; the schema has to be defined in the program itself. Pig can be easier for someone who has no earlier experience with SQL.
Conclusion
NoSQL databases emerged in response to the limitations of relational databases in handling the sheer volume, nature, and growth of data. NoSQL databases offer functionality like MapReduce. NoSQL databases are proving to be a viable solution to enterprise data needs, and will continue to do so. There are four types of NoSQL databases: columnar, key-value, document, and graph databases. Cassandra and HBase are among the most popular NoSQL databases. Hive is an SQL-type language to access data from NoSQL databases. Pig is a high-level procedural language that gives greater control over data flows.
Review Questions
Q1: What is a NoSQL database? What are its different types?
Q2: How does a NoSQL database leverage the power of MapReduce?
Q3: What are the kinds of NoSQL databases? What are the advantages of each?
Q4: What are the similarities and differences between Hive and Pig?
Chapter 7 – Stream Processing with Spark
A stream processing system is a clever way to process large quantities of data from a vast set of extremely fast incoming data streams. The ideal stream processing engine will capture and report, in real time, the essence of all data streams, no matter their speed or size. This is achieved by using innovative algorithms and filters that relax many computational accuracy requirements, to compute simple approximate metrics in real time. A stream processing engine aligns with the infinite dynamism of the flow of nature.
Introduction
Apache Spark is an integrated, fast, in-memory, general-purpose engine for large-scale data processing. Spark is ideal for iterative and interactive processing tasks on large datasets and streams. Spark achieves 10-100x performance over Hadoop by operating with an in-memory construct called "Resilient Distributed Datasets," which helps avoid the latencies involved in disk reads and writes. While Spark is compatible with Hadoop file systems and tools, a large-scale adoption of Spark and its built-in libraries (for machine learning, graph processing, stream processing, and SQL) will deliver seamless, fast data processing along with high programmer productivity. Spark has become a more efficient and productive alternative to the Hadoop ecosystem, and is increasingly being used in industry.
Apache Spark was originally developed in 2009 in UC Berkeley's AMPLab, and open-sourced in 2010 as an Apache project. It can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), and NoSQL databases such as HBase and Cassandra. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when datasets are too large to fit into the available system memory. Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of datasets that are diverse in nature (text data, graph data, etc.) as well as in source (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory, and 10 times faster even when running on disk. Spark is an alternative to Hadoop MapReduce rather than a replacement for Hadoop. It provides a comprehensive and unified solution to manage different big data use cases and requirements.
Spark Architecture
The core Spark engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data, including a SQL query engine, a library of machine learning algorithms, a graph processing system, and streaming data processing software. Spark allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data. Spark runs on top of existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. It supports deploying Spark applications in an existing Hadoop v1 cluster (with SIMR, Spark-Inside-MapReduce), a Hadoop v2 YARN cluster, or even Apache Mesos.
Next we will introduce the two important features in Spark: RDDs and DAGs.
Resilient Distributed Datasets (RDD)
An RDD, or Resilient Distributed Dataset, is a distributed memory abstraction. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude.
RDDs are immutable, partitioned collections of records, which are transformed only by coarse-grained operations such as map, filter, groupBy, etc. By coarse-grained operations, it is meant that the operations are applied to all elements in a dataset. RDDs can only be created by reading data from a stable storage such as HDFS, or by transformations on existing RDDs.
Once data is read into an RDD object in Spark, a variety of operations can be performed by calling abstract Spark APIs. The two major types of operations available are transformations and actions. Transformations return a new, modified RDD based on the original. Several transformations are available through the Spark API, including map(), filter(), sample(), and union(). Actions return a value based on some computation being performed on an RDD. Some examples of actions supported by the Spark API include reduce(), count(), first(), and foreach().
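The transformation/action split can be illustrated with a toy class in plain Python. This is only a sketch of the idea, not the real Spark API: transformations merely record work to do, and nothing is computed until an action runs the recorded pipeline.

```python
# Plain-Python sketch of transformations vs. actions. Illustrative
# only -- MiniRDD is a made-up class, not Spark's RDD.

class MiniRDD:
    def __init__(self, data, pipeline=None):
        self.data = data
        self.pipeline = pipeline or []

    # Transformations: return a new MiniRDD, compute nothing yet.
    def map(self, fn):
        return MiniRDD(self.data, self.pipeline + [("map", fn)])

    def filter(self, fn):
        return MiniRDD(self.data, self.pipeline + [("filter", fn)])

    # Actions: run the recorded pipeline and return a value.
    def collect(self):
        out = self.data
        for kind, fn in self.pipeline:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = MiniRDD([1, 2, 3, 4, 5]).map(lambda x: x * x).filter(lambda x: x > 5)
assert rdd.collect() == [9, 16, 25]
assert rdd.count() == 3
```

Deferring computation this way is what lets Spark build a whole DAG of stages before deciding how to execute it.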
Directed Acyclic Graph (DAG)
DAG refers to a directed acyclic graph. This approach is an important feature for real-time Big Data platforms. Tools built on it, including Storm, Spark, and Tez, offer amazing new capabilities for building highly interactive, real-time computing systems to power real-time BI, predictive analytics, real-time marketing, and other critical systems.
The DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling; i.e., after an RDD action has been called, it becomes a job that is then transformed into a set of stages that are submitted as TaskSets for execution. In general, the DAGScheduler does three things in Spark: it computes an execution DAG (i.e., a DAG of stages) for a job; it determines the preferred locations to run each task on; and it handles failures due to shuffle output files being lost.
Spark Ecosystem
Spark is an integrated stack of tools responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Spark is written primarily in Scala, but includes code from Python, Java, R, and other languages. Spark comes with a set of integrated tools that reduce learning time and deliver higher user productivity. The Spark ecosystem includes the Mesos resource manager, among other tools.
Spark has already overtaken Hadoop in general adoption because of the benefits it provides in terms of faster execution of iterative processing algorithms.
Spark for Big Data Processing
Spark supports big data mining through relevant libraries including MLlib, GraphX, and SparkR, and through the Spark SQL language and the Streaming library.
MLlib
MLlib is Spark's machine learning library. It consists of basic machine learning algorithms such as classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. Algorithmic performance matters here: Spark excels at iterative computation, enabling MLlib to run fast. MLlib also contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce. In addition, Spark MLlib is easy to use, and it supports Scala, Java, Python, and SparkR.
For example, decision trees are a popular data classification technique. Spark MLlib supports decision trees for binary and multiclass classification, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances.
Functions in Decision Trees
class: public static DecisionTreeModel trainClassifier(…)
Method to train a decision tree model for binary or multiclass classification.
Parameters:
• input - Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, …, numClasses-1}.
• numClassesForClassification - number of classes for classification.
• categoricalFeaturesInfo - Map storing arity of categorical features.
• impurity - Criterion used for information gain calculation. Supported values: "gini" or "entropy".
• maxDepth - Maximum depth of the tree (suggested value: 4).
• maxBins - Maximum number of bins used for splitting features (suggested value: 100).
Returns: a DecisionTreeModel that can be used for prediction.
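The "gini" impurity criterion named in the parameter list above can be made concrete with a small computation. This is a minimal sketch of the Gini measure itself, not MLlib's implementation:

```python
from collections import Counter

# Sketch of the Gini impurity criterion used when growing a
# decision tree: lower impurity means a purer node. Illustrative
# only -- not MLlib's distributed implementation.

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

assert gini([0, 0, 0, 0]) == 0.0                 # pure node
assert abs(gini([0, 0, 1, 1]) - 0.5) < 1e-9      # evenly mixed binary node
```

When evaluating a candidate split, the trainer picks the one that most reduces the weighted impurity of the child nodes; that reduction is the information gain.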
Spark GraphX
Efficient processing of large graphs is another important and challenging issue. Many practical computing problems concern large graphs. For example, Google has to run its PageRank on billions of web pages and maybe trillions of web links. GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multi-graph with properties attached to each vertex and edge.
To support graph computation, GraphX exposes a set of fundamental operators such as subgraph, joinVertices, and aggregateMessages, on the basis of an optimized variant of the Pregel API (Pregel is the system at Google that powers PageRank). In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
We compute the PageRank of each user as follows:
// Load the edges as a graph object
val graph = GraphLoader.edgeListFile(sc, "outlink.txt")
// Run PageRank
val ranks = graph.pageRank(0.00000001).vertices
// Join the ranks with the web pages
val pages = sc.textFile("pages.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByPagename = pages.join(ranks).map {
  case (id, (pagename, rank)) => (pagename, rank)
}
// Print the output
println(ranksByPagename.collect().mkString("\n"))
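To show what the GraphX call computes, here is a tiny pure-Python PageRank using the standard power-iteration formulation. The graph and constants below are made up for illustration:

```python
# Tiny PageRank sketch: repeatedly redistribute each node's rank
# along its outgoing links. Illustrative only -- not GraphX code.

def pagerank(links, damping=0.85, iterations=50):
    """links: {node: [outgoing neighbors]}. Returns a rank per node."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        contrib = {node: 0.0 for node in nodes}
        for node, outs in links.items():
            for out in outs:
                contrib[out] += ranks[node] / len(outs)
        ranks = {node: (1 - damping) / n + damping * contrib[node]
                 for node in nodes}
    return ranks

# A and C both link to B, so B should rank highest.
ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["B"]})
assert ranks["B"] > ranks["A"] and ranks["B"] > ranks["C"]
```

GraphX performs essentially this iteration, but with the rank updates expressed as Pregel-style messages over a distributed graph.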
SparkR
R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited, as the runtime is single-threaded and can only process datasets that fit in a single machine's memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark; using Spark's distributed computation engine allows us to run large-scale data analysis from the R shell. SparkR exposes the RDD API of Spark as distributed lists in R. For example, one can read an input file from HDFS and process every line using lapply on an RDD, as in the following caselet:
sc <- sparkR.init("local")
lines <- textFile(sc, "hdfs://data.txt")
wordsPerLine <- lapply(lines, function(line) { length(unlist(strsplit(line, " "))) })
In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyWithPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey, and collect.
Spark SQL
Spark SQL is a language provided to deal with structured data. Using it, one can run queries on the data and get meaningful results. It supports queries through SQL as well as HQL (Hive Query Language), which is Apache Hive's version of SQL.
Spark Streaming
Spark Streaming receives data streams from input sources, processes them in a cluster, and pushes results out to databases or dashboards. Spark chops up the data streams into batches of a few seconds each. Spark treats each batch of data as an RDD and processes it using RDD operations. The processed results are pushed out in batches.
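The micro-batch model described above can be sketched in plain Python: an unbounded stream is chopped into small batches, and each batch is processed as an ordinary collection. This is illustrative only, not the Spark Streaming API:

```python
# Sketch of micro-batching: turn a (conceptually endless) stream
# into fixed-size batches, then process each batch as a whole.
# Illustrative only -- not Spark Streaming.

def micro_batches(stream, batch_size):
    """Yield fixed-size batches of records from a stream."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

# Process each batch with a normal batch operation (here, counting
# "error" records per batch, as a dashboard might).
stream = iter(["error", "ok", "error", "ok", "ok"])
results = [batch.count("error") for batch in micro_batches(stream, 2)]
assert results == [1, 1, 0]
```

In Spark Streaming the batching is by time interval rather than record count, and each batch becomes an RDD, but the processing pattern is the same.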
Spark Applications
Some hot data problems that are solved well by a tool like Apache Spark include:
1. Real-time log data monitoring.
2. Massive natural language processing.
3. Large-scale online recommendation systems.
A simple word count application can be run in the Spark shell as below.
// Read the text file into an RDD
val textFile = sc.textFile("C:/Users/MyName/Documents/obamaSpeech.txt")
// Count the words by splitting on spaces
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Report the number of distinct words
counts.count()
// Output: Long = 52
// Save the result file to the Desktop
counts.saveAsTextFile("C:/Users/MyName/Desktop/counts1")
Spark vs. Hadoop
Spark and Hadoop are both popular Apache projects dedicated to big data processing. Hadoop, for many years, was the leading open-source big data platform, and many companies already use a distributed computing framework like Hadoop based on MapReduce. Table 9.1 provides a summary of the differences between Hadoop and Spark.
Purpose. Hadoop: Resilient, cost-effective storage and processing of large datasets. Spark: Fast, general-purpose engine for large-scale data processing.
Core component. Hadoop: Hadoop Distributed File System (HDFS). Spark: Spark Core, the in-memory processing engine.
Storage. Hadoop: HDFS manages massive data collections across multiple nodes within a cluster of commodity servers. Spark: Spark doesn't do distributed storage; it operates on distributed data collections.
Fault tolerance. Hadoop: Uses replication to achieve fault tolerance. Spark: Uses RDDs for fault tolerance, which minimizes network I/O.
Nature of processing. Hadoop: Accompanied by MapReduce, it does batch processing of data in parallel mode. Spark: Batch as well as stream processing.
Sweet spot. Hadoop: Batch processing. Spark: Iterative and interactive processing jobs that can fit in memory.
Processing speed. Hadoop: MapReduce is slow. Spark: Can be up to 10x faster than MapReduce for batch processing, and up to 100x faster for stream processing.
Security. Hadoop: More secure. Spark: Less secure.
Failure recovery. Hadoop: Can recover from system faults or failures, since data is written to disk after every operation. Spark: Data objects are stored in RDDs, which can be reconstructed after faults or failures.
Analytics tools. Hadoop: Separate engine. Spark: Built-in MLlib (machine learning) and GraphX (graph processing) libraries.
Compatibility. Hadoop: Primary storage model is HDFS. Spark: Compatible with HDFS and other storage formats.
Language support. Hadoop: Java. Spark: Scala is the native language; APIs for Python, Java, R, and others.
Driving organization. Hadoop: Yahoo. Spark: AMPLab at UC Berkeley.
Technology owners. Hadoop: Apache, open-source, free. Spark: Open-source, free.
Key distributors. Hadoop: Cloudera, Hortonworks, MapR. Spark: Databricks, AMPLab.
Cost of system. Hadoop: Medium to high. Spark: Medium to high.
Conclusion
Spark is a new integrated system for big data processing. Its most important core abstraction is the RDD, along with relevant libraries like MLlib and GraphX. Spark is a really powerful open-source processing engine built around speed, ease of use, and sophisticated analytics.
Review Questions
Q1: Describe the Spark ecosystem.
Q2: Compare Spark and Hadoop in terms of their ability to do stream computing.
Q3: What is an RDD? How does it make Spark faster?
Q4: Describe three major capabilities in Spark for data analytics.
Chapter 8 – Ingesting Data
Wholeness
A data ingesting system is a reliable and efficient point of reception for all data coming into a system. This system is designed to be flexible and scalable, to receive data from various sources, at various times, speeds, and quantities. The ingest system makes the data available for use by the target applications in real time. Ideally, all data would be smoothly received, and made available for downstream applications to securely and reliably access at their own convenience. A dedicated data ingest mechanism is achieved by creating a fast and flexible buffer for receiving and storing all incoming streams of data. The data in the buffer is stored in a sequential manner, and is made available to all consuming applications in a fast and orderly manner.
Big Data arrives into a system at unpredictable speeds and quantities. Business applications thereafter receive and process this data at some planned throughput capacity. An ingest buffer is needed to communicate the data without loss of data or speed. This buffer idea has historically been called a messaging system, not too dissimilar from a mailbox system at the post office. Incoming messages are put into a set of organized locations, from where the target applications receive them when they are ready.
With huge amounts of data coming in from different sources, and many more consuming applications, a point-to-point system of delivering messages becomes inadequate and slow. Alternatively, incoming data can be categorized into certain topics, and stored in the respective location or locations for those topics. Instead of data being received and held in storage for a specific target application, now the data may be consumed by any application that is interested in data related to a topic. Each consuming application can choose to read data about one or more topics of its interest. This is called the publish-and-subscribe system.
Messaging Systems
A messaging system is an asynchronous mode of communicating data between applications. There are two generic kinds of messaging systems: a point-to-point system, and a publish-subscribe (pub-sub) system. Most messaging patterns now follow the pub-sub model.
Point-to-Point Messaging System
In a point-to-point system, every message is directed at a particular receiver. A common queue can receive messages from many producers. Any particular message can be received and consumed by only one receiver. Once that target consumer reads a message in the queue, the message disappears from the queue. The typical example of this system is an order processing system, where each order will be processed by one order processor.
Publish-Subscribe Messaging System
In a pub-sub messaging system, applications publish their output to a standard messaging queue. A target recipient only needs to know where to get the message, whenever it is ready to pick up the message. Applications can thus ignore the mechanics of interaction with other applications, and simply care about the message itself. This is especially valuable when there may be many target recipients for a message. In a pub-sub system, messages are entered into the messaging queue asynchronously from client applications.
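The pub-sub pattern just described can be sketched in a few lines of Python. Producers publish to named topics with no knowledge of the consumers, and any number of subscribers can read the same topic. The broker class and its methods are made up for illustration; this is not a real message broker API:

```python
from collections import defaultdict, deque

# Toy pub-sub broker: producers publish to topics; many consumers
# can subscribe to and read the same topic. Illustrative only.

class PubSubBroker:
    def __init__(self):
        self.topics = defaultdict(deque)      # topic -> queued messages
        self.subscribers = defaultdict(list)  # topic -> consumer names

    def subscribe(self, consumer, topic):
        self.subscribers[topic].append(consumer)

    def publish(self, topic, message):
        # The producer needs no knowledge of who will consume this.
        self.topics[topic].append(message)

    def poll(self, topic):
        # Unlike point-to-point, reading does not remove the message;
        # every subscriber sees the full topic.
        return list(self.topics[topic])

broker = PubSubBroker()
broker.subscribe("dashboard", "clicks")
broker.subscribe("archiver", "clicks")
broker.publish("clicks", {"page": "/home"})
assert broker.poll("clicks") == [{"page": "/home"}]
assert broker.subscribers["clicks"] == ["dashboard", "archiver"]
```

The key contrast with the point-to-point model is in poll(): consuming a message does not delete it, so multiple independent applications can each process the whole topic.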
A message queuing system needs to be fast and secure to serve many applications, both producers and subscribers. Messages are also replicated across multiple locations for reliability of data.
There are two popular data ingesting systems used in Big Data. An older system, called Flume, is closely tied to the Hadoop distributed file system. The newer and more popular system is a general-purpose system called Apache Kafka. In this chapter we will discuss the newer system, Kafka.
Apache Kafka
Apache Kafka is an open-source publish-and-subscribe message broker system. Kafka aims to provide an integrated, high-throughput, low-latency messaging platform for handling real-time data feeds. In the abstract, it is a single point of contact between all producers and consumers of data. All producers of data send data to Kafka. All consumers of data read data from Kafka. (Figure 8.1)
Figure 8-1: Kafka core idea
Kafka is a distributed, partitioned, scalable, replicated messaging system, with a simple but unique design. It was initially developed by LinkedIn and was open-sourced in early 2011. The Apache Software Foundation is now responsible for its development and improvement. Kafka is valuable for enterprise-level infrastructure because of its simplicity and scalability. The Kafka system is written in the high-level Scala programming language.
Use Cases
Following are some popular use cases of Apache Kafka.
Messaging
Kafka is a very good alternative to a traditional message broker, because the Kafka messaging system has better throughput, built-in partitioning, replication, and better fault tolerance. Kafka is a very good solution for large-scale message processing applications.
Website Activity Tracking
Website activity tracking was one of the initial use cases for Kafka at LinkedIn. The users' online activity tracking pipeline was rebuilt as a set of real-time data feeds. General web activity tracking involves very large volumes of data, and Kafka is very good at handling this huge volume of data. User activity types, such as page views, searches, and clicks, can be designated as central topics, and the activity data can be published to those topics. Those events are then available for real-time or offline processing and reporting.
Stream Processing
Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and send the results to other users and consumer applications. They may even write it back to Kafka under a new topic. Kafka's strong durability is also very useful for stream processing.
Log Aggregation
Activity log aggregation typically gathers physical log files from servers and puts them all in a central place for processing. Kafka can abstract away the details of the files and provide a cleaner abstraction of log data as a stream of messages. Use of Kafka then allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. Compared with dedicated log-centric systems, Kafka offers higher performance and stronger durability guarantees due to replication.
Commit Log
Kafka can be used as an external commit log for a distributed database system. This audit log can help re-sync data between failed nodes to restore their data. The log compaction feature in Kafka helps achieve this more efficiently.
Kafka Architecture
In the abstract, Kafka brokers deal with producers and consumers of data. A producer pushes data into the ingest system at its own speed, scale, and convenience. A consumer pulls data out of the system at its own speed, scale, and convenience. All the received data is organized by categories, called topics. Incoming data is sorted and stored into topic servers. The consumers of data can subscribe to one or more topics (Figure 8.2).
Figure 8-2: Kafka Ecosystem
There is more than one broker (also called servers, or partitions) for each topic, for reliability of the messaging system. Thus two or more brokers will store data on each topic. Only one broker can be the leader at any given time. If the lead broker fails, then a second one can automatically take over and prevent the loss of access to data.
Kafka is designed for distributed, high-throughput systems. In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message processing applications. It has the ability to handle a large number of diverse consumers. It integrates very well with Apache Storm, Spark, and other real-time streaming data applications. Kafka is very fast and can perform 2 million writes per second. It also guarantees zero downtime and zero data loss.
There are a lot of contributing organizations helping improve the Kafka open-source system. It has very well-documented online resources. It has been used by many big organizations such as LinkedIn, Cisco Systems, Spotify, PayPal, HubSpot, Shopify, Uber, and more. HubSpot uses Kafka to deliver real-time notification of when a recipient opens their email. PayPal uses Kafka to process millions of updates in a minute.
Producers
A producer is responsible for selecting the topic, and the partition within it, for each message that it wants to convey. It can use a round-robin algorithm to balance the load among partitions. There can be both synchronous and asynchronous producers for producing messages and publishing them to a partition.
Consumers
A consumer is responsible for reading the data on the topics to which it has subscribed. The consumer is responsible for reading the data within a reasonable period of time, before the queues are emptied for efficient management of storage. Different consuming applications can read the data at different times. Kafka has stronger ordering guarantees than a traditional messaging system. A consumer needs to know how far it has read in a queue, so as to avoid duplicating or losing some data.
Broker
AbrokerisaserverinaKafkacluster.Theclustermayhavemanysuchserversorbrokers.
Topic
A topic is a category into which messages are published. For each topic there is a separate partition log for storage of messages. Each partition holds an ordered sequence of messages for that topic. Each message in the partition is assigned a unique sequential number, also called the offset. This offset helps to identify each message within the partition.

The consumer reads the data sequentially according to offset numbers. The consumer maintains the offset to remember how far it has read. Generally, the offset increases linearly as messages are consumed. However, a consumer can reset its offset to access older data again and reprocess it as needed.
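The offset mechanics described above can be modeled in a short Python sketch. This is an illustrative toy, not the real Kafka consumer API; the class and method names are invented:

```python
class SimpleConsumer:
    """Toy model of a consumer tracking its own offset in one partition log.
    Illustrative only; a real consumer speaks the Kafka protocol."""
    def __init__(self, partition_log):
        self.log = partition_log   # ordered list of messages in the partition
        self.offset = 0            # how far this consumer has read

    def poll(self, max_messages=2):
        # Read the next batch starting at the saved offset, then advance it.
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # Reset the offset to re-read and reprocess older messages.
        self.offset = offset

partition = ["m0", "m1", "m2", "m3"]
consumer = SimpleConsumer(partition)
first = consumer.poll()      # ["m0", "m1"]
second = consumer.poll()     # ["m2", "m3"]
consumer.seek(0)             # rewind to reprocess from the start
replay = consumer.poll(4)    # all four messages again
```

The `seek` call shows why the consumer, not the broker, owns the offset: rewinding it lets the same application reprocess old data without affecting other consumers.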
The Kafka cluster keeps all published messages, whether or not they have been consumed, for a configurable period of time. For example, if the log retention is set to seven days, then for the seven days after publishing, the message is available for consumption. After seven days, Kafka discards the messages to free up space.
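Time-based retention amounts to pruning everything older than the retention window, regardless of consumption. A minimal sketch of that rule (the function and data here are invented for illustration, not Kafka code):

```python
from datetime import datetime, timedelta

def prune_log(messages, retention=timedelta(days=7), now=None):
    # Keep only (timestamp, message) pairs still inside the retention window,
    # whether or not they were ever consumed.
    now = now if now is not None else datetime.now()
    return [(ts, msg) for ts, msg in messages if now - ts <= retention]

now = datetime(2016, 1, 10)
log = [(datetime(2016, 1, 1), "old message"),      # 9 days old: discarded
       (datetime(2016, 1, 8), "recent message")]   # 2 days old: kept
kept = prune_log(log, now=now)
print(kept)  # only the message published on Jan 8 survives
```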
Kafka's performance is not affected by the size of the data. Each partition must fit on the servers that host it, but a topic may have multiple partitions. This enables Kafka to manage an arbitrary amount of data. The partition also acts as the unit of parallelism.
Summary of Key Attributes

1. Disk based: Kafka works on a cluster of disks. It does not keep everything in memory, and keeps writing to disk to make the storage permanent.
2. Fault tolerant: Data in Kafka is replicated across multiple brokers. When any leader broker fails, a follower broker takes over as leader and everything continues to work normally.
3. Scalable: Kafka can scale up easily by adding more partitions or more brokers. More brokers help to spread the load, and this provides greater throughput.
4. Low latency: Kafka does very little processing on the data. Thus it has very low latency: messages published by the producer are available to the consumer within a few milliseconds.
5. Finite retention: Kafka by default keeps a message in the cluster for a week. After that the storage is refreshed. Thus data consumers have up to a week to catch up on data, in case they fall behind for any reason.
Distribution
The Kafka cluster maintains multiple servers over a distributed network. The partitions of the log are maintained over this network. Each server handles data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance. One of the servers for each partition acts as the main server, also called the "leader", while it may or may not have one or more secondary servers, also known as "followers". The leader server is responsible for handling all the read and write operations for the partition, while the followers silently replicate the leader. The follower servers become very helpful when the leader server fails: a follower automatically becomes the leader and handles the failure. One server can be a leader for some of the partitions on it, while being a follower for other partitions. Thus one server can act as both leader and follower. This helps to balance the workload on the servers within the cluster.
Guarantees
Messages always maintain the order in which they were sent. For example, if messages M1 and M2 were sent by the same producer and M1 was sent first, then M1 will have a lower offset than M2. Therefore, M1 will always appear before M2 for the consumer.

Each topic has a replication factor N, and the system can tolerate up to N-1 server failures without losing any messages committed to the log.
Client Libraries

Kafka supports the following client libraries:

1. Python: pure-Python implementation with full protocol support; Consumer and Producer implementations are also included.
2. C: high-performance C library with full protocol support.
3. C++, Ruby, JavaScript, and more.
Apache ZooKeeper

Kafka is built on top of ZooKeeper. Apache ZooKeeper is a distributed configuration and synchronization service in Hadoop clusters. Here it serves as the coordination interface between the Kafka brokers and consumers. The Kafka servers store basic metadata in ZooKeeper and share information about topics, brokers, consumer offsets (queue readers), and so on.
Since ZooKeeper does its own layers of replication, the failure of a Kafka broker does not affect the state of the Kafka cluster. Even if ZooKeeper fails, Kafka will restore the state once ZooKeeper restarts. This gives zero downtime for Kafka. ZooKeeper also manages the alternative leader broker selection, in case of a Kafka leader failure.

Kafka Producer example in Java
// Configure the producer; key and value serializers are required
Properties config = new Properties();
config.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:8082");
config.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
config.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(config);
ProducerRecord<byte[], byte[]> record = new ProducerRecord<>("topic", "key".getBytes(), "value".getBytes());
Future<RecordMetadata> response = producer.send(record);
Conclusion

Big data is ingested using a dedicated system. These often take the form of messaging systems. Publish-and-subscribe systems are efficient ways of delivering data from many sources to many targets, in a reliable, secure, and efficient way. Kafka is an open-source, reliable, secure, and scalable publish-subscribe messaging system. It deals with producers as well as consumers of data. Messages are published to a set of central topics. Each consumer can subscribe to any number of topics. Kafka uses a leader-follower system of managing replicated partitions for the same set of data, to ensure full reliability and zero downtime.
Review Questions

Q1: What is a data ingest system? Why is it an important topic?

Q2: What are the two ways of delivering data from many sources to many targets?

Q3: What is Kafka? What are its advantages? Describe 3 use cases of Kafka.

Q4: What is a topic? How does it help with data ingest management?
References

1. http://kafka.apache.org/documentation.html#introduction
Chapter 9 – Cloud Computing Primer

Cloud computing is a cost-effective and flexible mode of delivering IT infrastructure as a service to clients, over the internet, on a metered basis. The cloud computing model offers clients enormous flexibility to use as much IT capacity (compute, storage, network) as needed, without having to invest in dedicated IT capacity of one's own. The IT usage can be scaled up or down in minutes. The complex IT infrastructure management skills are all owned by the cloud computing provider, and problems can be resolved much faster. The client can simply access a smoothly running IT infrastructure over a fast internet connection. IT capacity in the cloud can be purchased as a custom package depending upon one's needs in terms of average and peak IT requirements. The computing cloud is the ultimate cosmic computer aligned with all laws of nature.
Introduction

Managing very large and fast data streams is a huge challenge. It requires making critical decisions about storage, structure, and access. This data would be stored in large clusters of hundreds or thousands of inexpensive computers. Such clusters are often called server farms. The location and size of such clusters impacts costs. The server farms may be located in one's own data centers, or they may be rented from specialized third-party organizations called cloud computing service providers.
Cloud computing provides IT leadership a cost-effective and predictable solution for reliably meeting their large data management needs. There are many vendors offering this service. Prices keep dropping regularly, because IT components keep getting cheaper, there is a growing volume of business, and there is effective competition. With cloud computing, the IT expense becomes an operating expense rather than a capital expense. The costs of IT become aligned with revenue streams, which makes cash flow management easier.
One of the main reasons for enterprises moving to cloud computing is to experiment with new and risky projects. This flexible model makes it much easier to launch new products and services, without being exposed to the risk of a heavy loss in IT infrastructure. For example, a new Hollywood movie's site will have millions of visitors for a month before and a month after the movie's release date. After that, the visits to the website will drop dramatically. The website owner would benefit enormously from using a cloud computing model, where they pay for the peak web usage capacity for those few months, and much less as the usage drops. More importantly, the flexibility ensures that their website will not crash in case the movie becomes a super-hit and attracts an unusually large number of visitors.
Cloud Computing Characteristics

Here are the major characteristics of a cloud computing model.

1. Flexible capacity: The capacity can scale up rapidly. One can expand and reduce resources according to one's specific service requirements, as and when needed. The cloud internally does regular workload balancing among the needs of millions of clients, and this helps bring down costs for everyone.
2. Attractive payment model: Cloud computing works on a pay-per-use model, i.e. one pays only for what one uses, and for how long one uses it. IT costs become an operating expense rather than a capital expense for the client. The resource prices may be negotiated at long-term contract rates, and can also be purchased at spot market rates.
3. Resiliency and security: The failure of any individual server or storage resource does not impact the user. The servers and storage for all clients are isolated to maximize the security of data.
In-house storage

Most organizations have data centers for running their regular IT operations. An organization may decide to expand its own data center to store large streams of data. The organization can ensure complete security and privacy of its data if it keeps all the data in-house. However, the costs and complexity of managing this data are increasing, and it is not cost-effective for every organization to manage huge data centers. Hiring and retaining scarce advanced skills to manage such data centers would also be a challenge.
Cloud storage

It is now becoming a trend for organizations to choose to store their data in massive data centers owned by other specialized companies. Their data and processing capacity resides in some sort of a huge cloud out there, which is accessible from anywhere, anytime, through a simple internet connection.
Companies like Amazon, Google, Microsoft, Apple, and IBM are among the major providers of cloud storage and computing services around the world. They own and operate data centers with millions of computers in them.
Figure 9-1: A cloud computing data center
Commercially, cloud service providers are able to consolidate the requirements of thousands or millions of customers, and supply flexible amounts of data storage and computing facilities to clients on a per-usage basis. This pay model is similar to how electric utility companies charge consumers for their usage of electricity in homes and offices. Cloud computing offers much lower costs per use, just as using the electric utility costs much less than owning and operating one's own electricity generators.
A major disadvantage of cloud storage is that the data is stored away from one's physical control. Thus the security of precious data is left in the hands of the cloud computing provider. While security protocols are rapidly improving, there are no failsafe methods for securing data in the cloud. There is also a risk of being locked into one provider's infrastructure. Even so, the cost-benefit tradeoffs have definitely tilted towards using cloud computing providers. At some future point in time, the cloud service providers might be heavily regulated like the electric utilities.
Cloud Computing: Evolution of Virtualized Architecture

Cloud computing is essentially a commercial model for virtualized server infrastructure. IBM began to offer time-sharing services on its mainframe computers in the 1960s. Now that same technology is offered on networks of small machines through the virtualization process.
Virtualization assumes that logical machines can be differentiated from physical machines. A physical server could run multiple Virtual Machines (VMs); and one virtual machine may span multiple physical servers. The virtualization software is called a hypervisor. It abstracts all machines into Virtual Machines, using an easy GUI interface. Virtualization software can typically run on a heterogeneous physical infrastructure, and convert all IT capacity into a single unified capacity. This capacity can then be provisioned in slices and packages. The user applications are not aware that they are running in a virtualized environment; they run as if on a dedicated machine. The applications can also run on top of their own native operating systems.
Cloud Service Models

There are two major dimensions to conceptualize cloud computing models: the scope of services received; and the control over, and cost of, those services.

1. The range of cloud computing services from a cloud computing provider falls in three broad buckets:

1. Infrastructure as a service (IaaS): This is the lowest level of services, and includes only raw capacity of compute, storage, and networking. The price for this service is the lowest.
2. Platform as a service (PaaS): This includes IaaS, along with other technologies and services. These are still very general tools, such as an open-source Hadoop, Spark, or Cassandra implementation, along with certain monitoring tools. The costs are a little higher because of the additional management and monitoring services provided by the provider.
3. Software as a service (SaaS): This includes the computing platform as well as business applications that get work done. For example, salesforce.com was one of the first CRM applications sold only on a SaaS model. Google sells an email service to organizations on a per-user-per-month basis. This is also the most expensive type of cloud service.

2. The other way the cloud services differ is in terms of ownership and control.

1. Public cloud: This is a large shared infrastructure made available to one and all, in a low-cost and multi-tenancy model. The client can access it using any device. The downside is that the data also resides on the cloud, and thus could be vulnerable to theft or hacking. The costs to the client are low, and variable depending upon use.
2. Private cloud: This is a cloud version of an in-house IT infrastructure. The organization will have exclusive control over the entire infrastructure. The costs would be fixed and higher.
3. Hybrid cloud: This is a mix of flexibility of capacity, and much control over some key aspects of it. One could retain complete control over critical applications, while using shared infrastructure for non-critical applications.

All levels of infrastructure and pay models are useful, as they serve different levels of needs for client organizations. However, most of the growth in cloud computing is happening because of the attractiveness of the low cost of the public cloud model.
Cloud Computing Myths

There are a couple of misconceptions about the costs and benefits of cloud computing.

1. Myth: Public cloud computing would satisfy all the requirements: scalability, flexibility, pay-per-use, resilience, multi-tenancy, and security. In reality, depending upon the type of service selected (SaaS, IaaS, or PaaS), the service can satisfy only specific subsets of these requirements.
2. Myth: Cloud computing would be useful only if you are outsourcing your IT functions to an external service provider. In reality, one could use a private cloud computing model for a section of IT applications, to offer on-demand, scalable, and pay-per-use deployments within your enterprise's own data center.
Cloud Computing: Getting Started

Here is a framework for cloud adoption. Learn more about the context for getting benefits from cloud computing. Select the right model and level of cloud capacity. Set up the applications, and a monitoring system for those applications and the total cloud footprint. Choose a service provider, say Amazon Web Services, the leading provider of cloud computing. Use Appendix A to install Hadoop on AWS EC2 public cloud infrastructure.
Conclusion

Cloud computing is a business model to provide shared, flexible, cost-effective IT infrastructure to get started quickly on building an application. For Big Data applications, it can be even more attractive to test the system using rented facilities, before deciding to invest in dedicated IT infrastructure.
Review Questions

Q1: Describe the cloud computing model.

Q2: What are the advantages of cloud computing over in-house computing?

Q3: Describe the technical architecture for cloud computing.

Q4: Name a few major providers of cloud computing services.
Section 3

This section covers the other relevant concepts and tutorials for effectively managing and utilizing Big Data.

Chapter 10 will bring all the tools together in a case study of developing a web log analyzer, as an example of a useful Big Data application.

Chapter 11 will cover an overall view of Data Mining tools and techniques to extract benefit from Big Data.

Appendix 1 shows, step by step, the way to install a Hadoop cluster on a cloud computing platform.

Appendix 2 is a tutorial on installing and running Spark.
Chapter 10 – Web Log Analyzer application case study
Introduction

A web log analyzer is an automated software tool that helps to analyze and make decisions on a number of issues regarding web application server logs. An ideal web log analyzer would analyze unlimited streams of data and help keep the entire universe running smoothly and without fault. This would be done by eliminating the need for manually accessing the logs, automating the flow of information, and alerting the system administrator as needed.
Client-Server Architecture

Every web-based application runs on a client-server architecture. Clients are entities that access servers, and servers are entities that respond to the client with a solution. A lot of clients simultaneously try to access servers. The servers may be a database server, network server, application server, or any server in the n-tier architecture. For each request, a log entry is generated. The speed of access requests determines the rate of the stream of log entries. This leads to a potentially huge log over time. The log can be processed as a stream of data. The log can also be stored on the servers for later analysis.
Logs can be used for monitoring, audit, and analysis purposes. They can help with error diagnostics in case a website becomes slow or goes down. Logs can be analyzed to detect hacking activity. They can also be analyzed to summarize the popularity of web pages, and the distribution of the page requesters. They can help with tracking access volumes, and with scaling the infrastructure up or down.
Web Log Analyzer

The log analyzer receives streaming logs from a server location, and analyzes multiple things using many algorithms to generate the desired results. The system is completely automated. The log is produced, and it is consumed to make real-time reports. It is easy to imagine the massive data flow produced by the log in the server environment, while it is also being analyzed simultaneously on the administrator side.
Requirements
This is a log analyzer to analyze a web application hosted on a server. It is a busy application owned by a big company. It receives more than 15,000 web access requests per hour. All the access requests need to be logged, and dumped to the Hadoop file system periodically. The analyzer is required to ingest real-time log data, and filter out a part of the data for analyzing and dumping to HDFS. It has to do streaming data flow management as well as batch processing. The analyzer needs to process the data before it is dumped into HDFS, and also after it is put into HDFS. The system administrators should be alerted in real time about possible threats, overloads, delays, potential errors, and any other damage. The results of all the analyses need to be stored in a database for later presentation in a graphical format. The results have to be made available for any period of time, without any missing time values. The log data has to be preserved for the future without losing any of it.
Solution Architecture
Get streaming data using Apache Flume, and send it to HDFS. Use Apache Spark as the data flow management platform and processing engine. Store the results of analysis in MongoDB. This is a safe solution, because the data gets stored in the Hadoop cluster and is available for future requirements, even while it is being analyzed in real time. The results of real-time processing also go into MongoDB.
Fig 10.1: Web Log Analyzer Architecture
Benefits of this solution

The advantages of this solution are:

1. Real-time logging and analysis: data generated on the server is streamed directly to HDFS by the Flume agent without delay. Every log entry generated at every single point of time is analyzed and used for monitoring and decision making.
2. Automatic log handling and storage. Loading data into HDFS normally requires manually running certain Hadoop commands. This log analyzer uses a Flume agent or Spark streaming to handle all data on its own, without any externally managed efforts.
3. Easy and convenient implementation using built-in and easy-to-customize machine learning algorithms in Spark.
4. Easy error handling, server request handling, and overall server performance optimization. It makes servers smarter by keeping track of almost every aspect of the server.
Technology stack

The technology stack used for this application is shown below. A brief description of each component follows.

1. Apache Spark v2
2. Hadoop 2.6.0 cdh5
3. Apache Flume
4. Scala, Java
5. MongoDB
6. RESTful web services
7. Front-end UI tools
8. Linux shell scripts
Apache Spark

Spark is a fast, in-memory-based cluster computing technology, designed for fast and streaming computation. It is built on top of the Hadoop and MapReduce system, and it extends the MapReduce model to more types of computation, including interactive queries and stream processing. It has lots of libraries and packages, like machine learning (MLlib), graph computation (GraphX), etc. It claims to execute 10 to 100 times faster than Hadoop because of its in-memory computation model. It also supports multiple languages such as Scala, Python, Java, and R.
Spark Deployment

1. Standalone
2. Hadoop YARN
3. SIMR (Spark in MapReduce)
4. Mesos
Components of Spark

Spark SQL: Data abstraction called SchemaRDD, which provides support for structured and semi-structured data.

Spark Streaming: Ingests data in mini-batches and performs RDD transformations on those mini-batches, enabling streaming data analytics using RDDs.

MLlib (machine learning): A distributed machine learning framework, which operates in-memory at high speed, and offers many ML algorithms.

GraphX: This distributed graph-processing framework provides an API for many graph computation algorithms.

Spark Core: This is the general execution engine for the Spark platform, upon which all other functionality is built. It takes care of task dispatching and scheduling, and basic I/O functionality.

Spark shell: A powerful tool to analyze data interactively, available in Scala and Python. Spark's primary data abstraction is an in-memory collection of items called an RDD. It can be created from Hadoop input formats (like HDFS files), or by transforming existing RDDs using filters and maps into new RDDs.
Scripting and programming model using SparkContext: One can use an IDE to develop and test the analytics code. One can then create a jar to run the analytics on the Hadoop architecture. The jar can also be submitted to the Spark engine using the spark-submit utility. For example:

spark-submit --class apache.accesslogs.ServerLogAnalyzer --master "local[*]" ScalaSpark/Scala1/target/scala-2.10/Scala1-assembly-1.0.jar > output.txt
HDFS

HDFS is a distributed file system that is at the core of the Hadoop system.

- Deployed on low-cost commodity hardware
- Fault tolerant
- Supports batch processing
- Designed for large data sets or large files
- Maintains coherence through write-once, read-many-times access
- Moves computation to the location of the data
MongoDB

MongoDB is a document-oriented database. It came into existence as a NoSQL database.
Apache Flume

Flume is an open-source tool for handling streaming logs or data. It is a distributed and reliable system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store. It is a popular tool to assist with data flow and storage into HDFS. Flume is not restricted to log data. The data sources are customizable, so it can handle any source, such as event data, traffic data, social media data, or any other data source. The major components of Flume are:

- Event
- Agent
- Data Generators
- Centralized Stores
Overall Application Logic

The system reads access logs and presents the results in tabular and graphical form to end users. This system provides the following major functions:

1. Calculate content size
2. Count response codes
3. Analyze requesting IP addresses
4. Manage endpoints
Technical Plan for the Application

Technically, the project follows this structure:

1. Flume takes the streaming log from the running application server and stores it in HDFS. Flume uses compression to store huge log files, to speed up the data transfer and for storage efficiency.
2. Apache Spark uses HDFS as the input source and analyzes the data using MLlib. Apache Spark stores the analyzed data in MongoDB.
3. A RESTful Java service fetches JSON objects from MongoDB and sends them to the front end. Graphical tools are used to present the data.
Scala Spark code for log analysis

Note: This application is written in the Scala language. Below is the operative part of the code. Visit the GitHub link below for the complete Scala code for this application.
// Calculates size of log entries, and provides min, max and average size.
// Caching is done for repeatedly used RDDs.
def calcContentSize(log: RDD[AccessLogs]) = {
  val size = log.map(log => log.contentSize).cache()
  val average = size.reduce(_ + _) / size.count()
  println("Content Size :: Average :: " + average + " " +
    "|| Maximum :: " + size.max() + " || Minimum :: " + size.min())
}

// Sends all the response codes with their frequency of occurrence as output
def responseCodeCount(log: RDD[AccessLogs]) = {
  val responseCount = log.map(log => (log.responseCode, 1))
    .reduceByKey(_ + _)
    .take(1000)
  println(s"""Response Codes Count: ${responseCount.mkString("[", ",", "]")}""")
}

// Filters IP addresses that have more than one request in the server log
def ipAddressFilter(log: RDD[AccessLogs]) = {
  val result = log.map(log => (log.ipAddr, 1))
    .reduceByKey(_ + _)
    .filter(count => count._2 > 1)
    // .map(_._1).take(10)
    .collect()
  println(s"IP Addresses Count :: ${result.mkString("[", ",", "]")}")
}
Sample Log Data

Sample Input Data:

Input Fields (selected fields):

Certain fields have been omitted to keep the code clear. The response code field is the basis of the major reports.

1. ipAddress: String
2. dateTime: String
3. method: String
4. endPoint: String
5. protocol: String
6. responseCode: Long
7. contentSize: Long
Sample Input Rows of Data:

64.242.88.10 [07/Mar/2014:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846

64.242.88.10 [07/Mar/2014:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523

64.242.88.10 [07/Mar/2014:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291

64.242.88.10 [07/Mar/2014:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352

64.242.88.10 [07/Mar/2014:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253

64.242.88.10 [07/Mar/2014:16:23:12 -0800] "GET /twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore&param1=1.12&param2=1.12 HTTP/1.1" 200 11382
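As an illustration, one such row can be parsed into the fields listed above with a short Python sketch. The regular expression and dictionary keys here are illustrative, not the application's actual Scala parser:

```python
import re

# Matches the sample rows above: ip [datetime] "method endpoint protocol" code size
LOG_PATTERN = re.compile(
    r'(\S+)\s+\[([^\]]+)\]\s+"(\S+)\s(\S+)\s(\S+)"\s+(\d{3})\s+(\d+)')

def parse_line(line):
    # Return a dict keyed by the input field names, or None if the line
    # does not match the expected access-log layout.
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ip, dt, method, endpoint, protocol, code, size = m.groups()
    return {"ipAddress": ip, "dateTime": dt, "method": method,
            "endPoint": endpoint, "protocol": protocol,
            "responseCode": int(code), "contentSize": int(size)}

row = parse_line('64.242.88.10 [07/Mar/2014:16:10:02 -0800] '
                 '"GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291')
print(row["responseCode"], row["contentSize"])  # 200 6291
```

Once each line is reduced to such a record, the counting and filtering steps in the Scala code above become simple map and reduce operations over the fields.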
Sample Output of Web Log Analysis

Content Size :: Average :: 10101 || Maximum :: 138789 || Minimum :: 0

Response Codes Count: [(401,113), (200,591), (302,1)]

IP Addresses Count :: [(127.0.0.1,31), (207.195.59.160,15), (67.131.107.5,3), (203.147.138.233,13), (64.242.88.10,452), (10.0.0.153,188)]

EndPoints :: [(/wap/Project/login.php,15), (/cgi-bin/mailgraph.cgi/mailgraph_2.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_0.png,12), (/wap/Project/loginsubmit.php,12), (/cgi-bin/mailgraph.cgi/mailgraph_2_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_1.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_0_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_1_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_3_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_3.png,12)]
Intermediate data is stored in the Hadoop file system in CSV format.
To see the detailed code, visit: https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzer.scala
This web log analyzer can be enhanced in many ways. For example, it can analyze the history of logs from previous years and discover web access trends. The application can also be made to move data older than 5 years into permanent backup storage.
Conclusion and Findings

There are more than 100 technologies in and around the Apache ecosystem. The most basic is the MapReduce technique used by the Hadoop engine. Many stacks are available on top of MapReduce. It is important to incorporate the right set of elements to develop the right stack for a particular large-scale data analytics task. A few strong technologies like HDFS, Spark, Hive, MongoDB, and Flume/Kafka are likely to make a big data application powerful and worthy.
It is also useful to experiment with many other technologies during the development of this log analyzer. Flume and Kafka are the most powerful tools to handle streaming data. Spark has its own streaming API, but it is not easy to incorporate with HDFS storage. Developing this application also helps one learn Linux-based tasks and shell scripts, along with some data handling tools like AWK and the stream editor (sed).
This application reduces the burden of manual handling of logs on database, application, or history servers. Moreover, it helps to present the analyzed data in an impressive way that leads to easy decision making. This application came into development after doing much research on big data tools such as Apache Spark. That saved a lot of time and cost later. It was developed using agile development practices.
Review Questions

Q1: Describe the advantages of a web log analyzer.

Q2: Describe the major challenges in developing this application.

Q3: Check out the references below. Identify 3-4 major lessons learned from the code and video.
Chapter 11 – Data Mining Primer
Data mining is the art and science of discovering knowledge, insights, and patterns in data. It is the act of extracting useful patterns from an organized collection of data. Patterns must be valid, novel, potentially useful, and understandable. The implicit assumption is that data about the past can reveal patterns of activity that can be projected into the future.
Data mining is a multidisciplinary field that borrows techniques from a variety of fields. It utilizes the knowledge of data quality and data organizing from the databases area. It draws modeling and analytical techniques from statistics and computer science (artificial intelligence) areas. It also draws the knowledge of decision-making from the field of business management.
The field of data mining emerged in the context of pattern recognition in defense, such as identifying friend-or-foe on a battlefield. Like many other defense-inspired technologies, it has evolved to help gain a competitive advantage in business.
For example, "customers who buy cheese and milk also buy bread 90 percent of the time" would be a useful pattern for a grocery store, which can then stock the products appropriately. Similarly, "people with blood pressure greater than 160 and an age greater than 65 were at a high risk of dying from a heart stroke" is of great diagnostic value for doctors, who can then focus on treating such patients with urgent care and great sensitivity.
Past data can be of predictive value in many complex situations, especially where the pattern may not be so easily visible without the modeling technique. Here is a dramatic case of a data-driven decision-making system that beat the best of human experts. Using past data, a decision tree model was developed to predict the votes of Justice Sandra Day O'Connor, who had a swing vote in a 5-4 divided US Supreme Court. All her previous decisions were coded on a few variables. What emerged from data mining was a simple four-step decision tree that was able to accurately predict her votes 71 percent of the time. In contrast, the legal analysts could at best predict correctly 59 percent of the time. (Source: Martin et al. 2004)
Gathering and selecting data

To learn from data, quality data needs to be effectively gathered, cleaned and organized, and then efficiently mined. One requires the skills and technologies for consolidation and integration of data elements from many sources.
Gathering and curating data takes time and effort, particularly when it is unstructured or semi-structured. Unstructured data can come in many forms, like databases, blogs, images, videos, audio, and chats. There are streams of unstructured social media data from blogs, chats, and tweets. There are streams of machine-generated data from connected machines, RFID tags, the internet of things, and so on. Eventually the data should be rectangularized, that is, put in rectangular data shapes with clear columns and rows, before submitting it to data mining.
Knowledge of the business domain helps select the right streams of data for pursuing new insights. Only the data that suits the nature of the problem being solved should be gathered. The data elements should be relevant, and suitably address the problem being solved. They could directly impact the problem, or they could be a suitable proxy for the effect being measured. Select data could also be gathered from the data warehouse. Every industry and function will have its own requirements and constraints. The healthcare industry will provide a different type of data with different data names. The HR function would provide different kinds of data. There would be different issues of quality and privacy for these data.
Data cleansing and preparation

The quality of data is critical to the success and value of the data mining project. Otherwise, the situation will be of the kind of garbage in, garbage out (GIGO). The quality of incoming data varies by the source and nature of the data. Data from internal operations is likely to be of higher quality, as it will be accurate and consistent. Data from social media and other public sources is less under the control of the business, and is less likely to be reliable.
Data almost certainly needs to be cleansed and transformed before it can be used for data mining. There are many ways in which data may need to be cleansed (filling missing values, reining in the effects of outliers, transforming fields, binning continuous variables, and much more) before it can be ready for analysis. Data cleansing and preparation is a labor-intensive or semi-automated activity that can take up to 60-80% of the time needed for a data mining project.
Outputs of Data Mining

Data mining techniques can serve different types of objectives. The outputs of data mining will reflect the objective being served. There are many ways of representing the outputs of data mining.
One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps visually follow the steps to make a model-based decision. The tree may have certain attributes, such as probabilities assigned to each branch. A related format is a set of business rules, which are if-then statements that show causality. A decision tree can be mapped to business rules. If the objective function is prediction, then a decision tree or business rules are the most appropriate mode of representing the output.
The output can be in the form of a regression equation or mathematical function that represents the best-fitting curve to represent the data. This equation may include linear and nonlinear terms. Regression equations are a good way of representing the output of classification exercises. These are also a good representation of forecasting formulae.
A population "centroid" is a statistical measure for describing the central tendencies of a collection of data points. These might be defined in a multidimensional space. For example, a centroid could be "middle-aged, highly educated, high-net-worth professionals, married with two children, living in the coastal areas". Or a population of "20-something, ivy-league-educated tech entrepreneurs based in Silicon Valley". Or it could be a collection of "vehicles more than 20 years old, giving low mileage per gallon, which failed environmental inspection". These are typical representations of the output of a cluster analysis exercise.
Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule. For example, those that buy milk and bread will also buy butter (with 80 percent probability).
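The probability attached to such a rule is its confidence: the share of baskets containing the "if" part that also contain the "then" part. A minimal sketch with made-up baskets (the function name and data are invented here, not output from a real rule miner):

```python
def rule_confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, estimated as the
    fraction of baskets containing the antecedent that also contain the
    consequent."""
    a, c = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if a <= t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if c <= t]
    return len(with_both) / len(with_antecedent)

baskets = [{"milk", "bread", "butter"},
           {"milk", "bread"},
           {"milk", "bread", "butter"},
           {"milk", "bread", "butter"},
           {"bread", "eggs"},
           {"milk", "bread", "butter"}]
conf = rule_confidence(baskets, {"milk", "bread"}, {"butter"})
print(conf)  # 0.8: milk-and-bread buyers also buy butter 80% of the time
```

Of the five baskets containing both milk and bread, four also contain butter, giving the 80 percent figure in the example rule.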
Evaluating Data Mining Results

There are two primary kinds of data mining processes: supervised learning and unsupervised learning. In supervised learning, a decision model can be created using past data, and the model can then be used to predict the correct answer for future data instances. Classification is the main category of supervised learning activity. There are many techniques for classification, decision trees being the most popular one. Each of these techniques can be implemented with many algorithms. A common metric for all classification techniques is predictive accuracy.
PredictiveAccuracy=(CorrectPredictions)/TotalPredictionsSupposeadataminingprojecthasbeeninitiatedtodevelopapredictivemodelforcancerpatientsusingadecisiontree.Usingarelevantsetofvariablesanddatainstances,adecisiontreemodelhasbeencreated.Themodelisthenusedtopredictotherdatainstances.Whenatruepositivedatapointispositive,thatisacorrectprediction,calledatruepositive(TP).Similarly,whenatruenegativedatapointisclassifiedasnegative,thatisatruenegative(TN).Ontheotherhand,whenatrue-positivedatapointisclassifiedbythemodelasnegative,thatisanincorrectprediction,calledafalsenegative(FN).Similarly,whenatrue-negativedatapointisclassifiedaspositive,thatisclassifiedasafalsepositive(FP).Thisisrepresentedusingtheconfusionmatrix(Figure4.1).
Confusion Matrix                True Class
                        Positive             Negative
Predicted    Positive   True Positive (TP)   False Positive (FP)
Class        Negative   False Negative (FN)  True Negative (TN)

Figure 10.1: Confusion Matrix
Thus the predictive accuracy can be specified by the following formula:

Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN)
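The accuracy formula can be checked with a small function. The four counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Predictive accuracy from the four confusion-matrix cells.
# The counts used here are hypothetical, for illustration only.
def predictive_accuracy(tp, tn, fp, fn):
    """Accuracy = correct predictions / total predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# 45 true positives, 40 true negatives, 5 false positives, 10 false negatives
acc = predictive_accuracy(tp=45, tn=40, fp=5, fn=10)
print(acc)  # 0.85
```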
All classification techniques have a predictive accuracy associated with a predictive model. The highest value can be 100%. In practice, predictive models with more than 70% accuracy can be considered usable in business domains, depending upon the nature of the business.
There are no good objective measures to judge the accuracy of unsupervised learning techniques such as cluster analysis. There is no single right answer for the results of these techniques. For example, the value of a segmentation model depends upon the value the decision-maker sees in those results.
Data Mining Techniques

Data may be mined to help make more efficient decisions in the future. Or it may be used to explore the data to find interesting associative patterns. The right technique depends upon the kind of problem being solved (Figure 10.2).
Data Mining Techniques
- Supervised Learning (predictive ability based on past data)
  - Classification - Machine Learning: Decision Trees, Neural Networks
  - Classification - Statistics: Regression
- Unsupervised Learning (exploratory analysis to discover patterns)
  - Clustering Analysis
  - Association Rules

Figure 10.2: Important Data Mining Techniques
The most important class of problems solved using data mining are classification problems. Classification techniques are called supervised learning, as there is a way to supervise whether the model is providing right or wrong answers. These are problems where data from past decisions is mined to extract the few rules and patterns that would improve the accuracy of the decision-making process in the future. The data of past decisions is organized and mined for decision rules or equations, which are then codified to produce more accurate decisions.
Decision trees are the most popular data mining technique, for many reasons.

1. Decision trees are easy to understand and easy to use, by analysts as well as executives. They also show a high predictive accuracy.
2. Decision trees select the most relevant variables automatically out of all the available variables for decision making.
3. Decision trees are tolerant of data quality issues and do not require much data preparation from the users.
4. Even non-linear relationships can be handled well by decision trees.

There are many algorithms to implement decision trees. Some of the popular ones are C5, CART, and CHAID.
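Because a decision tree maps directly to if-then business rules, the idea can be sketched in a few lines of Python. The one-split "stump" below, deciding loan outcomes from income and debt, is entirely hypothetical:

```python
# A decision tree is equivalent to nested if-then rules.
# This hypothetical tree decides a loan application from two attributes.
def predict(record):
    if record["income"] > 50000:        # first split on income
        if record["has_debt"]:          # second split on existing debt
            return "review"
        return "approve"
    return "deny"

print(predict({"income": 60000, "has_debt": False}))  # approve
print(predict({"income": 30000, "has_debt": True}))   # deny
```

Real decision-tree algorithms such as C5 or CART learn these split points and attribute choices automatically from past data.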
Regression is among the most popular statistical data mining techniques. The goal of regression is to derive a smooth, well-defined curve that best fits the data. Regression analysis techniques, for example, can be used to model and predict energy consumption as a function of daily temperature. Simply plotting the data may show a non-linear curve. Applying a non-linear regression equation will fit the data with high accuracy. Once such a regression model has been developed, the energy consumption on any future day can be predicted using this equation. The accuracy of the regression model depends entirely upon the dataset used and not at all on the algorithm or tools used.
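A minimal sketch of the idea, fitting a straight line y = a + b·x by ordinary least squares. The temperature and energy numbers are hypothetical, constructed to be exactly linear so the fitted coefficients are easy to verify:

```python
# Ordinary least-squares fit of a straight line y = a + b*x.
# The data points below are hypothetical (temperature vs. energy use).
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: covariance of x and y divided by variance of x
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x     # intercept
    return a, b

temps = [10, 15, 20, 25, 30]
energy = [50, 45, 40, 35, 30]   # perfectly linear, slope -1
a, b = fit_line(temps, energy)
print(a, b)  # 60.0 -1.0
```

The fitted equation (energy = 60 − 1·temperature here) can then be used to predict consumption for any future day's temperature.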
Artificial Neural Networks (ANN) are a sophisticated data mining technique from the Artificial Intelligence stream in Computer Science. They mimic the behavior of the human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. A decision task may be processed by just one neuron, and the result may be communicated soon. Alternatively, there could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. The neural network can be trained by making a decision over and over again with many data points. It will continue to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions. The intermediate values passed within the layers of neurons may not make any intuitive sense to an observer. Thus, neural networks are considered a black-box system.
Cluster Analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for automatic identification of natural groupings of things. Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. There can be any number of clusters produced from the data. The K-means technique is a popular clustering technique, and allows the user guidance in selecting the right number (K) of clusters from the data. Clustering is also known as the segmentation technique. It helps divide and conquer large data sets. The technique shows the clusters of things from past data. The output is the centroids for each cluster and the allocation of data points to their cluster. The centroid definition is used to assign new data instances to their cluster homes. Clustering is also a part of the artificial intelligence family of techniques.
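K-means can be sketched in a few lines for one-dimensional data. This is a toy illustration with hypothetical ages forming two natural groups; a real project would use a library implementation and multidimensional data:

```python
import random

# A minimal 1-D K-means sketch (K clusters). The data are hypothetical;
# real projects would use a library implementation such as scikit-learn.
def kmeans_1d(points, k=2, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)        # pick k starting centroids
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

ages = [22, 25, 24, 61, 58, 64]   # two natural age groups
print(kmeans_1d(ages))            # roughly [23.67, 61.0]
```

The output centroids summarize each segment; a new data instance is assigned to whichever centroid it is nearest to.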
Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps in answering questions about cross-selling opportunities. This is the heart of the personalization engine used by e-commerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X → Y, where X and Y are sets of data items. A form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers. There are just stronger and weaker affinities. Thus, each rule has a confidence level assigned to it. A part of the machine learning family, this technique achieved legendary status when a fascinating relationship was found in the sales of diapers and beers.
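The confidence of a rule X → Y is simply the fraction of transactions containing X that also contain Y. A sketch over a handful of hypothetical baskets:

```python
# Confidence of a market-basket rule X -> Y over hypothetical transactions.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
]

def confidence(x, y):
    """Fraction of baskets containing all of x that also contain all of y."""
    has_x = [b for b in baskets if x <= b]          # x is a subset of basket
    has_xy = [b for b in has_x if y <= b]
    return len(has_xy) / len(has_x)

# "Those who buy milk and bread also buy butter"
print(confidence({"milk", "bread"}, {"butter"}))  # 0.75
```

Algorithms such as Apriori search for all rules whose confidence (and support) exceed chosen thresholds.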
Mining Big Data

As data grows larger and larger, there are a few ways in which analyzing Big Data is different.

From Causation to Correlation

There is more data available than there are theories and tools to explain it. Historically, theories of human behavior, and theories of the universe in general, have been intuited and tested using limited and sampled data, with some statistical confidence level. Now that data is available in extremely large quantities about many people and many factors, there may be too much noise in the data to articulate and test clean theories. In that case, it may suffice to value co-occurrences or correlations of events as significant, without necessarily establishing strong causation.

From Sampling to the Whole

Pooling all the data together into a single big data system can help discover events that bring about a fuller picture of the situation, and highlight threats or opportunities that an organization faces. Working from the full data set can enable discovering remote but extremely valuable insights. For example, an analysis of the purchasing habits of millions of customers and their billions of transactions at thousands of stores can give an organization a vast, detailed and dynamic view of sales patterns in the company, which may not be available from the analysis of small samples of data by each store or region.

From Dataset to Datastream

A flowing stream has a perishable and unlimited connotation to it, while a dataset has a finitude and permanence about it. With any given infrastructure, one can only consume so much data at a time. Data streams are many, large and fast. Thus one has to choose which of the many streams of data one wants to engage with. It is equivalent to deciding which stream to fish in. The metrics used for analysis of streams tend to be relatively simple and relate to the time dimension. Most of the metrics are statistical measures such as counts and means. For example, a company might want to monitor customer sentiment about its products. So it could create a social media listening platform that would read all tweets and blog posts about the company in real time. This platform would (a) keep a count of positive and negative sentiment messages every minute, and (b) flag any messages that merit attention, such as sending an online advertisement or purchase offer to that customer.
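A per-minute sentiment counter of the kind described can be sketched with simple counts. The message stream and the keyword-based scoring below are hypothetical stand-ins for a real sentiment classifier:

```python
from collections import Counter

# A sketch of a per-minute sentiment counter for a message stream.
# The keyword lists and messages are hypothetical; a real platform
# would use a proper sentiment classifier.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "broken", "awful"}

def tally(messages):
    """Count positive/negative messages per minute of the stream."""
    counts = Counter()
    for minute, text in messages:          # each item: (minute, message text)
        words = set(text.lower().split())
        if words & POSITIVE:
            counts[(minute, "positive")] += 1
        if words & NEGATIVE:
            counts[(minute, "negative")] += 1
    return counts

stream = [(0, "I love this product"), (0, "it arrived broken"),
          (1, "great support team")]
print(tally(stream))
```

In a real streaming platform these counts would be maintained incrementally over a sliding time window rather than over a stored list.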
Data Mining Best Practices

Effective and successful use of data mining requires both business and technology skills. The business aspects help in understanding the domain and the key questions. They also help one imagine possible relationships in the data, and create hypotheses to test them. The IT aspects help fetch the data from many sources, clean up the data, assemble it to meet the needs of the business problem, and then run the data mining techniques on the platform.
An important element is to go after the problem iteratively. It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps. There are several best practices learned from the use of data mining techniques over a long period of time. The data mining industry has proposed a Cross-Industry Standard Process for Data Mining (CRISP-DM). It has six essential steps (Figure 10.3):

1. Business Understanding: The first and most important step in data mining is asking the right business questions. A question is a good one if answering it would lead to large payoffs for the organization, financially and otherwise. In other words, selecting a data mining project is like any other project, in that it should show strong payoffs if the project is successful. There should be strong executive support for the data mining project, which means that the project aligns well with the business strategy. A related important step is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in terms of a proposed model as well as in the datasets available and required.
Figure 10.3: CRISP-DM Data Mining cycle
2. Data Understanding: A related important step is to understand the data available for mining. One needs to be imaginative in scouring for many elements of data through many sources that can help address the hypotheses to solve the problem. Without relevant data, the hypotheses cannot be tested.
3. Data Preparation: The data should be relevant, clean and of high quality. It is important to assemble a team that has a mix of technical and business skills, who understand the domain and the data. Data cleaning can take 60-70% of the time in a data mining project. It may be desirable to continue to experiment with and add new data elements from external sources of data that could help improve predictive accuracy.
4. Modeling: This is the actual task of running many algorithms using the available data to discover if the hypotheses are supported. Patience is required in continuously engaging with the data until the data yields some good insights. A host of modeling tools and algorithms should be used. A tool could be tried with different options, such as running different decision tree algorithms.
5. Model Evaluation: One should not accept what the data says at first. It is better to triangulate the analysis by applying multiple data mining techniques, and conducting many what-if scenarios, to build confidence in the solution. One should evaluate and improve the model's predictive accuracy with more test data. When the accuracy has reached some satisfactory level, then the model should be deployed.
6. Dissemination and Rollout: It is important that the data mining solution is presented to the key stakeholders, and is deployed in the organization. Otherwise the project will be a waste of time and a setback for establishing and supporting a data-based decision-process culture in the organization. The model should eventually be embedded in the organization's business processes.
Conclusion

Data Mining is like diving into the rough material to discover a valuable finished nugget. While the technique is important, domain knowledge is also important to provide imaginative solutions that can then be tested with data mining. The business objective should be well understood, and should always be kept in mind to ensure that the results are beneficial to the sponsor of the exercise.
Review Questions

1. What is data mining? What are supervised and unsupervised learning techniques?
2. Describe the key steps in the data mining process. Why is it important to follow these processes?
3. What is a confusion matrix?
4. Why is data preparation so important and time consuming?
5. What are some of the most popular data mining techniques?
6. How is mining Big Data different from traditional data mining?
Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)

Creating a cluster on AWS and installing Hadoop from Cloudera

The objective of this tutorial is to set up a big data processing infrastructure using cloud computing, and Hadoop and Spark software.
Step 1: Creating Amazon EC2 Servers

1. Open https://aws.amazon.com/
2. Click on Services.
3. Click on EC2.

Once you click on EC2, you can see the EC2 dashboard. If you already have servers, you can see the number of running servers, their volumes, and other information.

4. Click on the Launch Instance button.
5. Click on AWS Marketplace.
6. Type Ubuntu in the search text box.
7. Click on the Select button.
8. Ubuntu is free, so you don't have to worry about the service price. Click on the Continue button.
9. Choose General purpose m1.large and click on Next: Configure Instance Details. (Do not choose the Micro Instance t1.micro; it is free but it will not be able to handle the installation.)
10. Click on Next: Add Storage.
11. Specify the volume size as 20 GB (the default of 8 GB will not be sufficient) and click on Next: Tag Instance.
12. Type the name cs488-master (this label helps identify which server is the master and which are the slaves) and click on Next: Security Group.
13. We need to open our server to the world, including most of the ports, because Cloudera needs many ports open. Specify the group name. Type: choose Custom TCP Rule. Port Range: 0-65500. Source: Anywhere. Then click on Review Instance.
14. A warning message appears, only because we opened our server to the world; ignore it for now. Click on the Launch button.
15. Type the key pair name and click on the Download Key Pair button (remember the location of the downloaded file; we need this file to log in to the server), then click on Launch Instances.
16. Now the master server is created.
Now we need four more servers to make the cluster. For that, we don't need to repeat the process four times; we just increase the number of instances to get the 4 servers.

Now we are going to launch 4 more servers, which will be the slaves.

Please repeat steps 4-9: go to AWS Marketplace, choose Ubuntu, and select the instance type (General purpose).

17. Type 4 in Number of Instances, which will create the 4 more servers for us.
18. Name the servers cs488-slave.
19. Select the previously created security group.
20. It is important that you choose the existing key pair for these servers too.

If everything goes well, you will see 5 instances, 5 volumes, 1 key pair, and 1 or 2 security groups.

We have now successfully created 5 servers.
Step 2: Connecting to the servers and installing the Cloudera distribution of Hadoop

First of all, take a note of all your server details — IP addresses and DNS addresses, for the master and the slaves:

Master Public DNS Address: ec2-54-200-210-141.us-west-2.compute.amazonaws.com
Master Private IP Address: 172.31.20.82
Slave 1 Private IP: 172.31.26.245
Slave 2 Private IP: 172.31.26.242
Slave 3 Private IP: 172.31.26.243
Slave 4 Private IP: 172.31.26.244

Once you have these recorded, you can connect to the server. If you are using Linux as your operating system, you can use the ssh command from the terminal to connect.
Connecting to the server (Windows)

1. Download the SSH software PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html). Also download PuTTYgen to convert the authentication file from .pem to .ppk.
2. Open PuTTYgen, load the authentication file, and click on Save Private Key.
3. Open PuTTY, type the master public DNS address in Host Name, then click on SSH in the left panel > click on Auth > select the recently converted authentication file (.ppk), and finally click on the Open button.
4. Now you will be able to connect to the server. Please type "ubuntu", the default username, to log in to the system.
5. Once you connect, type the following commands into the terminal:
6. sudo aptitude update
7. cd /usr/local/src/
8. sudo wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
9. sudo chmod u+x cloudera-manager-installer.bin
10. sudo ./cloudera-manager-installer.bin
11. There are 4 more steps where you click on Next and Yes for the license agreement. Once you finish the installation you need to restart the service:
12. sudo service cloudera-scm-server restart
You are now able to connect to Cloudera from your browser. The address will be http://<YOUR PUBLIC DNS SERVER>:7180, e.g. http://ec2-54-200-210-141.us-west-2.compute.amazonaws.com:7180, and the default username and password to log in to the system are admin/admin.

Once you restart the server it will open the login screen again. The same username and password (admin/admin) are used to log in to the system.
13. Click on Launch the Classic Wizard.
14. Click on Continue.
15. Provide all the private IP addresses of the master and slave computers and click on the Search button.
16. Click on the Continue button.
17. Choose None for SOLR1… and None for IMPAL…, and click on the Continue button.
18. Click on Another User >> type "ubuntu" and select "All hosts accept same private key" >> upload the authentication file (.pem) and click on the Continue button.
19. Now Cloudera will install the software on each of our servers.
20. Once the installation is complete, click on the Continue button.
21. Once it reaches 100%, click on the Continue button. Do not disconnect the internet or shut down the machine; if the process does not complete, we will need to redo the whole process. Click on the Continue button.
22. Click on Continue.
23. Choose Core Hadoop and click on the Inspect Role Assignments button.
24. For your master IP, it should have only NameNode selected, and DataNode unchecked. This is important for distinguishing the master and slave servers.
25. Now Cloudera will install all the services; for future use you can record the username and password of each service. Click on Test Connection.
26. Click on Continue.
27. Now all the installation is complete; you now have 1 master node and 4 data nodes.
28. You should see the dashboard.
Step 3: Word Count using MapReduce

29. Now log in to the master server from PuTTY.
30. Run the following commands:
31. cd ~/
32. mkdir code-and-data
33. cd code-and-data
34. sudo wget https://s3.amazonaws.com/learn-hadoop/hadoop-infiniteskills-richmorrow-class.tgz
35. sudo tar -xvzf hadoop-infiniteskills-richmorrow-class.tgz
36. cd data
37. sudo -u hdfs hadoop fs -mkdir /user/ubuntu
38. sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu
39. hadoop fs -put shakespeare shakespeare-hdfs
40. hadoop version
41. hadoop fs -ls shakespeare-hdfs
42. sudo hadoop jar /opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/oozie/examples/lib/hadoop-examples.jar wordcount shakespeare-hdfs wordcount-output
43. hadoop jar /opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/oozie/examples/lib/hadoop-examples.jar sleep -m 10 -r 10 -mt 20000 -rt 20000
Appendix 2: Spark Installation and Tutorial

This tutorial will help install Spark and get it running on a standalone machine. It will then help develop a simple analytical application using the R language.
Step 1: Verifying Java Installation

Java installation is one of the mandatory prerequisites for installing Spark. Try the following command to verify the Java version:

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you do not have Java installed on your system, install Java before proceeding to the next step.
Step 2: Verifying Scala Installation

Verify the Scala installation using the following command:

$ scala -version

If Scala is already installed on your system, you get to see the following response:

Scala code runner version 2.11.6 — Copyright 2002-2013, LAMP/EPFL

In case you don't have Scala installed on your system, then proceed to the next step for Scala installation.
Step 3: Downloading Scala

Download the latest version of Scala from the Scala download page. For this tutorial, we are using version scala-2.11.6. After downloading, you will find the Scala tar file in the download folder.

Step 4: Installing Scala

Follow the steps given below for installing Scala.

Extract the Scala tar file:

$ tar xvf scala-2.11.6.tgz

Move the Scala software files to the respective directory (/usr/local/scala):

$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit

Set PATH for Scala:

$ export PATH=$PATH:/usr/local/scala/bin

After installation, it is better to verify it. Use the following command to verify the Scala installation:

$ scala -version

If Scala is installed correctly, you get to see the following response:

Scala code runner version 2.11.6 — Copyright 2002-2013, LAMP/EPFL
Step 5: Downloading Spark

Download the latest version of Spark. For this tutorial, we are using version spark-1.3.1-bin-hadoop2.6. After downloading it, you will find the Spark tar file in the download folder.

Step 6: Installing Spark

Follow the steps given below for installing Spark.

Extract the Spark tar file:

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Move the Spark software files to the respective directory (/usr/local/spark):

$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit

Set up the environment for Spark by adding the following line to the ~/.bashrc file. It adds the location of the Spark software files to the PATH variable:

export PATH=$PATH:/usr/local/spark/bin

Use the following command to source the ~/.bashrc file:

$ source ~/.bashrc
Step 7: Verifying the Spark Installation

Write the following command to open the Spark shell:

$ spark-shell

If Spark is installed successfully then you will see output similar to the following:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to Spark version 1.4.0
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
You might encounter a "file specified not found" error when you are first installing Spark standalone. To fix this you have to set up your JAVA_HOME:

Step 1: Start -> Run -> Command Prompt (cmd).
Step 2: Determine where your JDK is located; by default it is in C:\Program Files.
Step 3: Select the JDK to use (in my case, JDK 8). Copy the directory to your clipboard, set JAVA_HOME to it in the command prompt, and press Enter.
Step 4: Add it to the general PATH, and press Enter.

Now go to your Spark folder and run bin\spark-shell.

You have installed Spark; let's try to use it.
Step 8: Application: Word Count in Scala

Now we will do an example of word count in the Scala shell:

val textFile = sc.textFile("hdfs://…")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://…")
NOTE: If you are working on a standalone Spark installation, the counts.saveAsTextFile("hdfs://…") command will give you a NullPointerException error.

Solution: counts.coalesce(1).saveAsTextFile()
For implementing a word cloud we could use R in our Spark console. However, if you click on SparkR straight away you will get an error. To fix this:

Step 1: Set up the environment variables. In the PATH variable, add your Spark paths. I added: ;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\sbin;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\bin

Step 2: Install the R software and RStudio. Then add the path of the R software to the PATH variable. I added this to my existing path: ;C:\Program Files\R\R-3.2.2\bin\x64\ (remember, each path that you add must be separated by a semicolon, with no spaces).

Step 3: Run the command prompt as an administrator.

Step 4: Now execute the command "SparkR" from the command prompt. If successful, you should see the message "Spark context is available…". If your path is not set correctly, you can alternatively navigate to the location where you have downloaded SparkR — in my case, C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\bin — and execute the "SparkR" command.
Step 5: Configuration inside RStudio to connect to Spark.

Execute the below three commands in RStudio every time:

# Here we are setting up the SPARK_HOME environment variable
Sys.setenv(SPARK_HOME = "C:/spark-1.5.1-bin-hadoop2.6/spark-1.5.1-bin-hadoop2.6")
# Set the library path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# Loading the SparkR library
library(SparkR)

If you see the SparkR startup message, then you are all set to start working with SparkR.
Now let's start coding in R. First load the text-mining packages, then read the text files into a corpus:

library(tm)
library(wordcloud)
lords <- Corpus(DirSource("temp/"))

To see what's in that corpus, type the command:

inspect(lords)

This should print out the contents on the main screen. Next, we need to clean it up. Execute the following in the command line, one line at a time:

lords <- tm_map(lords, stripWhitespace)
lords <- tm_map(lords, tolower)
lords <- tm_map(lords, removeWords, stopwords("english"))
lords <- tm_map(lords, stemDocument)
The tm_map function comes with the tm package. The various commands are self-explanatory: strip unnecessary whitespace, convert everything to lowercase (otherwise the word cloud might highlight capitalised words separately), remove English common words like 'the' (so-called 'stopwords'), and carry out text stemming for the final tidy-up. Depending on what you want to achieve, you could also explicitly remove numbers and punctuation with the removeNumbers and removePunctuation arguments.
It is possible that you may get error messages whilst executing some of the commands, e.g. missing packages. If so, install these as outlined above in Step 4, and repeat.

If all is well then you should now be ready to create your first word cloud. Try this:

wordcloud(lords, scale = c(5, 0.5), max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
Additional Resources

Here are some other books, papers, videos and other resources, for a deeper dive into the topics covered in this book.

1. Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
2. McKinsey Global Institute Report (2011). Big data: The next frontier for innovation, competition, and productivity. Mckinsey.com
3. Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail but Some Don't. Penguin Press.
4. Matei Zaharia et al. (2010). "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." University of California, Berkeley.
5. Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
Websites:

6. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
7. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
8. Hadoop API site: http://hadoop.apache.org/docs/current/api/
9. Apache Spark: http://spark.apache.org/docs/latest/
10. https://www.biostat.wisc.edu/~kbroman/Rintro/Rwinpack.html
11. http://robjhyndman.com/hyndsight/building-r-packages-for-windows/
12. https://stevemosher.wordpress.com/ten-steps-to-building-an-r-package-under-windows/
13. http://www.inside-r.org/packages/cran/wordcloud/docs/wordcloud
14. https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
15. https://intellipaat.com/tutorial/spark-tutorial/
16. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces50
17. https://en.wikipedia.org/wiki/NoSQL
18. http://www.planetcassandra.org/what-is-apache-cassandra/
19. http://www.datastax.com/nosql
20. https://www.sitepen.com/blog/2010/05/11/nosql-architecture/
21. http://nosql-database.org/
22. http://webpages.uncc.edu/xwu/5160/nosqldbs.pdf
Video Resources

23. Doug Cutting on 'Hadoop at 10': https://www.youtube.com/watch?v=yDZRDDu3CJo
24. Status of the Apache community: https://www.youtube.com/watch?v=sOZnf8Nn3Fo
25. Spark 2.0 updates, showing a nice demo across R, Scala and SQL using tweets and clustering: https://www.youtube.com/watch?v=9xSz0ppBtFg
26. https://www.youtube.com/watch?v=VwiGHUKAHWM
27. https://www.youtube.com/watch?v=L5QWO8QBG5c
28. https://www.youtube.com/watch?v=KvQto_b3sqw
29. https://www.youtube.com/watch?v=YW28qItH_tA
About the Author

Dr. Anil Maheshwari is a Professor of Computer Science and Information Systems, and the Director of the Center for Data Analytics, at Maharishi University of Management. He teaches courses in data analytics, and helps organizations extract deep insights from their data. He worked in a variety of leadership roles at IBM in Austin, TX, and has also worked at many other companies, including startups.

He has taught at the University of Cincinnati, City University of New York, University of Illinois, and others. He earned an Electrical Engineering degree from the Indian Institute of Technology in Delhi, an MBA from the Indian Institute of Management in Ahmedabad, and a Ph.D. from Case Western Reserve University. He is a practitioner of the Transcendental Meditation technique.

He is the author of the #1 bestseller Data Analytics Made Accessible.

He blogs interesting stuff on IT and Enlightenment at anilmah.com.

Instructors can reach him for course materials at akm2030@gmail.com. Speaking engagements are welcome.