© Hortonworks Inc. 2011–2016. All Rights Reserved
Spark SQL + Pig Latin: Combine Query Language and Data Flow Language for Data Science
Jeff Zhang ([email protected]), May 16, 2017
Who am I
• ASF Member, working in the ASF for almost 8 years
• Committer of Apache Tez, Pig & Zeppelin
• Works at Hortonworks
Data Science
Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
• Describe what happens
• Explain what happens
• Predict what would happen
Data Science
[Figure: pipeline Collect Data → Data Munging → Data Analysis → Insight → Product, with stages labeled "online" and "offline"]
Data Munging
• Collect and transform server log data
  – User agent normalization
  – Robot detection
  – Sessionize
• Move data from database to HDFS
• Collect and transform social media data
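
As a rough illustration of the "Sessionize" step listed above, the following Python sketch groups log events per user and starts a new session whenever the gap between two consecutive events exceeds a timeout. The 30-minute timeout, the `(user, timestamp)` tuple layout, and the name `sessionize` are assumptions made for this example, not details from the talk.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed cutoff, not from the talk


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group (user, unix_timestamp) events into per-user sessions.

    Returns {user: [[ts, ...], [ts, ...], ...]} with timestamps sorted.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > timeout:
                user_sessions.append([cur])    # gap too large: new session
            else:
                user_sessions[-1].append(cur)  # same session continues
        sessions[user] = user_sessions
    return sessions


# Hypothetical sample events: (user, seconds since some start time)
events = [("alice", 0), ("alice", 600), ("alice", 4000), ("bob", 100)]
result = sessionize(events)
```

In a real pipeline the same logic would typically run per user key inside the execution engine rather than in a single Python process.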
Data Munging
[Figure: before data munging vs. after data munging]
Data Analysis
• Combine different sources of data and apply statistics and BI tools to gain insight from the data
  – Web traffic metrics
  – User segmentation analysis
  – A/B test
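
To make the "A/B test" item above concrete, here is a minimal Python sketch of a two-proportion z-test on conversion counts. The function name, the sample counts, and the choice of this particular test are assumptions for illustration; the talk does not specify how the A/B analysis is performed.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: variant A converts 120/1000, variant B converts 150/1000
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
```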
Data Munging vs Data Analysis

              Data Munging                       Data Analysis
Data Source   Messy, structured/unstructured,    Clean, normalized, structured,
              unorganized                        organized
Stability     Regular, stable                    Ad hoc
Tools         Python, Spark, Hadoop, etc.        R, Python, SQL, etc.

Do you have to be a full-stack big data engineer to do data science?
What if you are a data analyst without much programming skill?
Data Science Infrastructure
What is Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads.
What is Apache Pig
• Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
  – Ease of programming
  – Optimization opportunities
  – Extensibility
Word Count
[Figure: Pig Latin data flow Load → ForEach → Group → ForEach → Order → Store]
Using SQL?
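
The word count flow above can be sketched in plain Python, one statement per stage, followed by a SQL version for comparison. This is only an analogy for the Pig Latin operators: the function name `word_count`, the sample input, and the use of SQLite as the SQL engine are assumptions for this example.

```python
import sqlite3


def word_count(lines):
    # Load: `lines` stands in for the loaded input relation
    # ForEach: split every line into words and flatten
    words = [w for line in lines for w in line.split()]
    # Group: collect identical words together
    groups = {}
    for w in words:
        groups.setdefault(w, []).append(w)
    # ForEach: emit (word, count) for each group
    counts = [(w, len(g)) for w, g in groups.items()]
    # Order: by descending count, then alphabetically
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    # Store: here we simply return the result
    return counts


lines = ["a b a", "b a c"]
flow_result = word_count(lines)

# The same question in SQL needs the data tokenized into a table first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [(w,) for line in lines for w in line.split()],
)
sql_result = conn.execute(
    "SELECT word, COUNT(*) AS cnt FROM words "
    "GROUP BY word ORDER BY cnt DESC, word"
).fetchall()
```

Both versions produce the same counts; the difference is that the data flow version expresses the tokenizing step as part of the pipeline, while the SQL version assumes the words are already in a table.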
Pig Latin vs SQL

                SQL                              Pig Latin
Language Type   Query language                   Data flow language
                • de facto standard              • lazy evaluation
                                                 • supports pipeline split
Data Source     Structured data                  Structured/unstructured
Integration     Integrated with most BI tools    Very few BI tools integrate with Pig Latin

Conclusion
• Pig Latin for data munging
• SQL for data analysis
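
The two data flow properties named in the comparison above, lazy evaluation and pipeline split, can be illustrated with a small Python sketch. The `Relation` class and its method names are hypothetical and only mimic the idea; they are not Pig's API or implementation.

```python
class Relation:
    """A deferred pipeline: steps are recorded and only run on store()."""

    def __init__(self, source, steps=()):
        self._source = source          # a re-iterable input, e.g. a list
        self._steps = tuple(steps)

    def foreach(self, fn):
        # Record a per-row transformation without running it
        return Relation(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Record a row filter without running it
        return Relation(self._source, self._steps + (("filter", pred),))

    def store(self):
        # Execution happens only here
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []


def parse(text):
    calls.append(text)  # track when parsing actually happens
    return int(text)


numbers = Relation(["1", "2", "3", "4"]).foreach(parse)
evens = numbers.filter(lambda n: n % 2 == 0)  # split: branch 1
odds = numbers.filter(lambda n: n % 2 == 1)   # split: branch 2

nothing_ran_yet = calls == []  # lazy: no parsing before store()
even_rows = evens.store()
odd_rows = odds.store()
```

This sketch recomputes the shared `foreach` step once per branch; a real engine may be able to share the common prefix between the branches of a split.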
Integrate Spark into Pig
[Figure: Pig Script → Logical Plan → Physical Plan → Execution Plan → Execution Engine]
Combine Spark SQL + Pig Latin
[Figure: Data Munging (Spark Scala API, Spark Python API, Spark R API, Pig Latin) → Spark DataFrame Table → Data Analysis (Spark SQL)]
Pig Latin + Spark SQL
[Figure: Pig Latin (Load, Store) for Data Munging → Spark DataFrame Table → Spark SQL for Data Analysis]
Spark Table (bank)
[Figure: Pig Latin and SQL examples over the bank table]
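
The pattern on this slide, where a munging step stores its result as a named table and a SQL step then queries that table, can be sketched in Python. SQLite is used here purely as a stand-in for the shared Spark table, and the column names (`age`, `job`, `balance`), the `;` delimiter, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw input, including one malformed line
raw_rows = [
    "30;unemployed;1787",
    "33;services;4789",
    "bad line",
    "35;management;1350",
    "30;management;1476",
]


def munge(lines):
    """Data munging: parse each line, drop malformed rows, convert types."""
    for line in lines:
        parts = line.split(";")
        if len(parts) != 3:
            continue  # skip rows that do not match the expected layout
        age, job, balance = parts
        yield int(age), job, int(balance)


# "Store" the cleaned relation under the table name `bank`
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (age INTEGER, job TEXT, balance INTEGER)")
conn.executemany("INSERT INTO bank VALUES (?, ?, ?)", munge(raw_rows))

# Data analysis: query the shared table with SQL
by_age = conn.execute(
    "SELECT age, COUNT(*) FROM bank GROUP BY age ORDER BY age"
).fetchall()
```

The point of the design is that the analyst only needs the table name and SQL; the munging logic stays on the other side of the table boundary.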
Where to run Pig Latin & Spark SQL (Zeppelin)
Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
Pig Latin + Spark SQL in Zeppelin
[Figure: Zeppelin Server (JVM) connected to the Pig Interpreter Group (JVM; Pig Latin, Spark SQL) and the Spark Interpreter Group (JVM; Scala, Python, R)]
Data Science Infrastructure (Recap)
Demo
Conclusion
• Leverage the power of both query language and data flow language
• Use Spark as the unified execution engine
• Share data between data munging & data analysis
• Use Zeppelin as the unified data science platform
Summary
• Data munging & data analysis
• Use Pig Latin for data munging, use SQL for data analysis
• Run under the Spark engine
• Use Zeppelin as the unified data science platform
Current Status & What's Next
• Status
  – PIG-5080 (Support store alias as Spark table)
  – ZEPPELIN-2232 (Support Spark SQL for Pig interpreter)
• Next
  – Integrate Spark MLlib in Pig
  – Use the DataFrame API instead of the RDD API to integrate Spark with Pig
  – Support integrating Pig with other Spark APIs, like R and Python
Q&A
Thank You