Spark tutorial PyCon 2016 part 2
![Page 1: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/1.jpg)
David Taieb
STSM - IBM Cloud Data Services
Developer advocate
david_taieb@us.ibm.com
HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON
Part 2: Analyzing car Twitter data with Spark and DashDb
PyCon 2016, Portland
![Page 2: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/2.jpg)
© 2016 IBM Corporation
Agenda
• Provision the application services on Bluemix: Spark, DashDb, IBM Insight for Twitter
• Load car-related tweets from IBM Insight for Twitter into the DashDb warehouse
• Run analytics in a Python Notebook and discover new insights
![Page 3: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/3.jpg)
Sign up for Bluemix
• Access the IBM Bluemix website at https://console.ng.bluemix.net
• Click on Get Started for Free
• Complete the form and click Create account
• Look for the confirmation email and click on the confirm your account link
Create new Space
![Page 4: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/4.jpg)
Create a new space on Bluemix
In preparation for running the project, we create a new space on Bluemix
Create a Spark Instance
Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse
![Page 5: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/5.jpg)
Create a Spark Instance
Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse
![Page 6: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/6.jpg)
Create New Spark Instance
Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse
![Page 7: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/7.jpg)
Acquiring the data
• In the next section, we show how to acquire the Twitter data and store it into DashDb.
• We use the Twitter loading connector available as a menu in the DashDb console
Create a DashDb instance
![Page 8: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/8.jpg)
Create an instance of IBM Dash DB on Bluemix
Create an IBM Insight for Twitter instance
![Page 9: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/9.jpg)
Create an instance of IBM Insight for Twitter on Bluemix
![Page 10: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/10.jpg)
Agenda
• Provision the application services on Bluemix: Spark, DashDb, IBM Insight for Twitter
• Load car-related tweets from IBM Insight for Twitter into the DashDb warehouse
• Run analytics in a Python Notebook and discover new insights
![Page 11: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/11.jpg)
Launch DashDb Console
Click on the DashDb Service tile to open this dashboard, then click on the Launch button
Load Twitter Data
![Page 12: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/12.jpg)
Load Twitter Data
The DashDb Console offers multiple data connectors, including a Twitter connector that automatically connects to IBM Insight for Twitter
Connect to Twitter
![Page 13: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/13.jpg)
Connect to Twitter
Reusing the Twitter service instance created in the previous step
![Page 14: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/14.jpg)
Select the data to be loaded
Twitter query being used:
posted:2015-01-01,2015-12-31 followers_count:2000 listed_count:1000 (volkswagen OR vw OR toyota OR daimler OR mercedes OR bmw OR gm OR "general motors" OR tesla)
Specify Twitter query
Provide preview count of output data
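To make the structure of the query easier to see, here is a minimal sketch that assembles the same string from its parts. The helper name `build_query` is hypothetical, and the spacing between terms is reconstructed from the slide, so treat it as an illustration rather than the connector's own syntax reference.

```python
# Sketch only: build_query is a hypothetical helper, not part of the connector.
# The values passed in below are the ones shown on the slide.

def build_query(start, end, min_followers, min_listed, keywords):
    """Return a query string in the form shown on the slide."""
    # Quote multi-word keywords so they are matched as a phrase.
    terms = ['"%s"' % k if " " in k else k for k in keywords]
    return "posted:%s,%s followers_count:%d listed_count:%d (%s)" % (
        start, end, min_followers, min_listed, " OR ".join(terms))


QUERY = build_query(
    "2015-01-01", "2015-12-31", 2000, 1000,
    ["volkswagen", "vw", "toyota", "daimler", "mercedes",
     "bmw", "gm", "general motors", "tesla"])
print(QUERY)
```

The resulting string can be pasted into the query field of the connector wizard.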
![Page 15: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/15.jpg)
Select the DashDb Table
Name of the schema under which the tables will be created
Prefix (namespace) for the created tables
List of tables that will be created
![Page 16: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/16.jpg)
Loading data monitoring page
Warning: loading time may vary based on bandwidth. It may take between 15 minutes and 1 hour
![Page 17: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/17.jpg)
Complete the load: Statistics
![Page 18: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/18.jpg)
Complete the load: explore the data
![Page 19: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/19.jpg)
Get connection information
Copy the user ID, password and JDBC URL; you'll need this information later
![Page 20: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/20.jpg)
Agenda
• Provision the application services on Bluemix: Spark, DashDb, IBM Insight for Twitter
• Load car-related tweets from IBM Insight for Twitter into the DashDb warehouse
• Run analytics in a Python Notebook and discover new insights
![Page 21: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/21.jpg)
Create new Notebook from URL
Import required Python packages
• Create notebook from URL
• Use https://github.com/ibm-cds-labs/spark.samples/raw/master/notebook/DashDB%20Twitter%20Car%202015%20Python%20Notebook.ipynb
![Page 22: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/22.jpg)
Step 1: Import Python Packages
• Install the nltk package (Natural Language Toolkit)
• We will use it to filter stop words later in the tutorial
![Page 23: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/23.jpg)
Import Python modules and set up the SQLContext
![Page 24: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/24.jpg)
Step 2: Define global Variables
Set up various data structures we'll need throughout the Notebook
This is the SCHEMA and PREFIX you used in Step 3 of the Twitter connector wizard
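As a rough illustration of how these two globals might be used, the sketch below derives fully qualified table names from them. The placeholder values, the helper `table_name`, and the assumption that the connector names its tables `<PREFIX>_TWEETS` and `<PREFIX>_SENTIMENTS` are not taken from the slides; check the table list shown by the wizard for the real names.

```python
# Placeholders: replace with the schema and prefix chosen in the connector wizard.
SCHEMA = "DASH1234"
PREFIX = "CARS2015"


def table_name(suffix, schema=SCHEMA, prefix=PREFIX):
    """Build a qualified name such as SCHEMA.PREFIX_TWEETS (naming is an assumption)."""
    return "%s.%s_%s" % (schema, prefix, suffix)


TWEETS_TABLE = table_name("TWEETS")
SENTIMENTS_TABLE = table_name("SENTIMENTS")
print(TWEETS_TABLE, SENTIMENTS_TABLE)
```

Keeping the names in one place means a different schema or prefix only has to be changed once.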
![Page 25: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/25.jpg)
Set up some global helper functions
JavaScript Google map visualization
Misc helper that fills in missing dates
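The notebook's own helper is not reproduced in this transcript. A minimal sketch of what a "fill in missing dates" helper could look like is shown below; the function name and the sample counts are made up for illustration.

```python
from datetime import date, timedelta


def fill_missing_dates(counts, start, end):
    """Return (day, count) pairs for every day from start to end inclusive.

    counts maps a date to a tweet count; days without an entry get 0, so a
    time-series plot does not silently skip them.
    """
    filled = []
    day = start
    while day <= end:
        filled.append((day, counts.get(day, 0)))
        day += timedelta(days=1)
    return filled


# Illustrative input only: two days with data, two days missing in between.
sparse = {date(2015, 9, 18): 120, date(2015, 9, 21): 340}
series = fill_missing_dates(sparse, date(2015, 9, 18), date(2015, 9, 21))
for day, count in series:
    print(day.isoformat(), count)
```

Filling gaps with zeros keeps the x-axis evenly spaced when the series is charted later.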
![Page 26: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/26.jpg)
Step 3: Acquire the data from DashDB
User ID and password from the Connection page
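One possible way to read a DashDb table into a Spark DataFrame is through Spark's JDBC data source, sketched below. The host, credentials and table name are placeholders, the helper names are hypothetical, and the DB2 driver class is an assumption about what the Spark service provides; the notebook may use a different loading mechanism.

```python
def jdbc_options(jdbc_url, user, password, table):
    """Collect the options for Spark's JDBC data source in one dictionary."""
    return {
        "url": jdbc_url,
        "dbtable": table,
        "user": user,
        "password": password,
        # Assumption: the DB2 JDBC driver is available on the Spark classpath.
        "driver": "com.ibm.db2.jcc.DB2Driver",
    }


def load_table(sqlContext, options):
    """Read one table into a Spark DataFrame (needs a live SQLContext)."""
    return sqlContext.read.format("jdbc").options(**options).load()


# Placeholders: use the values copied from the Connection page.
OPTIONS = jdbc_options(
    "jdbc:db2://HOST:50000/BLUDB", "USERID", "PASSWORD", "SCHEMA.PREFIX_TWEETS")
print(sorted(OPTIONS))
```

Inside the notebook, `load_table(sqlContext, OPTIONS)` would then return the tweets as a DataFrame.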
![Page 27: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/27.jpg)
Join the Tweets and Sentiment Table
In this step, we want to add a sentiment score for each tweet record:
• Join the Tweets and Sentiments tables
• Encode the sentiment into a number, e.g. POSITIVE=+1, NEGATIVE=-1, AMBIVALENT=0
• Create an average for each sentiment associated with a tweet
• %time instruments the code to provide profile execution stats.
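The logic of this step can be sketched in plain Python, without Spark, so it can be run anywhere. This is a stand-in for the notebook's DataFrame join, not its actual code: the function name and the sample rows are invented, and giving a score of 0 to unknown labels and to tweets with no sentiment row is an assumption.

```python
from collections import defaultdict

# Encoding taken from the slide.
SCORES = {"POSITIVE": 1, "NEGATIVE": -1, "AMBIVALENT": 0}


def average_sentiment(tweets, sentiments):
    """Left-join tweets with sentiments and average the encoded scores.

    tweets:     list of (tweet_id, text)
    sentiments: list of (tweet_id, label), possibly several per tweet
    """
    by_tweet = defaultdict(list)
    for tweet_id, label in sentiments:
        # Assumption: labels outside the known set count as 0.
        by_tweet[tweet_id].append(SCORES.get(label, 0))

    result = []
    for tweet_id, text in tweets:
        scores = by_tweet.get(tweet_id)
        # Assumption: tweets without any sentiment row score 0.0.
        avg = sum(scores) / float(len(scores)) if scores else 0.0
        result.append((tweet_id, text, avg))
    return result


# Illustrative rows only.
tweets = [(1, "a"), (2, "b"), (3, "c")]
sentiments = [(1, "POSITIVE"), (1, "NEGATIVE"), (2, "POSITIVE")]
scored = average_sentiment(tweets, sentiments)
print(scored)
```

In a notebook cell, prefixing the call with `%time` reports how long it took to run.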
![Page 28: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/28.jpg)
Step 4: Transform the data
Create a clean working DataFrame that will be easier to use in our analytics
![Page 29: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/29.jpg)
Step 5: Geographic distribution of tweets
Group by countries and aggregate the tweet counts
Convert the Spark SQL DataFrame to a Pandas data structure for visualization
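A hedged sketch of the aggregation described here: the first function shows how it might be written against a Spark DataFrame, and the second computes the same counts in plain Python so the result can be checked without a cluster. The column name `USER_COUNTRY` and both function names are assumptions, and the sample rows are invented.

```python
from collections import Counter


def tweets_by_country_spark(df, column="USER_COUNTRY"):
    """Spark version: group, count, sort descending, convert to Pandas.

    df is expected to be a Spark DataFrame; toPandas() collects the (small)
    aggregated result onto the driver.
    """
    return df.groupBy(column).count().orderBy("count", ascending=False).toPandas()


def tweets_by_country(rows, column="USER_COUNTRY"):
    """Plain-Python reference: rows are dicts; returns [(country, count)] sorted descending."""
    return Counter(row[column] for row in rows).most_common()


# Illustrative rows only.
rows = [{"USER_COUNTRY": c} for c in ["US", "DE", "US", "FR", "DE", "US"]]
print(tweets_by_country(rows))
```

Only the aggregated result is converted to Pandas, which keeps the amount of data moved to the driver small.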
![Page 30: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/30.jpg)
Bar chart visualization of Tweet distribution by Geo
![Page 31: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/31.jpg)
Google map visualization of tweet distribution by Geos
Call the GeoChart helper that sets up the JavaScript code
![Page 32: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/32.jpg)
Clean up memory before next analytics
Resources, including memory on the Spark Driver machine, are not infinite. It is good practice to clean up when data is not needed anymore
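One way this clean-up could look is sketched below. The helper `release` is hypothetical and is not the notebook's code; it relies only on the `unpersist()` method of Spark DataFrames and on Python's garbage collector.

```python
import gc


def release(df):
    """Drop a cached DataFrame and ask Python to reclaim memory.

    Returns True if the object had an unpersist() method that was called.
    """
    unpersisted = False
    if hasattr(df, "unpersist"):
        df.unpersist()  # removes the DataFrame from Spark's cache, if it was cached
        unpersisted = True
    gc.collect()  # reclaim driver-side objects that are no longer referenced
    return unpersisted
```

After calling `release(some_df)`, setting the variable to `None` (or using `del`) removes the last driver-side reference to it.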
![Page 33: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/33.jpg)
Step 6: Analyzing tweets sentiment
Group by sentiments and aggregate the tweet counts
Convert the Spark SQL DataFrame to a Pandas data structure for visualization
![Page 34: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/34.jpg)
Sentiment visualization
Use a Matplotlib pie chart
![Page 35: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/35.jpg)
Step 7: Analyze Tweet timeline
Convert the Spark SQL DataFrame to a Pandas data structure for visualization
Group by posting time and sentiment tuples; aggregate the tweet counts
Group by posting time and sentiment tuples; aggregate the sum of the tweet counts
![Page 36: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/36.jpg)
Prepare the timeline data structures
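The slide does not show the code, so the following is only a sketch of one plausible shape for these structures: a sorted list of dates plus one list of counts per sentiment, aligned with those dates. The function name and the sample rows are invented.

```python
from collections import defaultdict


def build_timeline(rows):
    """Turn (posting_date, sentiment, count) rows into plottable series.

    Returns (dates, series) where dates is sorted and series maps each
    sentiment to a list of counts aligned with dates (0 where missing).
    """
    by_sentiment = defaultdict(dict)
    dates = set()
    for day, sentiment, count in rows:
        per_day = by_sentiment[sentiment]
        per_day[day] = per_day.get(day, 0) + count
        dates.add(day)
    ordered = sorted(dates)
    series = {s: [c.get(d, 0) for d in ordered] for s, c in by_sentiment.items()}
    return ordered, series


# Illustrative rows only; ISO date strings sort chronologically.
ROWS = [
    ("2015-09-01", "POSITIVE", 5),
    ("2015-09-02", "NEGATIVE", 3),
    ("2015-09-01", "NEGATIVE", 2),
]
DATES, SERIES = build_timeline(ROWS)
print(DATES, SERIES)
```

Aligned lists like these can be handed directly to a plotting call, one line per sentiment.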
![Page 37: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/37.jpg)
Time series visualization for all tweets
![Page 38: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/38.jpg)
Deep dive into car manufacturers
Create a new DataFrame that enriches tweets with extra metadata:
- Boolean for each car manufacturer
- Boolean for electric car
- Boolean for self-driving car
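The enrichment can be illustrated with simple keyword matching on the tweet text, as in the sketch below. The manufacturer keywords mirror the Twitter query used earlier; grouping Daimler with Mercedes, the electric and self-driving keyword lists, and the function names are assumptions, and the notebook may use different rules.

```python
import re

# Keywords per manufacturer, taken from the Twitter query; grouping is an assumption.
MANUFACTURERS = {
    "VW": ["volkswagen", "vw"],
    "TOYOTA": ["toyota"],
    "DAIMLER": ["daimler", "mercedes"],
    "BMW": ["bmw"],
    "GM": ["gm", "general motors"],
    "TESLA": ["tesla"],
}
# Assumed keyword lists for the two topic flags.
TOPICS = {
    "ELECTRIC": ["electric", "ev"],
    "SELF_DRIVING": ["self driving", "self-driving", "autonomous"],
}


def _mentions(text, keywords):
    """True if any keyword appears in text as a whole word, ignoring case."""
    return any(
        re.search(r"\b%s\b" % re.escape(k), text, re.IGNORECASE) for k in keywords)


def enrich(text):
    """Return a dict of boolean flags for one tweet text."""
    flags = {name: _mentions(text, kws) for name, kws in MANUFACTURERS.items()}
    flags.update({name: _mentions(text, kws) for name, kws in TOPICS.items()})
    return flags


# Illustrative text only.
FLAGS = enrich("VW admits cheating; Tesla unveils self-driving electric car")
print(FLAGS)
```

In Spark, a function like `enrich` could be applied per row to produce the extra boolean columns.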
![Page 39: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/39.jpg)
Re-analyze the tweet timeline for each car manufacturer
Create a new DataFrame for each car manufacturer
Aggregate the tweet counts, ordered by posting time
![Page 40: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/40.jpg)
Timeline series visualization
Notice the peak of tweets for VW between September and October 2015
![Page 41: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/41.jpg)
Explaining the peak of tweets for VW between September and October 2015
Filter for all VW tweets between Sept and Oct 2015
Pie chart visualization of the top 10 words being used in these tweets
Create a map count of all non-stop words used in the tweets
Use the NLTK stopwords module to filter out stop words
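The word-count step can be sketched as follows. To stay runnable without downloading a corpus, the sketch uses a tiny hard-coded stop-word set as a stand-in; in the notebook the list comes from NLTK, for example `nltk.corpus.stopwords.words("english")` after `nltk.download("stopwords")`. The function name and the sample tweets are invented.

```python
import re
from collections import Counter

# Stand-in for the NLTK English stop-word list (assumption, kept short on purpose).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "over"}


def top_words(texts, stopwords=STOPWORDS, n=10):
    """Count non-stop words across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        # Lower-case and split into runs of letters/apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)


# Illustrative tweets only.
TEXTS = [
    "The VW emissions scandal",
    "VW cheated on emissions tests",
    "Scandal over emissions",
]
print(top_words(TEXTS))
```

The `(word, count)` pairs returned by `top_words` are in the shape a pie-chart call expects: one label and one value per slice.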
![Page 42: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/42.jpg)
Peak explained
We can clearly see from the list of most used words that the peak corresponds to the VW scandal around fraudulent emissions testing
![Page 43: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/43.jpg)
Follow the notebook for many more interesting analytics
![Page 44: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/44.jpg)
Resources
• https://developer.ibm.com/clouddataservices/
• https://github.com/ibm-cds-labs/simple-data-pipe
• https://github.com/ibm-cds-labs/pipes-connector-flightstats
• http://spark.apache.org/docs/latest/mllib-guide.html
• https://console.ng.bluemix.net/data/analytics/
![Page 45: Spark tutorial py con 2016 part 2](https://reader034.fdocuments.us/reader034/viewer/2022042619/58e7b8371a28ab65578b5559/html5/thumbnails/45.jpg)
Thank You