Transcript of DS 3001: Foundations of Data Science (Spark) (kmlee/ds3001/spark_intro.pdf, 2018)
DS 3001: Foundations of Data Science (Spark)
Adapted from Cloudera and Stanford Univ
2
The Plan

• Getting started with Spark
• RDDs
• Commonly useful operations
• Using Python
• Using Java
• Using Scala
• Help session
3
Go download Spark now. (Version 2.2.1 for Hadoop 2.7)
https://spark.apache.org/downloads.html
4
Three Deployment Options

[Diagram: three configurations]
• Local: the driver runs the tasks itself, all in a single process.
• Stand-alone cluster: the driver talks to Spark's own cluster manager, which assigns tasks to executors running on worker machines.
• Managed cluster: the same layout, but an external cluster manager such as YARN schedules the executors.
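Which of the three modes you get is controlled by the master URL passed to Spark. A minimal sketch (the master strings are standard Spark values; the application name is illustrative):

```python
from pyspark import SparkConf, SparkContext

# Local mode: driver and executors share one process; "local[*]" uses all cores.
conf = SparkConf().setAppName("ds3001-demo").setMaster("local[*]")

# For a stand-alone cluster you would instead point at its master, e.g.
#   .setMaster("spark://master-host:7077")
# and for a YARN-managed cluster:
#   .setMaster("yarn")

sc = SparkContext(conf=conf)
```

The same choice can be made without touching code via `spark-submit --master ...`.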
6
Resilient Distributed Datasets

• Spark is RDD-centric
• RDDs are immutable
• RDDs are computed lazily
• RDDs can be cached
• RDDs know who their parents are
• RDDs that contain only tuples of two elements are "pair RDDs"
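The "immutable, lazy, knows its parents" combination can be mimicked in a few lines of plain Python (a toy model for intuition only, not Spark code):

```python
class ToyRDD:
    """Toy stand-in for an RDD: records its parent and a function,
    and does no work until an action (collect) is called."""
    def __init__(self, data=None, parent=None, fn=None):
        self._data, self._parent, self._fn = data, parent, fn

    def map(self, fn):
        # Transformations return a NEW ToyRDD (immutability) that
        # remembers its parent (lineage); nothing is computed yet.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        # Actions walk the lineage back to the source data.
        if self._parent is None:
            return list(self._data)
        return [self._fn(x) for x in self._parent.collect()]

base = ToyRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)  # no computation happens here
print(doubled.collect())             # lineage evaluated now: [2, 4, 6]
```

Caching would amount to `collect()` memoizing its result so the lineage is not re-walked; Spark's `cache()` plays that role across a cluster.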
7
Useful RDD Actions

• take(n) – return the first n elements in the RDD as an array.
• collect() – return all elements of the RDD as an array. Use with caution.
• count() – return the number of elements in the RDD as an int.
• saveAsTextFile('path/to/dir') – save the RDD to files in a directory. Will create the directory if it doesn't exist and will fail if it does.
• foreach(func) – execute the function against every element in the RDD, but don't keep any results.
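On an ordinary Python list the first three actions have close analogues (a pure-Python analogy for intuition, not Spark code):

```python
data = ["Apple,Amy", "Butter,Bob", "Cheese,Chucky"]

first_two = data[:2]     # like rdd.take(2)
everything = list(data)  # like rdd.collect() - materializes everything at once
n = len(data)            # like rdd.count()

# like rdd.foreach(func): run a side effect per element, keep no results
for line in data:
    pass  # e.g. write `line` to a log

print(first_two, n)
```

The caution on collect() is that, unlike take(n), it pulls the entire distributed dataset into the driver's memory.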
8
Useful RDD Operations
9
map()
Apply an operation to every element of an RDD and return a new RDD that contains the results.

>>> data = sc.textFile('path/to/file')
>>> data.take(3)
[u'Apple,Amy', u'Butter,Bob', u'Cheese,Chucky']
>>> data.map(lambda line: line.split(',')).take(3)
[[u'Apple', u'Amy'], [u'Butter', u'Bob'], [u'Cheese', u'Chucky']]
10
flatMap()
Apply an operation to every element of an RDD and return a new RDD that contains the results after dropping the outermost container.

>>> data = sc.textFile('path/to/file')
>>> data.take(3)
[u'Apple,Amy', u'Butter,Bob', u'Cheese,Chucky']
>>> data.flatMap(lambda line: line.split(',')).take(6)
[u'Apple', u'Amy', u'Butter', u'Bob', u'Cheese', u'Chucky']
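The map/flatMap difference mirrors the two list-comprehension shapes in plain Python (an analogy for intuition, not Spark code):

```python
lines = ["Apple,Amy", "Butter,Bob", "Cheese,Chucky"]

# map: one output element per input element (a list of lists here)
mapped = [line.split(",") for line in lines]

# flatMap: the outermost container is dropped, flattening the result
flattened = [field for line in lines for field in line.split(",")]

print(mapped[0])      # ['Apple', 'Amy']
print(flattened[:4])  # ['Apple', 'Amy', 'Butter', 'Bob']
```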
11
mapValues()
Apply an operation to the value of every element of an RDD and return a new RDD that contains the results. Only works with pair RDDs.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.mapValues(lambda name: name.lower()).take(3)
[(u'Apple', u'amy'), (u'Butter', u'bob'), (u'Cheese', u'chucky')]
12
flatMapValues()
Apply an operation to the value of every element of an RDD and return a new RDD that contains the results after removing the outermost container. Only works with pair RDDs.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.flatMapValues(lambda name: name.lower()).take(3)
[(u'Apple', u'a'), (u'Apple', u'm'), (u'Apple', u'y')]
13
filter()
Return a new RDD that contains only the elements that pass a filter operation.

>>> import re
>>> data = sc.textFile('path/to/file')
>>> data.take(3)
[u'Apple,Amy', u'Butter,Bob', u'Cheese,Chucky']
>>> data.filter(lambda line: re.match(r'^[AEIOU]', line)).take(3)
[u'Apple,Amy', u'Egg,Edward', u'Oxtail,Oscar']
14
groupByKey()
Group all the values that share a key into a single sequence and return the results in a new RDD of (key, iterable) pairs. Only works with pair RDDs.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.groupByKey().take(1)
[(u'Apple', <pyspark.resultiterable.ResultIterable object at 0x102ed1290>)]
>>> for pair in data.groupByKey().take(1):
...     print "%s:%s" % (pair[0], ",".join([n for n in pair[1]]))
Apple:Amy,Adam,Alex
15
reduceByKey()
Combine the elements of an RDD by key, applying a reduce operation to pairs of values until a single value remains for each key. Return the results in a new RDD.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.reduceByKey(lambda v1, v2: v1 + ":" + v2).take(1)
[(u'Apple', u'Amy:Alex:Adam')]
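In plain Python, groupByKey and reduceByKey amount to the following (an analogy for intuition, not Spark code). In Spark itself, reduceByKey is usually preferred for aggregation because it can combine values on each partition before shuffling, while groupByKey ships every value across the network:

```python
pairs = [("Apple", "Amy"), ("Butter", "Bob"), ("Apple", "Alex"), ("Apple", "Adam")]

# groupByKey: collect all values for each key into a sequence
grouped = {}
for k, v in pairs:
    grouped.setdefault(k, []).append(v)

# reduceByKey: fold the values for each key down to a single value
reduced = {}
for k, v in pairs:
    reduced[k] = v if k not in reduced else reduced[k] + ":" + v

print(grouped["Apple"])  # ['Amy', 'Alex', 'Adam']
print(reduced["Apple"])  # 'Amy:Alex:Adam'
```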
16
sortBy()
Sort an RDD according to a sorting function and return the results in a new RDD.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.sortBy(lambda pair: pair[1]).take(3)
[(u'Avocado', u'Adam'), (u'Anchovie', u'Alex'), (u'Apple', u'Amy')]
17
sortByKey()
Sort an RDD according to the natural ordering of the keys and return the results in a new RDD.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.sortByKey().take(3)
[(u'Anchovie', u'Alex'), (u'Apple', u'Amy'), (u'Avocado', u'Adam')]
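Both sorts map directly onto Python's sorted() (an analogy for intuition, not Spark code). Note that 'Anchovie' sorts before 'Apple' because 'n' < 'p':

```python
pairs = [("Apple", "Amy"), ("Avocado", "Adam"), ("Anchovie", "Alex")]

# sortBy(lambda pair: pair[1]): order by an arbitrary key function (the value here)
by_value = sorted(pairs, key=lambda pair: pair[1])

# sortByKey(): order by the natural ordering of the keys
by_key = sorted(pairs, key=lambda pair: pair[0])

print(by_value[0])  # ('Avocado', 'Adam')
print(by_key[0])    # ('Anchovie', 'Alex')
```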
18
subtract()
Return a new RDD that contains all the elements from the original RDD that do not appear in a target RDD.

>>> data1 = sc.textFile('path/to/file1')
>>> data1.take(3)
[u'Apple,Amy', u'Butter,Bob', u'Cheese,Chucky']
>>> data2 = sc.textFile('path/to/file2')
>>> data2.take(3)
[u'Wendy', u'McDonald,Ronald', u'Cheese,Chucky']
>>> data1.subtract(data2).take(3)
[u'Apple,Amy', u'Butter,Bob', u'Dinkel,Dieter']
19
join()
Return a new RDD that contains all the elements from the original RDD joined (inner join) with elements from the target RDD.

>>> data1 = sc.textFile('path/to/file1').map(lambda line: line.split(',')).map(lambda pair: (pair[0], pair[1]))
>>> data1.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data2 = sc.textFile('path/to/file2').map(lambda line: line.split(',')).map(lambda pair: (pair[0], pair[1]))
>>> data2.take(3)
[(u'Doughboy', u'Pilsbury'), (u'McDonald', u'Ronald'), (u'Cheese', u'Chucky')]
>>> data1.join(data2).collect()
[(u'Cheese', (u'Chucky', u'Chucky'))]
>>> data1.fullOuterJoin(data2).take(2)
[(u'Apple', (u'Amy', None)), (u'Cheese', (u'Chucky', u'Chucky'))]
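The output shape is the key point: join pairs up values key by key, and fullOuterJoin keeps every key, filling the missing side with None. In plain Python (an analogy for intuition, not Spark code):

```python
d1 = {"Apple": "Amy", "Cheese": "Chucky"}
d2 = {"Doughboy": "Pilsbury", "Cheese": "Chucky"}

# join: inner join, only keys present on both sides survive
inner = [(k, (d1[k], d2[k])) for k in d1 if k in d2]

# fullOuterJoin: every key from either side, None where a side is missing
outer = [(k, (d1.get(k), d2.get(k))) for k in sorted(set(d1) | set(d2))]

print(inner)     # [('Cheese', ('Chucky', 'Chucky'))]
print(outer[0])  # ('Apple', ('Amy', None))
```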
20
Thank you