Python & Spark Msc. Johannes Härtel PTT18/19 …softlang/pttcourse/...(C) 2018, SoftLang Team,...
(C) 2018, SoftLang Team, University of Koblenz-Landau
Python & Spark PTT18/19
Prof. Dr. Ralf Lämmel
Msc. Johannes Härtel
Msc. Marcel Heinz
The ‘Big Picture’
[Aggarwal15]
Plenty of Building Blocks are involved in this ‘Big Picture’
Back to the ‘Big Picture’
[Aggarwal15]
Foundations
Technologies and APIs
There are several technologies and APIs related to data analysis in Python, but the most convenient one is Pandas.
The following tutorial is inspired by the book ‘Python for Data Analysis’ [McKinney12].
What is contained in this CSV?
Some imports and configuration are needed to read and print a CSV file with Pandas.
CSV File
Python
Jack Nicholson (angry)
What is contained in this CSV?
Reading and printing CSV data with Pandas.
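The slide code is not part of this transcript. A minimal sketch of reading and printing CSV data could look like the following; the CSV content and the column names are assumptions made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file on the slide; column names are assumptions.
csv_text = """user_id,title,genre,gender,rating
1,Toy Story,Animation,F,5
1,Heat,Action,F,3
2,Toy Story,Animation,M,4
"""

# Configuration: widen the console output so printed columns are not truncated.
pd.set_option("display.width", 200)

# read_csv accepts a file path or any file-like object.
ratings = pd.read_csv(io.StringIO(csv_text))
print(ratings)
```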
What are the first 5 ratings in this CSV?
Selecting a range of rows returns another DataFrame.
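As a sketch (the data below is invented, not the course dataset), a row range can be selected with a slice:

```python
import pandas as pd

# Hypothetical ratings; the real data comes from the CSV file on the slides.
ratings = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F", "G"],
    "rating": [5, 3, 4, 2, 5, 1, 4],
})

# Slicing with a row range selects rows by position and yields a DataFrame.
first_five = ratings[0:5]
print(type(first_five).__name__)
print(first_five)
```

`ratings.head(5)` is an equivalent and common alternative.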
What is the title a rating refers to?
Selecting one column returns a Series (╯°□°)╯︵ ┻━┻
What are the gender and the genre of a rating?
Selecting columns by passing a list returns a DataFrame ┬──┬◡ノ(° -°ノ)
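The difference between the two kinds of column selection can be sketched as follows; the column names are assumptions based on the slide titles.

```python
import pandas as pd

# Hypothetical ratings table.
ratings = pd.DataFrame({
    "title": ["Toy Story", "Heat", "Toy Story"],
    "genre": ["Animation", "Action", "Animation"],
    "gender": ["F", "F", "M"],
    "rating": [5, 3, 4],
})

# A single column name returns a one-dimensional Series.
titles = ratings["title"]

# A list of column names returns a two-dimensional DataFrame.
gender_and_genre = ratings[["gender", "genre"]]

print(type(titles).__name__, type(gender_and_genre).__name__)
```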
What are the ratings of female persons?
First we need a condition for filtering. Such a condition can be stated as a Series of booleans.
What are the ratings of female persons?
We can use this condition as a selection mechanism for rows.
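A minimal sketch of both steps, with invented data and an assumed ‘gender’ column:

```python
import pandas as pd

# Hypothetical ratings table.
ratings = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "M"],
    "rating": [5, 3, 4, 2, 5],
})

# Step 1: comparing a column with a value yields a Series of booleans.
is_female = ratings["gender"] == "F"

# Step 2: indexing with the boolean Series keeps only the rows marked True.
female_ratings = ratings[is_female]
print(female_ratings)
```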
What is the number of female and male ratings?
Let’s try this!
What is the number of female and male ratings?
But we can also use dedicated Pandas functionality to create a Series that is indexed by the distinct values.
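Both approaches can be sketched on invented data. Using `value_counts` here is an assumption about which Pandas function the slide refers to.

```python
import pandas as pd

# Hypothetical ratings table.
ratings = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "M"],
    "rating": [5, 3, 4, 2, 5],
})

# Manual approach: filter with a condition and count the remaining rows.
female_count = len(ratings[ratings["gender"] == "F"])
male_count = len(ratings[ratings["gender"] == "M"])

# Dedicated functionality: a Series indexed by the distinct values.
counts = ratings["gender"].value_counts()
print(counts)
```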
What is the number of female and male ratings?
… and we can make Python plot this.
What is the average rating given by a user?
First we need to group the ratings of users. The following shows how to get all ratings of one user.
What is the average rating given by a user?
After grouping we can select the rating column and take the mean for each group.
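A sketch of grouping and aggregating, assuming a ‘user_id’ column (the name is hypothetical):

```python
import pandas as pd

# Hypothetical ratings table.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "rating": [5, 3, 4, 2, 3],
})

# Group the rows by user.
by_user = ratings.groupby("user_id")

# All ratings of one user.
ratings_of_user_1 = by_user.get_group(1)

# Select the rating column and take the mean within each group.
mean_rating_per_user = by_user["rating"].mean()
print(mean_rating_per_user)
```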
What is the average rating given by a user?
We can also summarize the result as a boxplot.
What is a gender’s average rating of a film?
A pivot table specifies rows and columns and aggregates the values using a passed function.
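A small sketch of such a pivot table. The data is invented; the name ‘film_mean_ratings’ is taken from a later slide.

```python
import pandas as pd

# Hypothetical ratings table.
ratings = pd.DataFrame({
    "title": ["Toy Story", "Toy Story", "Toy Story", "Heat", "Heat", "Heat"],
    "gender": ["F", "F", "M", "F", "M", "M"],
    "rating": [5, 4, 3, 2, 4, 5],
})

# Rows are films, columns are genders, cells hold the mean rating.
film_mean_ratings = ratings.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)
print(film_mean_ratings)
```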
What are the top female rated films?
i) We filter out films below a rating count of 250 to concentrate on the important candidates.
ii) We increase the max rows since this is serious data!
iii) We sort by column ‘F’ containing the average female ratings.
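The three steps can be sketched as follows. The data is invented, and the threshold is lowered from 250 to 2 so that the tiny example still has something to filter.

```python
import pandas as pd

# Hypothetical ratings table.
ratings = pd.DataFrame({
    "title": ["Toy Story", "Toy Story", "Heat", "Heat", "Alien"],
    "gender": ["F", "M", "F", "M", "F"],
    "rating": [5, 4, 3, 5, 4],
})
MIN_RATINGS = 2  # the slide uses 250; lowered for this toy data

# i) Keep only films with enough ratings.
ratings_per_title = ratings.groupby("title").size()
popular_titles = ratings_per_title[ratings_per_title >= MIN_RATINGS].index

film_mean_ratings = ratings.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)
film_mean_ratings = film_mean_ratings.loc[popular_titles]

# ii) Allow more rows to be printed.
pd.set_option("display.max_rows", 500)

# iii) Sort by the average female rating, best first.
top_female = film_mean_ratings.sort_values(by="F", ascending=False)
print(top_female)
```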
What are the top female rated films?
What is the film with the biggest disagreement between female and male ratings?
We add a new column to the ‘film_mean_ratings’ DataFrame that holds the difference between the female and male columns.
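A sketch with invented mean ratings. The column name ‘diff’ and the direction of the subtraction (M minus F) are assumptions; the slide only states that the difference is stored in a new column.

```python
import pandas as pd

# Hypothetical mean ratings per film and gender.
film_mean_ratings = pd.DataFrame(
    {"F": [4.5, 2.0, 3.5], "M": [3.0, 4.5, 3.5]},
    index=["Toy Story", "Heat", "Alien"],
)

# Assigning to a new column name adds the column.
film_mean_ratings["diff"] = film_mean_ratings["M"] - film_mean_ratings["F"]

# The largest absolute difference marks the biggest disagreement.
most_disputed = film_mean_ratings["diff"].abs().idxmax()
print(film_mean_ratings.sort_values(by="diff"))
print(most_disputed)
```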
What is the film with the biggest disagreement between female and male ratings?
What is the movie with the most disagreement among all viewers?
The standard deviation can be used to describe such disagreement in ratings.
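As a sketch on invented data, the standard deviation per film can be computed after grouping by title:

```python
import pandas as pd

# Hypothetical ratings table.
ratings = pd.DataFrame({
    "title": ["Toy Story", "Toy Story", "Toy Story", "Heat", "Heat", "Heat"],
    "rating": [5, 5, 4, 1, 5, 3],
})

# Sample standard deviation of the ratings of each film.
rating_std_by_title = ratings.groupby("title")["rating"].std()

# Highest standard deviation first.
most_disputed_first = rating_std_by_title.sort_values(ascending=False)
print(most_disputed_first)
```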
What is the movie with the most disagreement among all viewers?
Back to the ‘Big Picture’
[Aggarwal15]
Data
Data Integration (JSON)
JSON data can be loaded from a file and accessed much like dictionaries.
JSON File
Python
cf. [web_json]
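A minimal sketch using the standard ‘json’ module. The JSON content is invented, and an in-memory file object stands in for a real file.

```python
import io
import json

# Hypothetical JSON document; io.StringIO stands in for an opened file.
json_file = io.StringIO(
    '{"name": "Alice", "languages": ["Python", "Java"], "address": {"city": "Koblenz"}}'
)

# json.load reads from a file object and returns dictionaries and lists.
data = json.load(json_file)

# Access works like for nested dictionaries and lists.
print(data["name"])
print(data["languages"][0])
print(data["address"]["city"])
```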
Data Integration (SQL)
An SQLite package (e.g., Python’s built-in ‘sqlite3’ module) provides, for instance, an in-memory database.
cf. [web_sql]
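A sketch with the standard ‘sqlite3’ module; the table and its content are invented for illustration.

```python
import sqlite3

# ":memory:" creates a database that lives only in RAM.
connection = sqlite3.connect(":memory:")

# Hypothetical table and rows.
connection.execute("CREATE TABLE ratings (title TEXT, rating INTEGER)")
connection.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [("Toy Story", 5), ("Heat", 3), ("Toy Story", 4)],
)

# Query the average rating per film.
rows = connection.execute(
    "SELECT title, AVG(rating) FROM ratings GROUP BY title ORDER BY title"
).fetchall()
connection.close()
print(rows)
```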
Data Integration (CSV)
Some CSV data needs to be combined before being processed.
cf. [McKinney12]
Data Integration (CSV)
Comparable to joining tables in SQL, Pandas can merge different DataFrames.
cf. [McKinney12]
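A sketch of merging three small tables. The tables and column names are assumptions modeled on the users, ratings and movies split used in [McKinney12].

```python
import pandas as pd

# Hypothetical tables that would normally come from separate CSV files.
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
})
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Toy Story", "Heat"]})

# Without an explicit key, merge joins on the columns both tables share.
data = pd.merge(pd.merge(ratings, users), movies)
print(data)
```

Passing `on="user_id"` (or `on="movie_id"`) makes the join key explicit.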
Feature Extraction (Java)
The ‘right’ features need to be extracted from artifacts for further processing.
SomeClassDoingNothing → Some Class Doing Nothing → some class doing nothing
[AntoniolCCD00]
Feature Extraction (Java)
The ‘javalang’ package, written in Python, provides a parser for Java and can be installed from Git.
[web_jl]
Feature Extraction (Java)
The Java abstract syntax tree can be created from a file using ‘javalang’.
Java
Feature Extraction (Java)
Intuitively, the most relevant feature in this artifact is the class name.
Java
SomeClassDoingNothing
Feature Extraction (Java)
Camel-case is split and strings are made lower-case.
SomeClassDoingNothing
Some Class Doing Nothing
some class doing nothing
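The splitting step can be sketched with a regular expression. The function name and the pattern are illustrative choices, not the code from the slide.

```python
import re


def split_camel_case(name):
    """Split a camel-case identifier into lower-case words."""
    # Alternatives: an acronym followed by a capitalized word, a (capitalized)
    # word, a trailing run of capitals, or a run of digits.
    words = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", name)
    return [word.lower() for word in words]


tokens = split_camel_case("SomeClassDoingNothing")
print(" ".join(tokens))
```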
Back to the ‘Big Picture’
[Aggarwal15]
Analytical Processing
Classification
The ‘scikit-learn’ package provides support vector machines, a supervised machine learning technique for classification.
cf. [scikit_cls]
[Aggarwal15]
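A minimal sketch, assuming ‘scikit-learn’ is installed; the two training points and the linear kernel are chosen only to keep the example small.

```python
from sklearn import svm

# Hypothetical training data: two points with class labels 0 and 1.
features = [[0.0, 0.0], [1.0, 1.0]]
labels = [0, 1]

# Train a support vector classifier with a linear kernel.
classifier = svm.SVC(kernel="linear")
classifier.fit(features, labels)

# Classify unseen points on either side of the separating line.
predictions = classifier.predict([[2.0, 2.0], [-1.0, -1.0]])
print(predictions)
```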
Classification
Support vector machines in Python Spark.
[spark]
Clustering
The ‘scipy’ package provides hierarchical clustering, an unsupervised machine learning technique, used here to group two-dimensional data.
cf. [web_cluster]
Clustering
Hierarchical clustering outputs a linkage array that can be depicted as a dendrogram.
cf. [web_cluster]
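A sketch of both clustering slides on invented two-dimensional points, assuming ‘numpy’ and ‘scipy’ are installed. The Ward method and the cut into two clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical two-dimensional data: two well-separated groups of points.
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# The linkage array has one row per merge step (n - 1 rows, 4 columns).
linkage_array = linkage(points, method="ward")

# Cut the hierarchy into two flat clusters.
cluster_labels = fcluster(linkage_array, t=2, criterion="maxclust")

# no_plot=True computes the dendrogram layout without drawing it.
layout = dendrogram(linkage_array, no_plot=True)
print(cluster_labels)
```

Calling `dendrogram(linkage_array)` without `no_plot` draws the tree with Matplotlib.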
Clustering
K-means clustering in Python Spark.
[spark]
Back to the ‘Big Picture’
[Aggarwal15]
Output
Plot Types (Boxplot)
Gives a summary of the distribution of numeric variables.
Package:
● Matplotlib
● Seaborn
cf. [seaborn]
Plot Types (Line chart)
Depicts the evolution of one or many columns.
Package:
● Matplotlib
Plot Types (Bar chart)
Depicts the ranking present in one column.
Package:
● Matplotlib
Plot Types (Scatter plot)
Depicts the correlation of two columns.
Package:
● Matplotlib
● Seaborn
Plot Types (Pie plot)
Depicts the part-whole relation.
cf. [py_pie]
Package:
● Matplotlib
Scaling and Axis
The table shows metrics on, e.g., the code contributed by developers (column ‘DCon_PE_d’). While a few developers have very high contribution values, most developers’ contributions to a single project are very low.
Scaling and Axis
Axes can have different scales to depict the data correctly.
Scaling and Axis
Setting the axis to a log scale does not work because of the 0 entries (the logarithm of 0 is undefined).
Scaling and Axis
However, symlog works because it scales linearly below a given threshold.
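A sketch with Matplotlib on invented values that include zeros. The ‘Agg’ backend is selected only so that the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; an assumption for headless runs
import matplotlib.pyplot as plt

# Hypothetical contribution values: mostly small, a few very large, some 0.
contributions = [0, 0, 1, 3, 10, 250, 40000]

figure, axis = plt.subplots()
axis.plot(range(len(contributions)), contributions, "o")

# symlog is linear around 0 and logarithmic further out, so zeros can be shown.
axis.set_yscale("symlog")
figure.savefig("contributions_symlog.png")
```

The size of the linear region can be tuned through a threshold parameter of the symlog scale; its keyword name depends on the Matplotlib version.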
Subplots
Subplots can be used to group multiple plots that optionally share axes.
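A sketch of two subplots that share their y-axis; the plotted values are invented.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; an assumption for headless runs
import matplotlib.pyplot as plt

# One row, two columns; sharey=True links the y-axes of both plots.
figure, (left_axis, right_axis) = plt.subplots(1, 2, sharey=True)

left_axis.plot([1, 2, 3], [2, 4, 8])
right_axis.bar(["a", "b", "c"], [3, 1, 5])

# Changing the limits on one plot also changes them on the other.
left_axis.set_ylim(0, 10)
figure.savefig("subplots.png")
```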
Subplots
A sample of subplots showing the relation between API usage and lines of code for individual APIs.
Subplots
Another sample with different kinds of subplots sharing axes.
Back to the ‘Big Picture’
[Aggarwal15]
References
● [Aggarwal15] Aggarwal, Charu C. "Data mining: the textbook", Springer, 2015.
● [McKinney12] McKinney, Wes. "Python for data analysis", 2012.
● [AntoniolCCD00] Antoniol, Giuliano, et al. "Information retrieval models for recovering traceability links between code and documentation", ICSM, IEEE, 2000.
● [Haslwanter16] Haslwanter, Thomas. "An Introduction to Statistics with Python", Springer, 2016.
● [web_json] https://developer.rhino3d.com/guides/rhinopython/python-xml-json/
● [web_sql] https://www.pythoncentral.io/introduction-to-sqlite-in-python/
● [webGG] https://python-graph-gallery.com/
● [web_jl] https://github.com/c2nes/javalang
● [pandas_interpolate] https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html
● [scikit_cls] http://scikit-learn.org/stable/modules/svm.html
● [web_cluster] https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
● [NL_reuters] https://github.com/fergiemcdowall/reuters-21578-json.git
● [seaborn] https://seaborn.pydata.org/
● [py_pie] https://pythonspot.com/matplotlib-pie-chart/
● [spark] https://spark.apache.org/docs/latest/
● [spark_bp] https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/avoiding_shuffle_less_stage,_more_fast.html