The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
![Page 1: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/1.jpg)
The Nitty Gritty of Advanced Analytics
Using Apache Spark in Python
Miklos Christine, Solutions Architect, @Miklos_C
![Page 2: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/2.jpg)
About Me
Miklos Christine
Solutions Architect @ Databricks
- Twitter: @Miklos_C
Systems Engineer @ Cloudera
- Supported a few of the largest clusters in the world
Software Engineer @ Cisco
UC Berkeley Graduate
![Page 3: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/3.jpg)
We are Databricks, the company behind Spark
Founded by the creators of Apache Spark in 2013
Share of Spark code contributed by Databricks in 2014: 75%
Data Value
Created Databricks on top of Spark to make big data simple.
![Page 4: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/4.jpg)
…
Apache Spark Engine
Spark Core
Spark Streaming, Spark SQL, MLlib, GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
![Page 5: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/5.jpg)
![Page 6: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/6.jpg)
2010: started @ Berkeley
2012: research paper
2013: Databricks started & donated to ASF
2014: Spark 1.0 & libraries (SQL, ML, GraphX)
2015: DataFrames, Tungsten, ML Pipelines
2016: Spark 2.0
![Page 7: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/7.jpg)
Spark Community Growth
• Spark Survey 2015 Highlights
• End of Year Spark Highlights
![Page 8: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/8.jpg)
2015: A Great Year for Spark
Most active open source project in (big) data
• 1000+ code contributors
New language: R
Widespread industry support & adoption
![Page 9: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/9.jpg)
![Page 10: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/10.jpg)
![Page 11: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/11.jpg)
![Page 12: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/12.jpg)
![Page 13: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/13.jpg)
HOW RESPONDENTS ARE RUNNING SPARK
51% on a public cloud
TOP ROLES USING SPARK
41% of respondents identify themselves as Data Engineers
22% of respondents identify themselves as Data Scientists
![Page 14: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/14.jpg)
Spark User Highlights
![Page 15: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/15.jpg)
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
![Page 16: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/16.jpg)
Large-Scale Usage
Largest cluster: 8000 Nodes (Tencent)
Largest single job: 1 PB (Alibaba, Databricks)
Top Streaming Intake: 1 TB/hour (HHMI Janelia Farm)
2014 On-Disk Sort Record: Fastest Open Source Engine for sorting a PB
![Page 17: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/17.jpg)
Spark API Performance
![Page 18: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/18.jpg)
History of Spark APIs
RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)
DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast/efficient internal representations
Dataset (2015)
• Internally rows, externally JVM objects
• Almost the "best of both worlds": type safe + fast
• But slower than DataFrames
• Not as good for interactive analysis, especially in Python
![Page 19: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/19.jpg)
Benefit of Logical Plan: Performance Parity Across Languages
(Chart series: DataFrame, RDD)
![Page 20: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/20.jpg)
ETL with Spark
![Page 21: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/21.jpg)
ETL: Extract, Transform, Load
● Key factor for big data platforms
● Provides Speed Improvements in All Workloads
● Typically Executed by Data Engineers
![Page 22: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/22.jpg)
File Formats
● Text File Formats
○ CSV
○ JSON
● Avro Row Format
● Parquet Columnar Format
![Page 23: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/23.jpg)
File Formats + Compression
● File Formats
○ JSON
○ CSV
○ Avro
○ Parquet
● Compression Codecs
○ No compression
○ Snappy
○ Gzip
○ LZO
![Page 24: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/24.jpg)
Spark Parquet Properties
● Industry Standard File Format: Parquet
○ Write to Parquet:
df.write.format("parquet").save("namesAndAges.parquet")
df.write.format("parquet").saveAsTable("myTestTable")
○ For compression, set spark.sql.parquet.compression.codec (e.g. gzip or snappy):
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
![Page 25: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/25.jpg)
Small Files Problem
● Small files problem still exists
● Metadata loading
● APIs:
df.coalesce(N)
df.repartition(N)
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
![Page 26: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/26.jpg)
All About Partitions
● RDD / DataFrame Partitions
df.rdd.getNumPartitions()
● SparkSQL Shuffle Partitions
spark.sql.shuffle.partitions
● Table Level Partitions
df.write.partitionBy("year").\
    save("data.parquet")
![Page 27: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/27.jpg)
PySpark ETL APIs - Text Formats
# CSV
df = sqlContext.read.\
    format('com.databricks.spark.csv').\
    options(header='true', inferSchema='true').\
    load('/path/to/data')
# JSON
df = sqlContext.read.json("/tmp/test.json")
df.write.json("/tmp/test_output.json")
![Page 28: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/28.jpg)
PySpark ETL APIs - Container Formats
# Binary Container Formats
# Avro
df = sqlContext.read.\
format("com.databricks.spark.avro").\
load("/path/to/files/")
# Parquet
df = sqlContext.read.parquet("/path/to/files/")
df.write.parquet("/path/to/files/")
![Page 29: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/29.jpg)
PySpark ETL APIs
● Manage Number of Files
○ APIs manage the number of files per directory
df.repartition(80).\
    write.\
    parquet("/path/to/parquet/")
df.repartition(80).\
    write.\
    partitionBy("year").\
    parquet("/path/to/parquet/")
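One way to choose the N passed to df.repartition(N) is to divide the dataset size by a target file size. The helper below is a minimal plain-Python sketch of that arithmetic. The function name, the 128 MB default, and the 10 GB example are assumptions for illustration and are not part of the Spark API.

```python
import math

def estimate_num_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Return how many output files to aim for so each is about target_file_bytes."""
    if total_bytes <= 0:
        return 1
    # Round up so no file exceeds the target, and never return fewer than one.
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Hypothetical example: a 10 GB dataset written as ~128 MB files.
n = estimate_num_partitions(10 * 1024 ** 3)
print(n)  # 80
```

The result could then be passed to df.repartition(n) before the write. On-disk compressed size and in-memory size can differ substantially, so the estimate is only a starting point to adjust after inspecting the output files.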
![Page 30: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/30.jpg)
Common ETL Problems
● Malformed JSON Records
sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
● Mismatched DataFrame Schema
○ Null Representation vs Schema DataType
● Many Small Files / No Partition Strategy
○ Parquet Files: ~128MB - 256MB Compressed
Ref: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html
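The _corrupt_record query above surfaces rows that Spark could not parse. The same idea can be sketched in plain Python as a quick local check on a sample of the input before loading it. split_json_lines is a hypothetical helper, not a Spark API, and it assumes one JSON object per line.

```python
import json

def split_json_lines(lines):
    """Separate parseable JSON records from malformed ones."""
    good, bad = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            good.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            bad.append(line)
    return good, bad

# Hypothetical sample: one valid record, one truncated record, one blank line.
sample = ['{"id": 1, "body": "ok"}', '{"id": 2, "body": ', '']
records, corrupt = split_json_lines(sample)
print(len(records), corrupt)
```

Counting the malformed lines in a sample gives an early signal of how much data a full load would push into _corrupt_record.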
![Page 31: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/31.jpg)
Debugging Spark
Spark Driver Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 362.0 failed 4 times, most recent failure: Lost task 1.3 in stage 362.0 (TID 275202, ip-10-111-225-98.ec2.internal): java.nio.channels.ClosedChannelException
Spark Executor Error:
16/04/13 20:02:16 ERROR DefaultWriterContainer: Aborting task.
java.text.ParseException: Unparseable number: "\N"
at java.text.NumberFormat.parse(NumberFormat.java:385)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply$mcD$sp(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at scala.util.Try.getOrElse(Try.scala:77)
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:58)
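The executor trace above shows a numeric cast failing on the string \N, a null marker that some exports (for example MySQL and Hive text dumps) write for NULL. The following is a minimal plain-Python sketch of handling such markers before a cast. parse_double and NULL_MARKERS are hypothetical names. In spark-csv itself the analogous fix is likely its nullValue read option, which should be checked against the version in use.

```python
NULL_MARKERS = {"\\N", ""}  # assumed null spellings in the input

def parse_double(field):
    """Convert a CSV field to float, mapping null markers to None."""
    if field in NULL_MARKERS:
        return None
    return float(field)

# Hypothetical row containing a null marker in the second column.
row = ["42", "\\N", "3.5"]
print([parse_double(f) for f in row])  # [42.0, None, 3.5]
```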
![Page 32: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/32.jpg)
Debugging Spark
![Page 33: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/33.jpg)
SQL with Spark
![Page 34: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/34.jpg)
SparkSQL Best Practices
● DataFrames and SparkSQL are synonyms
● Use builtin functions instead of custom UDFs
○ import pyspark.sql.functions
● Examples:
○ to_date()
○ get_json_object()
○ regexp_extract()
○ hour() / minute()
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
![Page 35: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/35.jpg)
SparkSQL Best Practices
● Large Table Joins
○ Largest Table on LHS
○ Increase Spark Shuffle Partitions
○ Leverage "cluster by" API included in Spark 1.6
sqlCtx.sql("select * from large_table_1 cluster by num1").\
    registerTempTable("sorted_large_table_1")
sqlCtx.sql("cache table sorted_large_table_1")
![Page 36: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/36.jpg)
PySpark API Best Practices
● User Defined Functions (UDFs)
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
add_n = F.udf(lambda x, y: x + y, IntegerType())
# Define a UDF that adds two integers, then use it to add a column to the DataFrame; the id column is cast to an Integer type.
df = df.withColumn('id_offset',
    add_n(F.lit(1000), df.id.cast(IntegerType())))
![Page 37: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/37.jpg)
PySpark API Best Practices
● Built-in Functions
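from pyspark.sql import functions as F
# Lowercase the text column and attach a unique id to each row.
corpus_df = df.select(
    F.lower(F.col('body')).alias('corpus'),
    F.monotonically_increasing_id().alias('id'))
# Day of week from a Unix timestamp, shifted from UTC to PST.
corpus_df = df.select(
    F.date_format(
        F.from_utc_timestamp(F.from_unixtime(F.col('created_utc')), "PST"),
        'EEEE').alias('dayofweek'))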
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
![Page 38: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/38.jpg)
PySpark API Best Practices
● User Defined Functions (UDFs)
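from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
def squared(s):
    return s * s
sqlContext.udf.register("squaredWithPython", squared)
# Wrap the same function for the DataFrame API; LongType assumes an integer id column.
squared_udf = udf(squared, LongType())
display(df.select("id", squared_udf("id").alias("id_squared")))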
![Page 39: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/39.jpg)
ML with Spark
![Page 40: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/40.jpg)
Data Science Time
![Page 41: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/41.jpg)
Why Spark ML
Provide general purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib's Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
![Page 42: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/42.jpg)
High-level functionality in MLlib
Learning tasks
• Classification
• Regression
• Recommendation
• Clustering
• Frequent itemsets
Workflow utilities
• Model import/export
• Pipelines
• DataFrames
• Cross validation
Data utilities
• Feature extraction & selection
• Statistics
• Linear algebra
![Page 43: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/43.jpg)
Machine Learning: What and Why?
ML uses data to identify patterns and make decisions.
Core value of ML is automated decision making
• Especially important when dealing with TB or PB of data
Many Use Cases including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations
![Page 44: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/44.jpg)
Algorithm coverage in MLlib
Classification
• Logistic regression w/ elastic net
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron
• One-vs-rest
Regression
• Least squares w/ elastic net
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Recommendation
• Alternating Least Squares
Frequent itemsets
• FP-growth
• Prefix span
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
  • Block-partitioned matrix
  • Row matrix
  • Indexed row matrix
  • Coordinate matrix
• Matrix decompositions
Model import/export
Pipelines
Feature extraction & selection
• Binarizer
• Bucketizer
• Chi-Squared selection
• CountVectorizer
• Discrete cosine transform
• ElementwiseProduct
• Hashing term frequency
• Inverse document frequency
• MinMaxScaler
• Ngram
• Normalizer
• One-Hot Encoder
• PCA
• PolynomialExpansion
• RFormula
• SQLTransformer
• Standard scaler
• StopWordsRemover
• StringIndexer
• Tokenizer
• VectorAssembler
• VectorIndexer
• VectorSlicer
• Word2Vec
List based on Spark 1.5
![Page 45: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/45.jpg)
Spark ML Best Practices
● Spark MLLib vs SparkML
○ Understand the differences
● Don't Pipeline Too Many Stages
○ Check Results Between Stages
![Page 46: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/46.jpg)
PySpark ML API Best Practices
![Page 47: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/47.jpg)
PySpark ML API Best Practices
![Page 48: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/48.jpg)
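PySpark ML API Best Practices
● DataFrame to RDD Mapping
# Assumed setup (not shown on the slide): NLTK tokenizer, stopword list, and Porter stemmer.
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
PUNCTUATION = set(string.punctuation)
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if letter not in PUNCTUATION])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]
# Map each Row to (id, tokens); going through .rdd is needed on Spark 2.0+, where PySpark's DataFrame.map was removed.
rdd = wordsDataFrame.rdd.map(lambda x: (x['id'], tokenize(x['corpus'])))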
![Page 49: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/49.jpg)
Learning more about MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• The above 2 links are part of the Databricks Guide, which contains many more examples and references.
References
• Apache Spark MLlib User Guide
  • The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. "MLlib: Machine Learning in Apache Spark." 2015. http://arxiv.org/abs/1505.06807 (academic paper)
![Page 50: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/50.jpg)
Spark Demo
![Page 51: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python](https://reader034.fdocuments.us/reader034/viewer/2022042618/58ae8a461a28abdf068b4eed/html5/thumbnails/51.jpg)
Thanks!
Sign Up For Databricks Community Edition! http://go.databricks.com/databricks-community-edition-beta-waitlist