Improving Python and Spark (PySpark) Performance and Interoperability
www.twosigma.com
Improving Python and Spark Performance and Interoperability
February 9, 2017 All Rights Reserved
Wes McKinney @wesmckinn
Spark Summit East 2017
February 9, 2017
Me
• Currently: Software Architect at Two Sigma Investments
• Creator of Python pandas project
• PMC member for Apache Arrow and Apache Parquet
• Other Python projects: Ibis, Feather, statsmodels
• Formerly: Cloudera, DataPad, AQR
• Author of Python for Data Analysis
Important Legal Information The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time.
Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
This talk
• Why some parts of PySpark are “slow”
• Technology that can help make things faster
• Work we have done to make improvements
• Future roadmap
Python and Spark
• Spark is implemented in Scala, runs on the Java virtual machine (JVM)
• Spark has Python and R APIs with partial or full coverage for many parts of the Scala Spark API
• In some Spark tasks, Python is only a scripting front-end.
• This means no interpreted Python code is executed once the Spark job starts
• Other PySpark jobs suffer performance and interoperability issues that we’re going to analyze in this talk
Spark DataFrame performance
Source: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Spark DataFrame performance can be misleading
• Spark DataFrames are an example of Python as a DSL / scripting front end
• Excepting UDFs (.map(…) or sqlContext.registerFunction), no Python code is evaluated in the Spark job
• Python API calls create SQL query plans inside the JVM — so Scala and Python versions are computationally identical
Spark DataFrames as deferred DSL
young = users[users.age < 21]
young.groupBy("gender").count()
Spark DataFrames as deferred DSL
SELECT gender, COUNT(*)
FROM users
WHERE age < 21
GROUP BY 1
Spark DataFrames as deferred DSL
Aggregation[table]
  table:
    Table: users
  metrics:
    count = Count[int64]
      Table: ref_0
  by:
    gender = Column[array(string)] 'gender' from users
  predicates:
    Less[array(boolean)]
      age = Column[array(int32)] 'age' from users
      Literal[int8] 21
Where Python code and Spark meet
• Unfortunately, many PySpark jobs cannot be expressed entirely as DataFrame operations or other built-in Scala constructs
• Spark-Scala interacts with in-memory Python in key ways:
• Reading and writing in-memory datasets to/from the Spark driver
• Evaluating custom Python code (user-defined functions)
How PySpark lambda functions work
• The anatomy of rdd.map(lambda x: …) and df.withColumn(py_func(...))

[Diagram: the Scala RDD distributes work to a pool of Python worker processes; see PythonRDD.scala]
PySpark lambda performance problems
• See 2016 talk “High Performance Python on Apache Spark”
• http://www.slideshare.net/wesm/high-performance-python-on-apache-spark
• Problems
• Inefficient data movement (serialization / deserialization)
• Scalar computation model: object boxing and interpreter overhead
• General summary: PySpark is not currently designed to achieve high performance in the way that pandas and NumPy are.
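The data-movement cost can be made concrete with a stdlib-only sketch (illustrative, not PySpark's actual serializer): moving 10,000 rows as pickled Python objects versus as packed columnar buffers of doubles.

```python
import pickle
import struct

# Row-oriented transfer: one boxed Python dict per row, as PySpark's
# pickle-based serializers effectively move data.
rows = [{"x": float(i), "y": 2.0 * i} for i in range(10_000)]
row_bytes = pickle.dumps(rows)

# Columnar transfer: each column is one contiguous run of 8-byte
# little-endian doubles, with no per-value object overhead.
xs = struct.pack("<10000d", *(r["x"] for r in rows))
ys = struct.pack("<10000d", *(r["y"] for r in rows))
col_bytes = xs + ys  # exactly 2 columns * 10,000 * 8 bytes

# The columnar buffer is both smaller and trivially cheap to decode.
assert len(col_bytes) == 160_000
assert len(row_bytes) > len(col_bytes)
```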
Other issues with PySpark lambdas
• Computation model unlike what pandas users are used to
• In dataframe.map(f), the Python function f only sees one Row at a time
• A more natural and efficient vectorized API would be:
• dataframe.map_pandas(lambda df: …)
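A pure-Python sketch of the two computation models (map_pandas is hypothetical and did not exist in Spark at the time; plain lists stand in for pandas objects here):

```python
def map_rows(f, rows):
    """Row-at-a-time model: f is invoked once per record, like rdd.map."""
    return [f(r) for r in rows]

def map_batch(f, batch):
    """Vectorized model: f is invoked once with the whole columnar batch."""
    return f(batch)

rows = [{"age": a} for a in (15, 22, 34, 19)]
batch = {"age": [r["age"] for r in rows]}

# Per-row: 4 Python-level calls, one boxed value at a time.
doubled_rows = map_rows(lambda r: r["age"] * 2, rows)

# Per-batch: 1 call, whose body could dispatch to vectorized kernels.
doubled_batch = map_batch(lambda b: [a * 2 for a in b["age"]], batch)

assert doubled_rows == doubled_batch == [30, 44, 68, 38]
```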
Apache Arrow
Apache Arrow: Process and Move Data Fast
• New Top-level Apache project as of February 2016
• Collaboration amongst broad set of OSS projects around shared needs
• Language-independent columnar data structures
• Metadata for describing schemas / chunks of data
• Protocol for moving data between processes with minimal serialization overhead
High performance data interchange
[Diagram: data interchange today (a bespoke converter between every pair of systems) vs. with Arrow (each system converts once, to and from a shared columnar format). Source: Apache Arrow]
What does Apache Arrow give you?
• Zero-copy columnar data: Complex table and array data structures that can reference memory without copying it
• Ultrafast messaging: Language-agnostic metadata, batch/file-based and streaming binary formats
• Complex schema support: Flat and nested data types
• C++, Python, and Java Implementations: with integration tests
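The zero-copy idea can be illustrated with a stdlib analogy (Arrow itself is not used here): like an Arrow array, a memoryview references an existing buffer instead of copying it.

```python
# A memoryview slice references the underlying buffer without copying.
buf = bytearray(8)
view = memoryview(buf)[4:8]  # no bytes are copied here

# Mutating the underlying buffer is visible through the view,
# proving the two share the same memory.
buf[4] = 0xFF
assert view[0] == 0xFF
```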
Arrow binary wire formats
Extreme performance to pandas from Arrow streams
PyArrow file and streaming API
from pyarrow import StreamReader

reader = StreamReader(stream)

# pyarrow.Table
table = reader.read_all()

# Convert to pandas
df = table.to_pandas()
For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of this presentation for important disclosure information.
Making DataFrame.toPandas faster
• Background
• Spark’s toPandas transfers the in-memory result set from the Spark driver to Python and converts it to a pandas.DataFrame. It is very slow
• Joint work with Bryan Cutler (IBM), Li Jin (Two Sigma), and Yin Xusen (IBM). See SPARK-13534 on JIRA
• Test case: transfer 128MB Parquet file with 8 DOUBLE columns
conda install pyarrow -c conda-forge
Making DataFrame.toPandas faster
df = sqlContext.read.parquet('example2.parquet')
df = df.cache()
df.count()

Then:

%%prun -s cumulative
dfs = [df.toPandas() for i in range(5)]
Making DataFrame.toPandas faster
94483943 function calls (94478223 primitive calls) in 62.492 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5    1.458    0.292   62.492   12.498 dataframe.py:1570(toPandas)
        5    0.661    0.132   54.759   10.952 dataframe.py:382(collect)
 10485765    0.669    0.000   46.823    0.000 rdd.py:121(_load_from_socket)
      715    0.002    0.000   46.139    0.065 serializers.py:141(load_stream)
      710    0.002    0.000   45.950    0.065 serializers.py:448(loads)
 10485760    4.969    0.000   32.853    0.000 types.py:595(fromInternal)
     1391    0.004    0.000    7.445    0.005 socket.py:562(readinto)
       18    0.000    0.000    7.283    0.405 java_gateway.py:1006(send_command)
        5    0.000    0.000    6.262    1.252 frame.py:943(from_records)
Making DataFrame.toPandas faster
Now, using pyarrow:

%%prun -s cumulative
dfs = [df.toPandas(useArrow) for i in range(5)]
Making DataFrame.toPandas faster
38585 function calls (38535 primitive calls) in 9.448 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5    0.001    0.000    9.448    1.890 dataframe.py:1570(toPandas)
        5    0.000    0.000    9.358    1.872 dataframe.py:394(collectAsArrow)
     6271    9.330    0.001    9.330    0.001 {method 'recv_into' of '_socket.socket'}
       15    0.000    0.000    9.229    0.615 java_gateway.py:860(send_command)
       10    0.000    0.000    0.123    0.012 serializers.py:141(load_stream)
        5    0.085    0.017    0.089    0.018 {method 'to_pandas' of 'pyarrow.Table'}
pip install memory_profiler
%%memit -i 0.0001
pdf = None
pdf = df.toPandas()
gc.collect()

peak memory: 1223.16 MiB, increment: 1018.20 MiB
Plot thickens: memory use
%%memit -i 0.0001
pdf = None
pdf = df.toPandas(useArrow=True)
gc.collect()

peak memory: 334.08 MiB, increment: 258.31 MiB
Summary of results
• Current version: average 12.5 s (10.2 MB/s)
  • Deserialization accounts for 88% of time; the rest is waiting for Spark to send the data
  • Peak memory use 8x (~1 GB) the size of the dataset
• Arrow version
  • Average wall clock time of 1.89 s (6.61x faster, 67.7 MB/s)
  • Deserialization accounts for 1% of total time
  • Peak memory use 2x the size of the dataset (1 memory doubling)
  • Time for Spark to send data 25% higher (1866 ms vs 1488 ms)
Aside: reading Parquet directly in Python
import pyarrow.parquet as pq

%%timeit
df = pq.read_table('example2.parquet').to_pandas()

10 loops, best of 3: 175 ms per loop
Digging deeper
• Why does it take Spark ~1.8 seconds to send 128MB of data over the wire?
val collectedRows = queryExecution.executedPlan.executeCollect()  // Array[InternalRow]
cnvtr.internalRowsToPayload(collectedRows, this.schema)
Digging deeper
• In our 128MB test case, on average:
• 75% of the time is spent collecting Array[InternalRow] from the task executors
• 25% of the time is spent on a single-threaded conversion of all the data from Array[InternalRow] to ArrowRecordBatch
• We can go much faster by performing the Spark SQL -> Arrow conversion locally on the task executors, then streaming the batches to Python
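The proposed dataflow can be sketched in pure Python (an illustration of the idea, not Spark's actual implementation):

```python
def rows_to_batch(rows):
    """Executor-local conversion: a partition of rows -> one columnar batch."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def stream_batches(partitions):
    """Driver side: forward each executor's batch without re-converting.
    The single-threaded driver bottleneck disappears because conversion
    already happened in parallel on the executors."""
    for partition in partitions:
        yield rows_to_batch(partition)

partitions = [
    [{"x": 1}, {"x": 2}],  # partition held by executor 1
    [{"x": 3}, {"x": 4}],  # partition held by executor 2
]
batches = list(stream_batches(partitions))
assert batches == [{"x": [1, 2]}, {"x": [3, 4]}]
```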
Future architecture
[Diagram: each task executor converts its partition to an Arrow RecordBatch; the Spark driver sends the Arrow schema, and the record batches stream directly to Python]
Hot off the presses
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5    0.000    0.000    5.928    1.186 dataframe.py:1570(toPandas)
        5    0.000    0.000    5.838    1.168 dataframe.py:394(collectAsArrow)
     5919    0.005    0.000    5.824    0.001 socket.py:561(readinto)
     5919    5.809    0.001    5.809    0.001 {method 'recv_into' of '_socket.socket'}
      ...
        5    0.086    0.017    0.091    0.018 {method 'to_pandas' of 'pyarrow.Table'}
Patch from February 8: 38% perf improvement
The work ahead
• Luckily, speeding up toPandas and speeding up lambda / UDF functions are architecturally the same type of problem
• Reasonably clear path to making toPandas even faster
• How can you get involved?
• Keep an eye on Spark ASF JIRA
• Contribute to Apache Arrow (Java, C++, Python, other languages)
• Join the Arrow and Spark mailing lists
Thank you
• Bryan Cutler, Li Jin, and Yin Xusen, for building the Spark-Arrow integration
• Apache Arrow community
• Spark Summit organizers
• Two Sigma and IBM, for supporting this work