GPU Computing With Apache Spark And Python
![Page 1: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/1.jpg)
GPU Computing With Apache Spark And Python
Siu Kwan Lam Continuum Analytics
![Page 2: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/2.jpg)
• I’m going to use Anaconda throughout this presentation.
• Anaconda is a free Mac/Win/Linux Python distribution:
  • Based on conda, an open source package manager
  • Installs both Python and non-Python dependencies
  • Easiest way to get the software I will talk about today
• https://www.continuum.io/downloads
![Page 3: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/3.jpg)
Overview
• Why Python?
• Using GPU in PySpark
  • An example: Image registration
  • Accelerate: Drop-in GPU-accelerated functions
  • Numba: JIT Custom GPU-accelerated functions
• Tips & Tricks
![Page 4: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/4.jpg)
WHY PYTHON?
![Page 5: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/5.jpg)
Why is Python so popular?
• Straightforward, productive language for system administrators, programmers, scientists, analysts and hobbyists
• Great community:
• Lots of tutorial and reference materials
• Vast ecosystem of useful libraries
• Easy to interface with other languages
![Page 6: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/6.jpg)
But… Python is slow!
![Page 7: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/7.jpg)
But… Python is slow!
• Pure, interpreted Python is slow.
• Python excels at interfacing with other languages used in HPC:
• C: ctypes, CFFI, Cython
• C++: Cython, Boost.Python
• FORTRAN: f2py
• Secret: Most scientific Python packages put the speed critical sections of their algorithms in a compiled language.
![Page 8: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/8.jpg)
Is there another way?
• Switching languages for speed in your projects can be a little clunky
• Generating compiled functions for the wide range of data types can be tedious
• How can we use cutting edge hardware, like GPUs?
![Page 9: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/9.jpg)
IMAGE REGISTRATION
An example for using GPU in PySpark
![Page 10: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/10.jpg)
Image Registration
• An experiment to demonstrate GPU usage
• The problem:
  • stitch image fragments
  • fragments are randomly orientated, translated and scaled.
  • phase-correlation for image registration
  • FFT heavy
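Phase correlation, the FFT-heavy step named on this slide, can be sketched in plain NumPy for the translation-only case. This is an illustration under assumptions, not the code from the talk: the function name `phase_correlate` is hypothetical, and recovering rotation and scale is left out.

```python
import numpy as np


def phase_correlate(ref, moved):
    """Estimate the integer (row, col) shift that turns `ref` into `moved`."""
    f_ref = np.fft.fft2(ref)
    f_moved = np.fft.fft2(moved)
    # Cross-power spectrum, normalized so that only phase differences remain.
    cross = f_moved * np.conj(f_ref)
    cross /= np.maximum(np.abs(cross), 1e-12)
    # The inverse transform peaks at the translation between the two images.
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative (wrapped-around) shifts.
    return tuple(int(p) if p <= n // 2 else int(p) - n
                 for p, n in zip(peak, corr.shape))


rng = np.random.default_rng(0)
image = rng.random((64, 64))
shifted = np.roll(image, shift=(3, -5), axis=(0, 1))
print(phase_correlate(image, shifted))
```

Each registration costs two forward 2D FFTs and one inverse 2D FFT, which is why the FFT dominates the run time.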
![Page 11: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/11.jpg)
Basic Algorithm
[Flowchart: image set → Group similar images → Image registration within each group → unused images and new images → Progress?]
![Page 12: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/12.jpg)
Basic Algorithm
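[Flowchart: image set → Group similar images → Image registration within each group → unused images and new images → Progress?]
• Group similar images
  • KMeans clustering
  • Histogram as metric
• Image registration within each group
  • FFT 2D heavy
  • Expensive
  • Potential speedup on GPU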
![Page 13: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/13.jpg)
Setup
• conda can create a local environment for Spark for you:
conda create -n spark -c anaconda-cluster python=3.5 spark \
    accelerate ipython-notebook
source activate spark
IPYTHON_OPTS="notebook" ./bin/pyspark # starts jupyter notebook
![Page 14: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/14.jpg)
Performance Bottleneck
• Most of the time spent in 2D FFT
![Page 15: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/15.jpg)
ACCELERATE: DROP-IN GPU-ACCELERATED FUNCTIONS
![Page 16: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/16.jpg)
Accelerate
• Commercially licensed
• Hardware optimized numerical functions
• SIMD optimized via MKL
• GPU accelerated via CUDA
![Page 17: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/17.jpg)
CUDA Library Bindings: cuFFT
>2x speedup incl. host<->device round trip on GeForce GT 650M
MKL accelerated FFT
![Page 18: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/18.jpg)
CPS with GPU drop-in
• Replace numpy FFT with accelerate version
![Page 19: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/19.jpg)
CPS with GPU drop-in
• Replace numpy FFT with accelerate version
CPU-GPU transfer
CPU-GPU transfer
![Page 20: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/20.jpg)
NUMBA: JIT CUSTOM GPU-ACCELERATED FUNCTIONS
![Page 21: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/21.jpg)
Numba
• Open source licensed
• A Python JIT as a CPython library
• Array/numerical subset
• Targets CPU and GPU
![Page 22: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/22.jpg)
Supported Platforms
• OS: Windows (7 and later); OS X (10.7 and later); Linux (~RHEL 5 and later)
• HW: 32 and 64-bit x86 CPUs; CUDA-capable NVIDIA GPUs; HSA-capable AMD GPUs; experimental support for ARMv7 (Raspberry Pi 2)
• SW: Python 2 and 3; NumPy 1.7 through 1.11
![Page 23: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/23.jpg)
How Does Numba Work?
@jit
def do_math(a, b):
    …
>>> do_math(x, y)
[Diagram: Python Function (bytecode) → Bytecode Analysis → Numba IR → Rewrite IR → Type Inference (using the Function Arguments) → Lowering → LLVM IR → LLVM/NVVM JIT → Machine Code → Cache → Execute!]
![Page 24: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/24.jpg)
Ufuncs—Map operation for ND arrays
![Page 25: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/25.jpg)
Ufuncs: Map operation for ND arrays
• Decorator for creating ufunc
• List of supported type signatures
• Code generation target
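The three callouts on this slide (decorator, type signatures, code generation target) map onto Numba's `vectorize` decorator roughly as follows. This is a sketch, not the slide's code: the function `rel_diff` is a made-up example, the target is set to `"cpu"` so it runs without a GPU, and a NumPy fallback is included as an assumption for machines without Numba.

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Assumption: fall back to np.vectorize so the sketch runs without Numba.
    def vectorize(signatures, target="cpu"):
        return np.vectorize


# Decorator for creating a ufunc, with the list of supported type signatures
# and the code generation target (target="cuda" would build a GPU ufunc).
@vectorize(["float32(float32, float32)", "float64(float64, float64)"],
           target="cpu")
def rel_diff(a, b):
    return (a - b) / (a + b)


x = np.arange(1, 9, dtype=np.float32)
y = x[::-1].copy()
print(rel_diff(x, y))  # applied element by element, like any NumPy ufunc
```

The scalar function body stays the same for every target; only the `target` argument changes where the generated code runs.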
![Page 26: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/26.jpg)
GPU Ufuncs Performance
4x speedup incl. host<->device round trip on GeForce GT 650M
![Page 27: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/27.jpg)
Numba in Spark
• Compiles to IR on client
  • Or not if type information is not available yet
• Send IR to workers
• Finalize to machine code on workers
![Page 28: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/28.jpg)
CPS with cuFFT + GPU ufuncs
• cuFFT
• explicit memory transfer
![Page 29: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/29.jpg)
TIPS & TRICKS
![Page 30: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/30.jpg)
Operate in Batches
• GPUs have many cores
• Best to do many similar tasks at once
• GPU kernel launch has overhead
• prefer mapPartitions, mapValues over map
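A per-partition function lets one call process a whole batch instead of one element at a time. The sketch below uses NumPy's batched `fft2` as a stand-in for a GPU call. The name `fft_partition` is hypothetical, and it assumes that all images in a partition share the same shape.

```python
import numpy as np


def fft_partition(images):
    """Process one partition: transform all of its images in a single call."""
    images = list(images)
    if not images:                    # Spark partitions can be empty
        return iter(())
    batch = np.stack(images)          # shape: (n_images, height, width)
    spectra = np.fft.fft2(batch)      # 2D FFT over the last two axes, batched
    return iter(spectra)              # yields one spectrum per input image


# Local check with an ordinary iterator standing in for a Spark partition.
rng = np.random.default_rng(0)
partition = [rng.random((16, 16)) for _ in range(4)]
results = list(fft_partition(iter(partition)))
print(len(results), results[0].shape)
```

In a PySpark session with a SparkContext `sc`, passing this function to `sc.parallelize(images, 8).mapPartitions(fft_partition)` would invoke it once per partition, so any fixed launch or transfer cost is paid per batch rather than per image.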
![Page 31: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/31.jpg)
Under-utilization of GPU
• PySpark spawns 1 Python process per core
• Only 1 CUDA process per GPU at a time
• Easy to under-utilize the GPU
• GPU context-switching between processes
![Page 32: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/32.jpg)
Under-utilization of GPU (Fix)
• nvidia-cuda-mps-control
  • Originally for MPI
  • Allows multiple processes per GPU
  • Reduces per-process overhead
  • Increases GPU utilization
• 10-15% speedup in our experiment
![Page 33: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/33.jpg)
Summary
• Anaconda:
  • creates Spark environment for experimentation
  • manages Python packages for use in Spark
• Accelerate:
  • Pre-built GPU functions within PySpark
• Numba:
  • JIT custom GPU functions within PySpark
![Page 35: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/35.jpg)
Extras
![Page 36: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/36.jpg)
NUMBA: A PYTHON JIT COMPILER
![Page 37: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/37.jpg)
Compiling Python
• Numba is a type-specializing compiler for Python functions
• Can translate Python syntax into machine code if all type information can be deduced when the function is called.
• Code generation done with:
– LLVM (for CPU)
– NVVM (for CUDA GPUs).
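Type specialization can be observed directly: each new combination of argument types triggers a separate compilation. The sketch below uses a made-up function `scaled_sum`. As an assumption, it falls back to plain Python when Numba is not installed, in which case no specializations are recorded.

```python
try:
    from numba import jit
except ImportError:
    # Assumption: without Numba installed, run the function as plain Python.
    def jit(func):
        return func


@jit
def scaled_sum(a, b):
    return a * b + a


print(scaled_sum(2, 3))      # first call with integers compiles an integer version
print(scaled_sum(2.5, 3.0))  # first call with floats compiles a float version
# One entry per compiled specialization (an empty list under the fallback).
print(getattr(scaled_sum, "signatures", []))
```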
![Page 38: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/38.jpg)
How Does Numba Work?
@jit
def do_math(a, b):
    …
>>> do_math(x, y)
[Diagram: Python Function (bytecode) → Bytecode Analysis → Numba IR → Rewrite IR → Type Inference (using the Function Arguments) → Lowering → LLVM IR → LLVM/NVVM JIT → Machine Code → Cache → Execute!]
![Page 39: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/39.jpg)
Numba on the CPU
![Page 40: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/40.jpg)
Numba on the CPU
• Numba decorator (nopython=True not required)
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Returning a slice of the array
• 2.7x speedup!
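The slide's code listing did not survive in this transcript. A function consistent with its callouts might look like the sketch below. It is a reconstruction under assumptions, not the original listing: the name `nan_compact` and the test data are illustrative, and the plain-Python fallback exists only so the sketch runs without Numba.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Assumption: without Numba installed, run the function as plain Python.
    def jit(func):
        return func


@jit                                 # Numba decorator
def nan_compact(x):
    out = np.empty_like(x)           # array allocation
    out_index = 0
    for element in x:                # looping over ndarray x as an iterator
        if not np.isnan(element):    # using a NumPy math function
            out[out_index] = element
            out_index += 1
    return out[:out_index]           # returning a slice of the array


a = np.array([1.0, np.nan, 2.0, np.nan, 3.0])
print(nan_compact(a))
```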
![Page 41: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/41.jpg)
CUDA Kernels in Python
![Page 42: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/42.jpg)
CUDA Kernels in Python
• Decorator will infer type signature when you call it
• NumPy arrays have expected attributes and indexing
• Helper function to compute blockIdx.x * blockDim.x + threadIdx.x
• Helper function to compute blockDim.x * gridDim.x
![Page 43: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/43.jpg)
Calling the Kernel from Python
Works just like CUDA C, except we handle allocating and copying data to/from the host if needed
![Page 44: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/44.jpg)
Handling Device Memory Directly
Memory allocation matters in small tasks.
![Page 45: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/45.jpg)
NUMBA IN SPARK
![Page 46: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/46.jpg)
[Diagram: PySpark architecture. Client: SparkContext (Python) linked via Py4J to SparkContext (Java). Cluster: two Spark Workers (Java), each linked via Py4J to a Python Interpreter.]
![Page 47: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/47.jpg)
Using Numba with Spark
![Page 48: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/48.jpg)
Using CUDA Python with Spark
• conda can create a local environment for Spark for you:
• Define CUDA kernel (compilation happens here)
• Wrap CUDA kernel launching logic
• Creates Spark RDD (8 partitions)
• Apply gpu_work on each partition
![Page 49: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/49.jpg)
Compilation and Code Delivery
[Diagram: Client: CUDA Python Kernel → (Compiles) → LLVM IR → PTX → (Serialize). Cluster: (Deserialize) → LLVM IR, PTX → (Finalize) → CUDA Binary.]
![Page 50: GPU Computing With Apache Spark And Python](https://reader033.fdocuments.us/reader033/viewer/2022051101/58f9a950760da3da068b6d5e/html5/thumbnails/50.jpg)
Compilation and Code Delivery
[Diagram: Client: CUDA Python Kernel → (Compiles) → LLVM IR → PTX → (Serialize). Cluster: (Deserialize) → LLVM IR, PTX → (Finalize) → CUDA Binary.]
This happens on every worker process