Spark Summit - Mobius C# Binding for Apache Spark
Transcript of Spark Summit - Mobius C# Binding for Apache Spark
![Page 1: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/1.jpg)
MOBIUS: C# BINDING FOR SPARK
Kaarthik Sivashanmugam
Microsoft
@kaarthikss
![Page 2: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/2.jpg)
Quick Background
• Business Scenario: Next-gen near real-time processing of Bing.com logs
  – Size of raw logs: TBs per hour
  – C# library for processing ~ in use for several years
• Yesterday’s talk “Five Lessons Learned in Building Streaming Applications at Microsoft Bing Scale” covers this scenario & challenges
![Page 3: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/3.jpg)
C# API - Motivations
• Enable organizations invested deeply in .NET to build Apache Spark applications in C#
• Reuse of existing .NET libraries in Spark applications
![Page 4: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/4.jpg)
Why Yet Another Language Binding
[Charts: Spark Survey 2015 Results, panels "Fastest Growing Areas from 2014 to 2015" and "Most Important Aspects of Spark"]
Popularity of C#
• StackOverflow.com Developer Survey
• RedMonk Programming Language Rankings
.NET ecosystem ~ enabling languages like F#
![Page 5: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/5.jpg)
C# API - Goal
Make C# a first-class language for building Apache Spark applications
![Page 6: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/6.jpg)
Word Count Example in C#
Scala
C#
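A minimal sketch of the word-count logic, written with plain .NET LINQ over a local collection rather than the Mobius API. The names `WordCountSketch` and `Count` are hypothetical; the comments only indicate which Spark-style step each LINQ call stands in for.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountSketch
{
    // Hypothetical helper: counts words across lines on a local collection,
    // mirroring flatMap -> map to (word, 1) -> reduce by key.
    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        return lines
            // flatMap: split every line into words
            .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            // map: pair each word with an initial count of 1
            .Select(word => new KeyValuePair<string, int>(word, 1))
            // reduce by key: sum the counts of identical words
            .GroupBy(pair => pair.Key)
            .ToDictionary(group => group.Key, group => group.Sum(pair => pair.Value));
    }

    public static void Main()
    {
        var counts = Count(new[] { "to be or not to be", "be quick" });
        foreach (var entry in counts.OrderBy(e => e.Key))
        {
            Console.WriteLine($"{entry.Key}: {entry.Value}");
        }
    }
}
```

In a Spark application the same three steps would run as distributed RDD transformations instead of in-memory LINQ calls.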
![Page 7: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/7.jpg)
Kafka Example in C#
• Initialize StreamingContext & Checkpoint
• Create Kafka DStream
• Use DStream transformations to count logs by loglevel within a time window
• Save log count
• Start stream processing
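The counting step can be illustrated without Spark or Kafka. The sketch below assumes a hypothetical record shape (a timestamp in seconds plus a log level) and uses fixed, non-overlapping windows, which is a simplification of DStream windowing; `WindowedLogCountSketch` and `CountByLevel` are illustrative names, not part of Mobius.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WindowedLogCountSketch
{
    // Groups log records into fixed windows of windowSeconds and counts
    // how many records of each level fall into each window.
    public static Dictionary<(long Window, string Level), int> CountByLevel(
        IEnumerable<(long TimestampSeconds, string Level)> logs, long windowSeconds)
    {
        return logs
            // Window index = timestamp divided by the window length (integer division)
            .GroupBy(log => (Window: log.TimestampSeconds / windowSeconds, log.Level))
            .ToDictionary(group => group.Key, group => group.Count());
    }

    public static void Main()
    {
        var logs = new[] { (0L, "INFO"), (5L, "ERROR"), (12L, "INFO"), (14L, "INFO") };
        var counts = CountByLevel(logs, 10);
        foreach (var entry in counts.OrderBy(e => e.Key.Window).ThenBy(e => e.Key.Level))
        {
            Console.WriteLine($"window {entry.Key.Window}, {entry.Key.Level}: {entry.Value}");
        }
    }
}
```

In a streaming job the windows would be computed continuously over incoming batches; here the whole input is available up front, which keeps the example self-contained.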
![Page 8: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/8.jpg)
Mobius: C# API for Spark
Scala/Java API
SparkR PySpark
C# API
Apache Spark
Spark Apps in C#
![Page 9: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/9.jpg)
Develop & Launch Mobius Applications
1. Add reference to Mobius package in NuGet
2. Develop, debug, test Mobius driver application
3. Build Mobius driver
Spark Client
A. Get Mobius release
B. Get Mobius driver and dependencies
C. Run sparkclr-submit.cmd or sparkclr-submit.sh (runs Spark job)
Example:
sparkclr-submit.cmd --master spark://IP:PORT --total-executor-cores 200 --executor-memory 12g --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://nn/path/to/eventlog --exe Pi.exe D:\Mobius\examples\Pi
![Page 10: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/10.jpg)
Mobius & Spark
[Diagram: on the driver, a C# Driver (CLR) talks over IPC sockets to SparkContext (JVM); on each worker, a C# Worker (CLR) talks over IPC sockets to a Spark Executor (JVM)]
Mobius can be used with any existing Spark cluster (Standalone, YARN) in Windows & Linux
![Page 11: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/11.jpg)
Mobius in Linux
• Mono (open source implementation of .NET framework) used for C# with Spark in Linux
• Mobius project CI (build, unit & functional tests) in Ubuntu
• Users reported using Mobius in Ubuntu, CentOS, OSX
• Mobius validated with Spark clusters in Azure HDInsight and Amazon Web Services EMR
• More info at linux-instructions.md @ GitHub
![Page 12: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/12.jpg)
Project Info
• https://github.com/Microsoft/Mobius (contributions welcome!)
• MIT license
• Discussions
  – StackOverflow: tag “SparkCLR”
  – Gitter: https://gitter.im/Microsoft/Mobius
  – Twitter: @MobiusForSpark
![Page 13: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/13.jpg)
Project Status
• Past Releases
  – v1.5.200 (Spark 1.5.2)
  – v1.6.100 (Spark 1.6.1)
• Upcoming Release
  – v2.0.000 (Spark 2.0.0)
• Work in progress
  – Support for interactive scenarios (Zeppelin/Jupyter integration)
  – Exploration of support for ML scenarios
  – Idiomatic F# API
![Page 14: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/14.jpg)
UNDER THE HOOD
![Page 15: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/15.jpg)
CSharpRDD
• C# operations use CSharpRDD which needs CLR to execute
  – If no C# transformation or UDF, CLR is not needed ~ execution is entirely JVM-based
• RDD<byte[]>
  – Data is stored as serialized objects and sent to C# worker process
• Transformations are pipelined when possible
  – Avoids unnecessary serialization & deserialization within a stage
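The benefit of pipelining can be sketched with a small stand-alone program. It is an illustration only: `PipelineSketch`, its UTF-8 string "serialization", and the call counters are assumptions made for the example and do not reflect how Mobius serializes data.

```csharp
using System;
using System.Text;

public static class PipelineSketch
{
    // Counters that make the number of Ser/De round trips visible.
    public static int DeserializeCalls;
    public static int SerializeCalls;

    static string Deserialize(byte[] bytes)
    {
        DeserializeCalls++;
        return Encoding.UTF8.GetString(bytes);
    }

    static byte[] Serialize(string value)
    {
        SerializeCalls++;
        return Encoding.UTF8.GetBytes(value);
    }

    // Not pipelined: each transformation deserializes its input and
    // serializes its output, so N steps cost N round trips.
    public static byte[] RunSeparately(byte[] input, params Func<string, string>[] steps)
    {
        var current = input;
        foreach (var step in steps)
        {
            current = Serialize(step(Deserialize(current)));
        }
        return current;
    }

    // Pipelined: deserialize once, apply every step in memory, serialize once.
    public static byte[] RunPipelined(byte[] input, params Func<string, string>[] steps)
    {
        var value = Deserialize(input);
        foreach (var step in steps)
        {
            value = step(value);
        }
        return Serialize(value);
    }

    public static void Main()
    {
        var input = Encoding.UTF8.GetBytes("  warning  ");
        Func<string, string> trim = s => s.Trim();
        Func<string, string> upper = s => s.ToUpperInvariant();

        RunSeparately(input, trim, upper);
        Console.WriteLine($"separate: {DeserializeCalls} deserialize, {SerializeCalls} serialize");

        DeserializeCalls = 0;
        SerializeCalls = 0;
        RunPipelined(input, trim, upper);
        Console.WriteLine($"pipelined: {DeserializeCalls} deserialize, {SerializeCalls} serialize");
    }
}
```

Both methods produce the same bytes; only the number of serialization round trips differs, which is the cost the slide says pipelining avoids within a stage.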
![Page 16: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/16.jpg)
Driver-side Interop
1. Launch: sparkclr-submit.cmd or sparkclr-submit.sh starts CSharpRunner (JVM)
2. CSharpBackend: launch Netty server creating proxy for JVM calls
3. C# Driver (CLR): launch C# process using port number from CSharpBackend
Interop Components
• C# side (CLR): proxies for JVM objects ~ SparkConf, SparkContext, RDD, DataFrame, DStream, PipelinedRDD, …
• JVM side: SparkConf, SparkContext, RDD, DataFrame, DStream, …, CSharpRDD
• The C# proxies create and manage the JVM objects and invoke JVM methods, so the JVM mirrors C#-side operations
![Page 17: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/17.jpg)
Worker-side Interop
[Diagram: Spark Worker containing an Executor (JVM) with CSharpRDD, alongside CSharpWorker.exe (CLR)]
1. Compute (CSharpRDD in the Executor)
2. Launch CSharpWorker.exe
3. Read bytes
4. Execute C# operation
5. Write bytes
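The read bytes, execute, write bytes cycle can be sketched as a loop over two streams. The framing used here (an Int32 length prefix per record and a negative length as end marker) is an assumption chosen for the example, not the actual Mobius wire protocol, and `WorkerLoopSketch` and its methods are hypothetical names. In-memory streams stand in for the sockets.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WorkerLoopSketch
{
    // Reads length-prefixed records, applies the operation, writes the results.
    public static void Process(Stream input, Stream output, Func<byte[], byte[]> operation)
    {
        var reader = new BinaryReader(input);
        var writer = new BinaryWriter(output);
        while (true)
        {
            int length = reader.ReadInt32();            // read bytes: record length
            if (length < 0) break;                      // assumed end-of-data marker
            byte[] payload = reader.ReadBytes(length);  // read bytes: record payload
            byte[] result = operation(payload);         // execute C# operation
            writer.Write(result.Length);                // write bytes: length prefix
            writer.Write(result);                       // write bytes: payload
        }
        writer.Write(-1);                               // tell the reader we are done
        writer.Flush();
    }

    // Test helper: frames strings using the same assumed format.
    public static MemoryStream Encode(IEnumerable<string> records)
    {
        var stream = new MemoryStream();
        var writer = new BinaryWriter(stream);
        foreach (var record in records)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(record);
            writer.Write(bytes.Length);
            writer.Write(bytes);
        }
        writer.Write(-1);
        writer.Flush();
        stream.Position = 0;
        return stream;
    }

    // Test helper: reads framed strings back until the end marker.
    public static List<string> Decode(Stream stream)
    {
        var reader = new BinaryReader(stream);
        var records = new List<string>();
        for (int length = reader.ReadInt32(); length >= 0; length = reader.ReadInt32())
        {
            records.Add(Encoding.UTF8.GetString(reader.ReadBytes(length)));
        }
        return records;
    }

    public static void Main()
    {
        var output = new MemoryStream();
        Process(Encode(new[] { "info", "error" }), output,
            bytes => Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(bytes).ToUpperInvariant()));
        output.Position = 0;
        foreach (var record in Decode(output))
        {
            Console.WriteLine(record);
        }
    }
}
```

Because the worker only ever sees byte arrays, the JVM side never needs to understand the C# objects; it just moves opaque payloads in and out.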
![Page 18: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/18.jpg)
Worker Optimization Options
• Multi-threaded ~ to avoid expensive fork-process when executing a Task
  [Diagram: Spark Worker with a single CSharpWorker.exe (CLR) running Thread1, Thread2, …, Threadn]
• Multi-proc ~ for higher throughput in executing Tasks
  [Diagram: Spark Worker with several CSharpWorker.exe processes, each with its own CLR and threads T1 … Tn]
![Page 19: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/19.jpg)
Performance Considerations
• Map & Filter RDD operations in C# require serialization & deserialization of data ~ impacts performance
  – C# operations are pipelined when possible ~ minimizes Ser/De
  – Persistence is handled by JVM ~ checkpoint/cache on an RDD impacts pipelining for CLR operations
• DataFrame operations without C# UDFs do not require Ser/De
  – Perf will be the same as a native Scala-based Spark application
  – Execution plan optimization & code generation perf improvements in Spark are leveraged
![Page 20: Spark Summit - Mobius C# Binding for Apache Spark](https://reader033.fdocuments.us/reader033/viewer/2022051300/58a7c3e01a28ab6b5a8b533f/html5/thumbnails/20.jpg)
THANK YOU.
• Mobius is production-ready
• Use Mobius to build Apache Spark jobs in .NET
• Contribute to github.com/Microsoft/Mobius
• @MobiusForSpark