Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Posted on 16-Apr-2017


Transcript of Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Enabling Apache Zeppelin* and Spark* for Data Science in the Enterprise

Bikas Saha (@bikassaha)

*Apache Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Zeppelin and the Hadoop elephant logo are trademarks of the Apache Software Foundation.


Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Making Apache Zeppelin enterprise ready

Future Roadmap


Apache Zeppelin


Zeppelin makes Big Data Science Easy to Approach

Zero install: just connect via a web browser and you are ready to run

Support for multiple execution platforms (Apache Spark, JDBC, Hive…)

Support for multiple languages (Scala, SQL, Python…)

Support for built-in visualizations

Support for reporting

Support for sharing and collaborative work

Zeppelin does NOT have machine learning built in; that is where Apache Spark comes in (or your favorite SQL engine, such as Apache Flink, Drill, or Hive, among 30+ others)


Zeppelin for Sharing


Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Making Apache Zeppelin enterprise ready

Future Roadmap


Current Apache Zeppelin and Spark integration

[Diagram: User, Zeppelin Server, Spark Driver, and eight Spark Executors]


Architectural Issue with Secure Data Access

[Diagram: User 1, Zeppelin Server, Spark Driver, eight Spark Executors, and HDFS; the HDFS access is labeled "Zeppelin Server User"]


Architectural Issues with Multi-Tenancy – Fault Tolerance

[Diagram: User 1, User 2, Zeppelin Server, Spark Driver, and eight Spark Executors]

User 1 failure affects User 2

Heavy-weight Spark drivers


Architectural Issues with Multi-Tenancy – Privacy

[Diagram: User 1, User 2, Zeppelin Server, Spark Driver, and seven Spark Executors]

User 1 can access User 2 data


Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Enterprise Ready Big Data Science

Future Roadmap


Livy Server as a Session Management Service

[Diagram: Livy Server exposing an Interactive REST API and a Batch REST API; Session; Remote Spark Driver with Remote Context; Standard Spark Batch Job; four Spark Executors]
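The two REST entry points named on this slide can be sketched as plain HTTP requests. The following is a minimal illustration, not the presenter's code: it builds the requests without sending them, using the endpoint paths and JSON fields from the public Livy REST API (`/sessions`, `/sessions/{id}/statements`, `/batches`). The host name and the helper names (`livy_post`, `create_session`, `run_statement`, `submit_batch`) are assumptions made for the example.

```python
import json
from urllib.request import Request

# Assumed Livy endpoint: the host is a placeholder, 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def livy_post(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind="pyspark"):
    """Interactive API: start a session backed by a remote Spark driver."""
    return livy_post("/sessions", {"kind": kind})


def run_statement(session_id, code):
    """Interactive API: run a code snippet inside an existing session."""
    return livy_post(f"/sessions/{session_id}/statements", {"code": code})


def submit_batch(file, class_name=None):
    """Batch API: submit a standard Spark batch job."""
    payload = {"file": file}
    if class_name is not None:
        payload["className"] = class_name
    return livy_post("/batches", payload)
```

Passing any of these request objects to `urllib.request.urlopen` would send it to a running Livy server; the sketch stops short of that so it can be read and run without a cluster.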


Secure Data Access - Solved

[Diagram: User, Zeppelin Server with Livy Interpreter, Livy Server, Session, Remote Spark Driver with Remote Context, two Spark Executors, and HDFS; the HDFS access is labeled "User"]
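One way to read this slide is that the end user's identity travels with the session, so HDFS sees the real user rather than the Zeppelin server account. The Livy REST API accepts a `proxyUser` field when a session is created; the sketch below builds such a request body. The function name and the user name are hypothetical, and whether impersonation is actually permitted depends on how Livy and Hadoop are configured on the cluster, which is not covered here.

```python
import json


def impersonated_session_body(end_user, kind="spark"):
    """Return the JSON body for POST /sessions, run on behalf of end_user."""
    if not end_user:
        # Refuse to fall back to the service account, which would recreate
        # the problem of every user reaching HDFS as the server user.
        raise ValueError("an authenticated end user is required")
    # proxyUser asks Livy to start the session as this user.
    return json.dumps({"kind": kind, "proxyUser": end_user}, sort_keys=True)


# "alice" is a made-up user name for illustration.
body = impersonated_session_body("alice")
```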


Multi-Tenancy - Solved

[Diagram: User 1 and User 2 each have their own Livy Interpreter in the Zeppelin Server; the Livy Server holds Session 1 and Session 2, each with its own Remote Spark Driver, Remote Context, and Spark Executor]
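The isolation idea on this slide, one session per user so that one user's failure does not touch another's, can be illustrated with a toy in-memory model. This is a hypothetical sketch for intuition only: `SessionRegistry` is not part of Livy or Zeppelin, and real sessions are remote Spark drivers rather than integers.

```python
from itertools import count


class SessionRegistry:
    """Toy model of per-user sessions (hypothetical, not Livy's own code)."""

    def __init__(self):
        self._next_id = count(1)   # session ids 1, 2, 3, ...
        self._by_user = {}         # user name -> session id

    def session_for(self, user):
        """Return the user's session id, creating a session on first use."""
        if user not in self._by_user:
            self._by_user[user] = next(self._next_id)
        return self._by_user[user]

    def fail(self, user):
        """Simulate a driver crash: only this user's session is dropped."""
        self._by_user.pop(user, None)


registry = SessionRegistry()
user1 = registry.session_for("user1")
user2 = registry.session_for("user2")
registry.fail("user1")
# User 2 keeps the same session even though User 1's session failed.
assert registry.session_for("user2") == user2
```

Contrast this with the earlier shared-driver layout, where both users would map to the same entry and a single failure would remove it for everyone.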


Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Making Apache Zeppelin enterprise ready

Future Roadmap


Near Term Improvements

Session Management

Debuggability

Unified session for all languages

Better visualizations for Machine Learning

Support for Spark 2.0


Long Term Improvements

Controlled sharing of sessions for collaboration

Data exploration and browsing with metadata

Taking the model from training to production


Thank You