Enabling Apache Zeppelin* and Spark* for Data Science in the Enterprise

Bikas Saha, @bikassaha

*Apache Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Zeppelin and the Hadoop elephant logo are trademarks of the Apache Software Foundation.

Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Making Apache Zeppelin enterprise ready

Future Roadmap

Apache Zeppelin

Zeppelin makes Big Data Science Easy to Approach

Zero install – just connect via a web browser and you are ready to run
Support for multiple execution platforms (Apache Spark, JDBC, Hive…)
Support for multiple languages (Scala, SQL, Python…)
Support for built-in visualizations
Support for reporting
Support for sharing and collaborative work

Does NOT have machine learning built in – that’s where Apache Spark comes in (or your favorite engine: Apache Flink, Drill, Hive… and 30+ others)

Zeppelin for Sharing

Current Apache Zeppelin and Spark integration

[Diagram: Zeppelin Server, Spark Driver, Spark Executor]

Architectural Issue with Secure Data Access

[Diagram: Zeppelin Server, Spark Driver, Spark Executors. Labels: User 1; Zeppelin Server User]

Architectural Issues with Multi-Tenancy – Fault Tolerance

[Diagram: Zeppelin Server, Spark Driver, Spark Executor]

User 1 failure affects User 2

Heavy-weight Spark drivers

Architectural Issues with Multi-Tenancy – Privacy

[Diagram: Zeppelin Server, Spark Driver, Spark Executor]

User 1 can access User 2's data

Enterprise Ready Big Data Science

Livy Server as a Session Management Service

[Diagram: Livy Server; Interactive REST API; Batch REST API; Session; Remote Spark Driver; Remote Context; Standard Spark Batch Job; Spark Executor]
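The two REST entry points named on this slide can be sketched as plain HTTP requests. This is a minimal illustration, not the speaker's code: the host name, jar path, class name, and session id 0 are made-up placeholders, and the requests are only constructed here, never sent. The endpoint paths (`/sessions`, `/sessions/{id}/statements`, `/batches`) and the JSON fields shown follow Livy's documented REST API.

```python
import json
from urllib.request import Request

# Hypothetical Livy endpoint; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def livy_post(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Interactive REST API: create a session, then submit code to it.
create_session = livy_post("/sessions", {"kind": "spark"})
# Session id 0 is a placeholder; the real id comes from the create response.
run_statement = livy_post(
    "/sessions/0/statements",
    {"code": "sc.parallelize(1 to 10).sum()"},
)

# Batch REST API: submit a standard Spark batch job.
# The jar path and class name below are invented for illustration.
submit_batch = livy_post(
    "/batches",
    {"file": "hdfs:///apps/example.jar", "className": "com.example.Job"},
)
```

Passing any of these objects to `urllib.request.urlopen` would send it to a running Livy server.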

Secure Data Access - Solved

[Diagram: Zeppelin Server, Livy Interpreter, Livy Server, Session, Remote Spark Driver, Remote Context, Spark Executor]
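The slide does not spell out the mechanism, so the following is an assumption: Livy's session-creation request accepts a `proxyUser` field, and a client that forwards the logged-in user's name there would have the Spark job run as that user rather than as the Zeppelin server user. The helper name `session_request_for` is hypothetical.

```python
import json


def session_request_for(user, kind="spark"):
    """Return the JSON body for POST /sessions asking Livy to run as `user`.

    Assumption: the caller has already authenticated `user`; this helper
    only builds the request body and does not contact Livy.
    """
    if not user:
        raise ValueError("an authenticated user name is required")
    return json.dumps({"kind": kind, "proxyUser": user})


body = session_request_for("user1")
```

With a body like this, data-access checks on the cluster would apply to `user1` instead of a shared service account, which is the property the slide title claims.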

Multi-Tenancy - Solved

[Diagram: Zeppelin Server with two Livy Interpreters; Livy Server with Session 1 and Session 2; each session has its own Remote Spark Driver, Remote Context, and Spark Executor]
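The isolation idea in this diagram, one session with its own remote driver per user, can be modeled in a few lines. This is a toy model for illustration only, not Livy's implementation; `Session` and `SessionRegistry` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Stands in for one Livy session: its own remote driver and context."""
    owner: str
    alive: bool = True
    variables: dict = field(default_factory=dict)


class SessionRegistry:
    """Toy user-to-session bookkeeping (not Livy's actual code)."""

    def __init__(self):
        self._sessions = {}

    def session_for(self, user):
        # Create a fresh session the first time a user appears,
        # or when that user's previous session has died.
        session = self._sessions.get(user)
        if session is None or not session.alive:
            session = Session(owner=user)
            self._sessions[user] = session
        return session


registry = SessionRegistry()
s1 = registry.session_for("user1")
s2 = registry.session_for("user2")
s1.variables["secret"] = 42  # state lives only in user1's session
s1.alive = False             # simulate user1's driver crashing
```

After the simulated crash, `user2`'s session is untouched and never saw `user1`'s variables, which mirrors the fault-tolerance and privacy problems from the earlier slides being addressed by per-user sessions.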

Future Roadmap

Near Term Improvements

Session management
Debuggability
Unified session for all languages
Better visualizations for Machine Learning
Support for Spark 2.0

Long Term Improvements

Controlled sharing of sessions for collaboration
Data exploration and browsing with metadata
Taking the model from training to production

Thank You
