Apache Spark™ Applications the Easy Way - Pierre Borckmans

Writing Spark applications, the easy way

Pierre Borckmans

Data Science Meetup - Spark & Machine Learning - October, 27th 2016 - Brussels

The pivot...

Platform overview

Data pipeline overview

The journey... 3 paradigms for Spark application development

From hardcoded dataflows...

    // Key each CDR by its A field and group the records per subscriber
    val subscribers = cdrs.map( x => ( x.A.toLong, x ) ).groupByKey

    subscribers.mapValues(
      // For each CDR, evaluate every dimension of every category
      _.map( cdr => {
        for ( ( category, dimensions ) <- allDimensions ) yield (
          category,
          for ( dim <- dimensions ) yield {
            val fields = dim._1
            val values = dim._2
            if ( cdr.check( fields, values ) ) f( category )( cdr ) else f0( category )
          }
        )
      } ).reduce( ( m1, m2 ) => {
        // Merge two per-CDR results pairwise with the category's combiner g
        for ( ( category, l1 ) <- m1 ) yield {
          val l2 = m2( category )
          val d = l1.zip( l2 ).map( l => { g( category )( l._1, l._2 ) } )
          ( category, d )
        }
      } )
    )

...to fully interactive ones...

and back to code...

... with benefits !

Hardcoded dataflows

Dataflow Editor

Dataflow Editor

Show time!

Data modules

• self-contained units of the pipeline

• expressing dependencies on sources and other data modules

• recycling the dataflow engine

• DSL to declare dataflows

• unit test DSL to test flow and individual transformations

• sbt plugin to handle all devops related tasks

• automatic orchestration through Airflow
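
The bullets above describe data modules as units that declare their dependencies on sources and on other data modules, with orchestration derived from those declarations. The slides do not show how this is implemented, so what follows is only a minimal sketch in plain Scala. All names (`DataModule`, `Orchestration.executionOrder`, the example module names) are hypothetical, and the depth-first topological sort merely stands in for the kind of ordering a scheduler such as Airflow would compute; it is not Airflow code.

```scala
import scala.collection.mutable

// Hypothetical model, for illustration only: a module has a name and the
// names of the sources or modules it depends on.
final case class DataModule(name: String, dependsOn: Set[String] = Set.empty)

object Orchestration {
  /** Returns names in an order where every module comes after everything it
    * depends on. Names that are not declared as modules are treated as raw
    * sources with no dependencies. Throws IllegalArgumentException on a cycle.
    */
  def executionOrder(modules: Seq[DataModule]): List[String] = {
    val byName     = modules.map(m => m.name -> m).toMap
    val finished   = mutable.LinkedHashSet.empty[String] // preserves insertion order
    val inProgress = mutable.Set.empty[String]           // names on the current DFS path

    def visit(name: String): Unit =
      if (!finished.contains(name)) {
        require(!inProgress.contains(name), s"dependency cycle through '$name'")
        inProgress += name
        // Visit dependencies first; sorting keeps the result deterministic.
        byName.get(name).toList.flatMap(_.dependsOn.toList.sorted).foreach(visit)
        inProgress -= name
        finished += name
      }

    modules.map(_.name).sorted.foreach(visit)
    finished.toList
  }
}
```

With two hypothetical modules, `subscriber_profiles` depending on the `cdrs` source and `churn_features` depending on `subscriber_profiles` and a `crm` source, `executionOrder` places both sources and `subscriber_profiles` ahead of `churn_features`.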

Dataflow DSL
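
The DSL shown on this slide is not preserved in the transcript. To give a rough idea of what declaring a dataflow in code can look like, here is a small sketch under stated assumptions: `Flow`, `Dataflow.source`, the `~>` operator, the `Cdr` case class, and the step names are all invented for illustration and are not the speaker's DSL. Plain Scala collections stand in for Spark datasets so that the example runs without a cluster.

```scala
// Hypothetical mini-DSL: a flow is a chain of named, typed steps.
final case class Flow[A, B](steps: List[String], run: A => B) {
  // `~>` appends a named step and records its name so the flow can be inspected.
  def ~>[C](step: (String, B => C)): Flow[A, C] =
    Flow(steps :+ step._1, run andThen step._2)
}

object Dataflow {
  // Starts a flow from a named source; the source passes its input through unchanged.
  def source[A](name: String): Flow[A, A] = Flow(List(name), (a: A) => a)
}

// Illustrative record type: subscriber number and call duration in seconds.
final case class Cdr(a: String, durationSec: Int)

object Example {
  val callSecondsPerSubscriber: Flow[Seq[Cdr], Map[String, Int]] =
    Dataflow.source[Seq[Cdr]]("cdrs") ~>
      ("dropEmptyCalls" -> ((cdrs: Seq[Cdr]) => cdrs.filter(_.durationSec > 0))) ~>
      ("sumPerSubscriber" -> ((cdrs: Seq[Cdr]) =>
        cdrs.groupBy(_.a).map { case (a, cs) => a -> cs.map(_.durationSec).sum }))
}
```

Because each step carries a name and a function, a flow declared this way can be listed (`steps`) or executed (`run`), and individual steps can be exercised in isolation, which is the property the unit test DSL bullet relies on.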

Dataflow Test DSL

Automated Data Modules Orchestration

Data Module Explorer

Show time!
