Apache Spark™ Applications the Easy Way - Pierre Borckmans

Post on 16-Apr-2017

Writing Spark applications, the easy way

Pierre Borckmans

Data Science Meetup - Spark & Machine Learning - October 27th, 2016 - Brussels

The pivot...

Platform overview

Data pipeline overview

The journey... 3 paradigms for Spark application development

From hardcoded dataflows...

    val subscribers = cdrs.map( x => ( x.A.toLong, x ) ).groupByKey

    subscribers.mapValues(
      _.map( cdr => {
        for ( ( category, dimensions ) <- allDimensions ) yield (
          category,
          for ( dim <- dimensions ) yield {
            val fields = dim._1
            val values = dim._2
            if ( cdr.check( fields, values ) ) f( category )( cdr )
            else f0( category )
          }
        )
      } ).reduce( ( m1, m2 ) => {
        for ( ( category, l1 ) <- m1 ) yield {
          val l2 = m2( category )
          val d = l1.zip( l2 ).map( l => { g( category )( l._1, l._2 ) } )
          ( category, d )
        }
      } )
    )
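The slide shows the hardcoded dataflow without the definitions it relies on. The following is a minimal, self-contained sketch of the same group, map and reduce pattern on plain Scala collections instead of Spark RDDs. Everything except the field `A` and the names `check`, `allDimensions`, `f`, `f0` and `g` is an assumption made for illustration: the other `Cdr` fields, the semantics of `check`, and the contents of the category maps are hypothetical and not taken from the talk.

```scala
object HardcodedDataflowSketch {
  // Hypothetical call detail record. Only `A` and `check` appear on the slide;
  // `kind` and `duration` are assumed fields.
  final case class Cdr(A: String, kind: String, duration: Long) {
    // Assumed semantics: the record matches when every listed field holds the paired value.
    def check(fields: Seq[String], values: Seq[String]): Boolean =
      fields.zip(values).forall {
        case ("kind", v) => kind == v
        case _           => false
      }
  }

  // Assumed shape: category -> list of (fields, values) dimensions.
  val allDimensions: Map[String, Seq[(Seq[String], Seq[String])]] = Map(
    "count"    -> Seq((Seq("kind"), Seq("voice")), (Seq("kind"), Seq("sms"))),
    "duration" -> Seq((Seq("kind"), Seq("voice")))
  )

  // Per-category value for a matching record (assumed).
  val f: Map[String, Cdr => Long] = Map(
    "count"    -> ((_: Cdr) => 1L),
    "duration" -> ((c: Cdr) => c.duration)
  )

  // Per-category neutral value for a non-matching record (assumed).
  val f0: Map[String, Long] = Map("count" -> 0L, "duration" -> 0L)

  // Per-category combiner used when merging two records (assumed).
  val g: Map[String, (Long, Long) => Long] = Map(
    "count"    -> ((a: Long, b: Long) => a + b),
    "duration" -> ((a: Long, b: Long) => a + b)
  )

  def aggregate(cdrs: Seq[Cdr]): Map[Long, Map[String, Seq[Long]]] = {
    // Stand-in for cdrs.map(x => (x.A.toLong, x)).groupByKey
    val subscribers: Map[Long, Seq[Cdr]] = cdrs.groupBy(_.A.toLong)

    subscribers.map { case (subscriber, records) =>
      // One map per record: category -> one value per dimension.
      val perRecord: Seq[Map[String, Seq[Long]]] = records.map { cdr =>
        allDimensions.map { case (category, dimensions) =>
          category -> dimensions.map { case (fields, values) =>
            if (cdr.check(fields, values)) f(category)(cdr) else f0(category)
          }
        }
      }
      // Merge the per-record maps position by position with the category's combiner.
      val merged = perRecord.reduce { (m1, m2) =>
        m1.map { case (category, l1) =>
          category -> l1.zip(m2(category)).map { case (a, b) => g(category)(a, b) }
        }
      }
      subscriber -> merged
    }
  }
}
```

With two voice calls of 60 and 30 seconds and one SMS for the same subscriber, `aggregate` yields `Seq(2, 1)` for `"count"` and `Seq(90)` for `"duration"` under these assumed definitions. Even in this reduced form, the business logic is tangled with the plumbing, which is the problem the rest of the talk addresses.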

...to fully interactive ones...

and back to code...

... with benefits !

Hardcoded dataflows

Dataflow Editor

Dataflow Editor: Show time!

Video

Data modules

• self-contained units of the pipeline
• express dependencies on sources and on other data modules
• reuse the dataflow engine
• DSL to declare dataflows
• unit test DSL to test whole flows and individual transformations
• sbt plugin to handle all DevOps-related tasks
• automatic orchestration through Airflow

Dataflow DSL
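The slide content for the DSL did not survive in this transcript. As an illustration of what a DSL to declare dataflows can look like, here is a minimal sketch in plain Scala. It is not the DSL presented in the talk: the names `dataflow`, `from`, `filter`, `transform`, `run` and `describe` are hypothetical, and plain `Map`s stand in for Spark rows.

```scala
object DataflowDslSketch {
  // Plain maps stand in for Spark rows in this sketch.
  type Row = Map[String, Any]

  // A named step of the flow (hypothetical).
  final case class Step(label: String, run: Seq[Row] => Seq[Row])

  // An immutable description of a flow: a source name plus an ordered list of steps.
  final case class Dataflow(name: String, source: String, steps: Vector[Step] = Vector.empty) {
    def filter(label: String)(p: Row => Boolean): Dataflow =
      copy(steps = steps :+ Step(label, rows => rows.filter(p)))

    def transform(label: String)(fn: Row => Row): Dataflow =
      copy(steps = steps :+ Step(label, rows => rows.map(fn)))

    // Apply every step in declaration order.
    def run(input: Seq[Row]): Seq[Row] =
      steps.foldLeft(input)((rows, step) => step.run(rows))

    // Human-readable outline of the flow, usable by tooling.
    def describe: String = (source +: steps.map(_.label)).mkString(" -> ")
  }

  final class DataflowBuilder(name: String) {
    def from(source: String): Dataflow = Dataflow(name, source)
  }

  def dataflow(name: String): DataflowBuilder = new DataflowBuilder(name)

  // Example declaration (the source name and fields are assumptions).
  val voiceMinutes: Dataflow =
    dataflow("voice-minutes")
      .from("cdrs")
      .filter("voice only")(r => r("kind") == "voice")
      .transform("seconds to minutes") { r =>
        r.updated("minutes", r("duration").asInstanceOf[Int] / 60)
      }
}
```

Because the flow is declared as data rather than hardcoded, the same description can be executed, printed with `describe`, or inspected by other tools, which is the kind of benefit a declarative approach is meant to bring.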

Dataflow Test DSL
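The test DSL itself is not reproduced in this transcript either. The sketch below only illustrates the idea of testing a single transformation and a whole flow with the same vocabulary. The names `scenario`, `when` and `expect`, and the two sample transformations, are hypothetical and are not the API shown in the talk.

```scala
object DataflowTestDslSketch {
  // Plain maps stand in for Spark rows in this sketch.
  type Row = Map[String, Any]
  type Transformation = Seq[Row] => Seq[Row]

  final case class Outcome(description: String, passed: Boolean)

  // A scenario accumulates transformations and compares the final rows to an expectation.
  final class Scenario(description: String, input: Seq[Row], steps: Vector[Transformation]) {
    def when(step: Transformation): Scenario =
      new Scenario(description, input, steps :+ step)

    def expect(expected: Seq[Row]): Outcome = {
      val actual = steps.foldLeft(input)((rows, step) => step(rows))
      Outcome(description, actual == expected)
    }
  }

  def scenario(description: String)(input: Row*): Scenario =
    new Scenario(description, input.toSeq, Vector.empty)

  // Two sample transformations under test (assumed for illustration).
  val voiceOnly: Transformation =
    rows => rows.filter(r => r("kind") == "voice")
  val toMinutes: Transformation =
    rows => rows.map(r => r.updated("minutes", r("duration").asInstanceOf[Int] / 60))

  // Testing an individual transformation.
  val singleStep: Outcome =
    scenario("voiceOnly drops sms records")(
      Map("kind" -> "sms"),
      Map("kind" -> "voice")
    ).when(voiceOnly).expect(Seq(Map("kind" -> "voice")))

  // Testing the whole flow.
  val wholeFlow: Outcome =
    scenario("voice calls are converted to minutes")(
      Map("kind" -> "voice", "duration" -> 180),
      Map("kind" -> "sms", "duration" -> 0)
    ).when(voiceOnly).when(toMinutes)
      .expect(Seq(Map("kind" -> "voice", "duration" -> 180, "minutes" -> 3)))
}
```

In a real project the `Outcome` would typically be handed to a test framework's assertion; it is kept as a plain value here so the sketch has no dependencies.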

Automated Data Modules Orchestration
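Since data modules declare their dependencies on sources and on other data modules, a run order can be derived from those declarations. The talk delegates scheduling to Airflow; the sketch below does not use Airflow and only illustrates, under assumed names (`DataModule`, `dependsOn`, `runOrder`), how declared dependencies determine an execution order. Dependencies that are not themselves data modules are treated as external sources that are always available.

```scala
import scala.annotation.tailrec

object OrchestrationSketch {
  // Hypothetical description of a data module and what it depends on.
  final case class DataModule(name: String, dependsOn: Set[String])

  // Returns the modules in an order where each one runs after the modules it depends on,
  // or an error message if the declarations contain a cycle.
  def runOrder(modules: Seq[DataModule]): Either[String, Vector[String]] = {
    val moduleNames = modules.map(_.name).toSet

    @tailrec
    def loop(pending: Seq[DataModule], done: Vector[String]): Either[String, Vector[String]] =
      if (pending.isEmpty) Right(done)
      else {
        // A module is ready once all of its module dependencies have been scheduled.
        val (ready, blocked) = pending.partition { m =>
          m.dependsOn.filter(moduleNames).forall(d => done.contains(d))
        }
        if (ready.isEmpty) Left("cycle among: " + blocked.map(_.name).mkString(", "))
        else loop(blocked, done ++ ready.map(_.name))
      }

    loop(modules, Vector.empty)
  }
}
```

For example, with a `churn-score` module depending on `subscriber-profile` and `usage-aggregates`, and `subscriber-profile` depending on `usage-aggregates`, `runOrder` schedules `usage-aggregates` first, then `subscriber-profile`, then `churn-score`.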

Data Module Explorer: Show time!

Video