ETL in Clojure

Dmitriy Morozov / JEEConf 2015

Dmitriy Morozov

Software engineer at Zoomdata.com

Functional programming junky

Occasional cyclist

@argc

Plan of attack

ETL at Zoomdata

Cascalog

Spark

Demo

Conclusion

What we do at Zoomdata

Zoomdata is a modern BI application focused on allowing everyday business users to visually interact with and explore their data and discover insights in that data.

We did ETL in Hive/Impala

Using SQL for ETL

Lessons learned

Hive is slow, and so is Hive on Tez

SQL is horrible for doing anything complicated

Code is hard to maintain, reuse and test

Why Clojure?

Functional!

Runs on JVM

Interactive development

Zero delta between prototype code and production code

Cascalog

Datalog DSL in Clojure

Built on top of Hadoop and Cascading

Query compiles to Hadoop MapReduce jobs

Supports local execution for prototyping

Great testing story

Datalog

Logic programming language

Syntactically a subset of Prolog

Often used as a query language for deductive databases

Query statements can be stated in any order

Word Count using Hadoop API

Word count in Cascalog
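The code from this slide is not preserved in the transcript. A minimal sketch of what a Cascalog word count can look like, assuming Cascalog 2.x on the classpath (the aggregators live in cascalog.logic.ops there; Cascalog 1.x used cascalog.ops). The names sentences, split-words and word-counts are hypothetical, and the in-memory vector stands in for a real Hadoop tap:

```clojure
(ns etl.word-count
  (:require [cascalog.api :refer :all]
            [cascalog.logic.ops :as c]   ; assumption: Cascalog 2.x namespace
            [clojure.string :as str]))

;; Hypothetical in-memory generator: a plain vector of 1-tuples.
;; In production this would be a tap such as (hfs-textline "path").
(def sentences
  [["the quick brown fox"]
   ["the lazy dog"]])

;; mapcat-style operation: one input line yields zero or more words.
(defmapcatfn split-words [line]
  (str/split line #"\s+"))

;; ??<- defines the query, runs it locally and returns the result tuples.
(defn word-counts []
  (??<- [?word ?count]
        (sentences ?line)              ; generator: binds each line
        (split-words ?line :> ?word)   ; operation: emits one tuple per word
        (c/count ?count)))             ; aggregator: counts tuples per ?word
```

Calling (word-counts) at the REPL runs the whole flow in local mode, which is what makes the same query usable for both prototyping and tests.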

Cascalog Query Structure

Cascalog / Generators

Cascalog / Operations

Cascalog / Joins

Cascalog / Operations

Cascalog / Aggregators

Cascalog / Troubleshooting

Cascalog / Testing

Cascalog / Troubleshooting

Flow Visualisation / DOT

Flow Visualisation / Driven

DEMO

Cascalog Downsides

Hadoop < Spark **

No support for streaming data

What are the alternatives?

Java API for Spark

Flambo

Sparkling

Customer X

Customer X wants to do Data Science!

Drug Persistence

Determining whether a patient is persistent or not based on whether she refilled the prescription in time.
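The transcript does not preserve how "in time" was defined for Customer X. As a rough sketch in plain Clojure, assume each fill carries a date and a days-supply, and that a refill counts as on time when it arrives no more than grace-days after the previous supply runs out. The function name, the map keys and the grace period are all hypothetical:

```clojure
(ns etl.persistence
  (:import [java.time LocalDate]
           [java.time.temporal ChronoUnit]))

(defn persistent?
  "True when every refill arrives no later than `grace-days` after the
  previous fill's supply runs out. `fills` is a seq of maps with
  :date (LocalDate) and :days-supply (integer). Hypothetical definition."
  [grace-days fills]
  (let [sorted (sort-by :date fills)]
    (every? (fn [[prev nxt]]
              (let [run-out (.plusDays ^LocalDate (:date prev)
                                       (long (:days-supply prev)))
                    ;; days between running out and the next fill;
                    ;; negative when the patient refilled early
                    gap     (.between ChronoUnit/DAYS run-out (:date nxt))]
                (<= gap grace-days)))
            ;; consecutive pairs of fills: [f1 f2] [f2 f3] ...
            (partition 2 1 sorted))))

;; Usage: a 30-day supply filled on Jan 1 runs out on Jan 31.
(def on-time
  [{:date (LocalDate/of 2015 1 1) :days-supply 30}
   {:date (LocalDate/of 2015 2 5) :days-supply 30}])   ; 5 days late

(def lapsed
  [{:date (LocalDate/of 2015 1 1) :days-supply 30}
   {:date (LocalDate/of 2015 4 1) :days-supply 30}])   ; 60 days late

(persistent? 15 on-time) ; => true
(persistent? 15 lapsed)  ; => false
```

Because the predicate is an ordinary function over a patient's fills, it could be applied per patient after grouping in either Cascalog or Spark.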

Example: Drug Persistence

Things to check out

How Yieldbot does Data science in Clojure

Cascalog for the Impatient

Streaming MapReduce in Clojure

Sparkling

Flambo

Thank you!