Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium,...

Page 1: Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium, ThoughtWorks.

Big Data Pipeline @tuplejump

Page 2:

A data engineering startup, with a vision to simplify data engineering and empower the next generation of data-powered miracles!

tuplejump

Page 3:

Who Am I?

• Founder and CEO @ Tuplejump

• Previously worked at Pramati, Cordys and a couple of startups

• A polyglot developer

• Started with Perl and PHP, have worked with VB.Net, C#, VC++, Erlang and Haskell

• Love data hacking in R and Python

• Java and Javascript fed me for a long long time

• Committed to Scala

• Believe in choosing the best tool for the task

• Open Source fanatic

@milliondreams | mytechrantings.blogspot.com

Page 4:

The big data pipeline

COLLECT TRANSFORM

PREDICT

STORE

EXPLORE VISUALIZE

Page 5:

The Tuplejump Platform

COLLECT TRANSFORM

PREDICT

STORE

EXPLORE VISUALIZE

Hydra

The tentacled framework for gathering high-volume, high-velocity data from push and pull sources. Powered by Akka, it reacts to events on demand and streams data to Spark for batch processing.

COLLECT

Spark + Calliope

The friendly Spark API, with added features to easily read data from and write data to Cassandra-powered storage.

TRANSFORM

Cassandra++

Cassandra provides a single storage mechanism for files, (un)structured data and generic data.

STORE

MinerBot

Building on Spark's ML framework and moving towards machine-assisted insights, we are building our own EA and ANN/DL frameworks to take ML to the next level.

PREDICT

Shark + Calliope

Ad hoc querying with Shark on your data in Dstore.

UberCube

An OLAP cube engine.

EXPLORE

Pissaro

A modern, game-changing data frontend that is “not just dashboards”, providing highly interactive and reactive visualization.

VISUALIZE

Page 6:

Advantages

• All the advantages of Spark + all the advantages of Cassandra + much more!

• Over 500x (100x in case of filtered data) faster than traditional Hadoop solutions

• Shark + C* provide for superfast ad hoc querying.

• UberCube empowers sub-millisecond responses on very large cubes

• MinerBot provides ready-to-use ML algorithms, plus the possibility of much more complex algorithms and mechanisms than just MapReduce.

• Ready to use, no integration required

• Easy to develop, deploy, monitor and scale

Page 7:

Why Scala?

• Object oriented and functional

• Runs on the JVM

• 100% compatible with Java

• Modern, evolving, scalable

• Concise, flexible and high performance

• Excellent support for DSL development

• Spark and Play use Scala as their primary language

• We have used it for a long time and we love it!
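As a small illustration of the "object oriented and functional" and "concise" points, here is a sketch using only the Scala standard library. The event types and user names are made up for this example and do not come from the Tuplejump platform.

```scala
// Object-oriented side: a sealed hierarchy of immutable case classes.
sealed trait Event
final case class Click(user: String, x: Int, y: Int) extends Event
final case class PageView(user: String, url: String) extends Event

object WhyScalaSketch {
  // Functional side: pattern matching instead of visitor boilerplate.
  def describe(e: Event): String = e match {
    case Click(user, x, y)   => s"$user clicked at ($x, $y)"
    case PageView(user, url) => s"$user viewed $url"
  }

  // Higher-order collection methods: group events by user, then count.
  def eventsPerUser(events: Seq[Event]): Map[String, Int] =
    events.groupBy {
      case Click(user, _, _) => user
      case PageView(user, _) => user
    }.map { case (user, es) => user -> es.size }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Click("asha", 10, 20),
      PageView("asha", "/home"),
      PageView("ravi", "/docs")
    )
    events.map(describe).foreach(println)
    println(eventsPerUser(events))
  }
}
```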

Page 8:

The ultimate gyan!

You can flirt with other languages,

You can have short affairs with a few,

You will fall in love with Scala at first sight,

You have to marry her to know her!

Page 9:

Let’s call in some friends

• Akka - Actors to build concurrent and distributed applications

• Spark - The blue eyed whiz kid of the Big Data class

• Play - The web development champion

• SBT - The best builder in town

• ScalaTest - The story teller

• Shapeless and Scalaz - Masters of the Dark Arts

Page 10:

Concurrency With Akka

• Inspired by Erlang’s Actor Model

• Runs on the JVM

• Actors define behavior to handle typed messages

• Actors process one message at a time

• Can use Group/Pool of actors behind routers for concurrency

• Can run thousands of actors on a modern server

• Location transparency for clustering

• Supervision and state recovery for HA
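The "one message at a time" point can be sketched with a counter actor. This is a minimal example assuming the classic (untyped) Akka actor API, version 2.4 or later; the message and class names are hypothetical and not taken from Hydra.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// Messages understood by the actor (hypothetical names for this sketch).
case object Increment
case object GetCount

// The actor owns its state and handles one message at a time,
// so `count` needs no locks or synchronization.
class Counter extends Actor {
  private var count = 0

  def receive: Receive = {
    case Increment => count += 1
    case GetCount  => sender() ! count
  }
}

object CounterDemo {
  def run(n: Int): Int = {
    val system = ActorSystem("demo")
    try {
      val counter = system.actorOf(Props(new Counter), "counter")
      (1 to n).foreach(_ => counter ! Increment)

      implicit val timeout: Timeout = Timeout(3.seconds)
      // All sends come from this one thread to a local actor, so
      // GetCount is enqueued after every Increment.
      Await.result((counter ? GetCount).mapTo[Int], 3.seconds)
    } finally {
      system.terminate()
    }
  }

  def main(args: Array[String]): Unit = println(run(1000))
}
```

Blocking with `Await` is used here only to keep the sketch short; inside a real actor system one would normally reply with messages or compose futures instead.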

Page 11:

Batch Processing with Spark

• Resilient Distributed Datasets

• Fast in-memory big data

• Map/Reduce on steroids

• Iterative and interactive

• Code in Scala, Java, Python and now R

• Streaming (DStreams - Batch processing on streams)

• MLlib, Shark, Spark SQL, GraphX and more
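To make "Map/Reduce on steroids" concrete, here is the classic word count written against the RDD API. It is a sketch that assumes Spark 1.3 or later on the classpath and runs in local mode; the object and application names are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  // Builds an RDD from in-memory lines and counts word occurrences.
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)              // distribute the input as an RDD
      .flatMap(_.split("\\s+"))        // "map" phase: lines to words
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)              // "reduce" phase: sum per word
      .collect()                       // bring results back to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("word-count-sketch")
      .setMaster("local[*]")           // local mode, for experimentation only
    val sc = new SparkContext(conf)
    try countWords(sc, Seq("to be or not to be")).foreach(println)
    finally sc.stop()
  }
}
```

The whole job is a chain of ordinary-looking collection operations, which is what makes the same code usable both interactively and in batch.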

Page 12:

Web development with Play

• Modern high velocity, highly scalable

• Built on Akka and Netty (NIO)

• Reactive in design (reactive I/O)

• Async HTTP, streaming HTTP, Comet, WebSockets, build your own protocol

• Feature rich yet flexible

Page 13:

Build with SBT

• I hate writing XML

• Very easy to get started

• All the power of Scala in the build

• Maven dependency management + more
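A minimal `build.sbt` gives a feel for the points above: no XML, Maven-style coordinates for dependencies, and plain Scala for custom tasks. The project name, versions and the `hello` task are placeholders chosen for this sketch, not the build of any Tuplejump project.

```scala
// build.sbt: a minimal sketch; names and versions are placeholders.
name := "pipeline-sketch"
version := "0.1.0"
scalaVersion := "2.13.12"

// Maven-style coordinates; %% appends the Scala binary version.
libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-actor" % "2.6.20",
  "org.scalatest"     %% "scalatest"  % "3.2.17" % Test
)

// "All the power of Scala in the build": a custom task is ordinary Scala code.
lazy val hello = taskKey[Unit]("Prints a greeting (hypothetical task)")
hello := println(s"Building ${name.value} ${version.value}")
```

Running `sbt hello` from the project directory would execute the custom task, and `sbt test` would compile the project and run its tests.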

Page 14:

Testing with ScalaTest

• Write specs, not tests; an excellent tool for BDD

• Specs DSL very close to English

• Many testing styles

• Powerful matchers (“should be”)

• Fixtures

• Mock objects with ScalaMock
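The "should be" matchers and the near-English spec style look roughly like this. The sketch assumes ScalaTest 3.1 or later (older releases used `org.scalatest.FlatSpec` and `org.scalatest.Matchers` instead) and Scala 2.13; the spec itself is an invented example.

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import scala.collection.mutable

// Each clause reads as a sentence: "A Stack should pop values in ... order".
class StackSpec extends AnyFlatSpec with Matchers {

  "A Stack" should "pop values in last-in-first-out order" in {
    val stack = mutable.Stack[Int]()
    stack.push(1)
    stack.push(2)
    stack.pop() should be (2)
    stack.pop() should be (1)
  }

  it should "throw NoSuchElementException when an empty stack is popped" in {
    val empty = mutable.Stack[Int]()
    a [NoSuchElementException] should be thrownBy empty.pop()
  }
}
```

FlatSpec is only one of the "many testing styles"; the same assertions could be written in a FunSuite, WordSpec or FeatureSpec style without changing the matchers.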

Page 15:

Taking functional further

• Shapeless

• Scrap your boilerplate

• Generic programming

• Existential types

• Scalaz

• Bringing Haskell to Scala

• Monads, Functors and all the theory!
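To hint at what "Functors and all the theory" means in practice, here is a hand-rolled `Functor` type class in plain Scala (2.13 or later). It is deliberately not Scalaz's API: Scalaz ships its own, much richer definitions, and the names below are chosen only for this sketch.

```scala
// A type class describing "things that can be mapped over".
trait Functor[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}

object Functor {
  // Instances live in the companion so they are found implicitly.
  implicit val optionFunctor: Functor[Option] = new Functor[Option] {
    def map[A, B](fa: Option[A])(f: A => B): Option[B] = fa.map(f)
  }

  implicit val listFunctor: Functor[List] = new Functor[List] {
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
  }
}

object FunctorSketch {
  // Written once, works for any F that has a Functor instance in scope.
  def double[F[_]](fa: F[Int])(implicit F: Functor[F]): F[Int] =
    F.map(fa)(_ * 2)

  def main(args: Array[String]): Unit = {
    println(double(List(1, 2, 3)))
    println(double(Option(21)))
  }
}
```

The payoff is that `double` is generic over the container: adding a new instance makes every such function available for the new type without touching existing code.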

Page 16:

Thank you!

• http://www.tuplejump.com/

• http://github.com/tuplejump/

• http://tuplejump.github.com/calliope/

• http://tuplejump.github.com/stargate/

• @tuplejump on twitter