Harnessing Spark and Cassandra with Groovy

Harnessing the Power of Spark + Cassandra with Groovy, by Steve Pember (CTO, ThirdChannel), Gr8Conf US, 2017, @svpember

Transcript of Harnessing Spark and Cassandra with Groovy

Page 1: Harnessing Spark and Cassandra with Groovy

Harnessing the Power of Spark + Cassandra with Groovy

Steve Pember CTO, ThirdChannel Gr8Conf US, 2017

@svpember

Page 2: Harnessing Spark and Cassandra with Groovy

Relational Databases are Fantastic

Page 3: Harnessing Spark and Cassandra with Groovy

SQL makes you Strong

Page 4: Harnessing Spark and Cassandra with Groovy
Page 5: Harnessing Spark and Cassandra with Groovy
Page 6: Harnessing Spark and Cassandra with Groovy
Page 7: Harnessing Spark and Cassandra with Groovy
Page 8: Harnessing Spark and Cassandra with Groovy

Page 9: Harnessing Spark and Cassandra with Groovy

Agenda

• Spark

• Cassandra

• Spark + Cassandra

• Working with Spark + Cassandra

• Demo

Page 10: Harnessing Spark and Cassandra with Groovy

Apache Spark

• Distributed Execution Engine

Page 11: Harnessing Spark and Cassandra with Groovy

Page 12: Harnessing Spark and Cassandra with Groovy

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

Page 13: Harnessing Spark and Cassandra with Groovy

Hadoop

• Map / Reduce

• Storage via HDFS

• Each calculation step written to disk

Spark

• More than Map/Reduce

• No dependent storage mechanism

• Clustered calculations, each step in memory

Page 14: Harnessing Spark and Cassandra with Groovy

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

• Creation was a Happy Accident

Page 15: Harnessing Spark and Cassandra with Groovy

Page 16: Harnessing Spark and Cassandra with Groovy
Page 17: Harnessing Spark and Cassandra with Groovy

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

• Creation was a Happy Accident

• Architecture

Page 18: Harnessing Spark and Cassandra with Groovy

Page 19: Harnessing Spark and Cassandra with Groovy

Your Groovy App

Page 20: Harnessing Spark and Cassandra with Groovy

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

• Creation was a Happy Accident

• Architecture

• Programmatic structure

Page 21: Harnessing Spark and Cassandra with Groovy

The SparkContext submits Jobs to the Cluster

Page 22: Harnessing Spark and Cassandra with Groovy

Operations are performed against RDDs

Page 23: Harnessing Spark and Cassandra with Groovy

Resilient Distributed Dataset

• Immutable

• Partitioned

• Parallel operations

• Created by performing operations on other RDDs

• Reusable & Composable

Page 24: Harnessing Spark and Cassandra with Groovy

Page 25: Harnessing Spark and Cassandra with Groovy

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

• Creation was a Happy Accident

• Architecture

• Programmatic structure

• APIs

Page 26: Harnessing Spark and Cassandra with Groovy

More Than Map/Reduce

Page 27: Harnessing Spark and Cassandra with Groovy

RDD operations

• map

• reduce

• filter

• flatMap

• zip

• groupBy

• … plus many more
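
To give a feel for what these operation names do, here is a minimal sketch using plain Java streams rather than Spark itself. Spark's JavaRDD exposes operations with the same names (map, filter, flatMap, reduce, groupBy), but it runs them lazily and across a cluster, whereas this stand-in runs in a single JVM. The class name, method names and sample data are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Single-JVM analogue of a few RDD-style operations (not Spark code). */
final class RddStyleOperationsSketch {

    /** flatMap splits lines into words, filter drops blanks, map lower-cases. */
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    /** reduce folds every word length into one total. */
    static int totalLength(List<String> words) {
        return words.stream().map(String::length).reduce(0, Integer::sum);
    }

    /** groupBy buckets words by their first letter. */
    static Map<Character, List<String>> byFirstLetter(List<String> words) {
        return words.stream().collect(Collectors.groupingBy(w -> w.charAt(0)));
    }
}
```

Each step returns a new collection and leaves its input untouched, which loosely mirrors how an RDD is created by performing operations on other RDDs.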

Page 28: Harnessing Spark and Cassandra with Groovy

Page 29: Harnessing Spark and Cassandra with Groovy

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

• Creation was a Happy Accident

• Architecture

• Programmatic structure

• APIs

• Additional Modules

Page 30: Harnessing Spark and Cassandra with Groovy

Spark SQL…!

Page 31: Harnessing Spark and Cassandra with Groovy
Page 32: Harnessing Spark and Cassandra with Groovy

JDBC?

Page 33: Harnessing Spark and Cassandra with Groovy

Spark Streaming!

Page 34: Harnessing Spark and Cassandra with Groovy
Page 35: Harnessing Spark and Cassandra with Groovy

Agenda

• Spark

• Cassandra

Page 36: Harnessing Spark and Cassandra with Groovy

Apache Cassandra (C*)

• NoSQL Datastore

Page 37: Harnessing Spark and Cassandra with Groovy

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

Page 38: Harnessing Spark and Cassandra with Groovy

Deterministic Distribution

Page 39: Harnessing Spark and Cassandra with Groovy

Page 40: Harnessing Spark and Cassandra with Groovy

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

• High Replication
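
As a rough illustration of deterministic distribution and replication, the sketch below models a toy token ring. A partition key is hashed to a token, the first node at or after that token owns the row, and the next nodes clockwise hold the replicas. This is a simplification under stated assumptions: Cassandra's default partitioner is Murmur3 (CRC32 stands in for it here), and real clusters add virtual nodes and configurable replication strategies. The class name, node names and tokens are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Toy token ring; illustrates the idea, not Cassandra's implementation. */
final class TokenRingSketch {
    // token -> node name, sorted so the ring can be walked clockwise
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String name, long token) {
        ring.put(token, name);
    }

    /** Hash a partition key to a token (CRC32 is a stand-in for Murmur3). */
    static long tokenFor(String partitionKey) {
        CRC32 crc = new CRC32();
        crc.update(partitionKey.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Owner first, then the following nodes clockwise, up to the replication factor. */
    List<String> replicasFor(String partitionKey, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        Long current = ring.ceilingKey(tokenFor(partitionKey));
        if (current == null) {
            current = ring.firstKey(); // wrap around the ring
        }
        int wanted = Math.min(replicationFactor, ring.size());
        while (replicas.size() < wanted) {
            replicas.add(ring.get(current));
            current = ring.higherKey(current);
            if (current == null) {
                current = ring.firstKey();
            }
        }
        return replicas;
    }
}
```

Because placement depends only on the key's hash and the ring layout, any client can work out which nodes hold a row without consulting a central index.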

Page 41: Harnessing Spark and Cassandra with Groovy

Page 42: Harnessing Spark and Cassandra with Groovy

Page 43: Harnessing Spark and Cassandra with Groovy

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

• High Replication

• High Durability

Page 44: Harnessing Spark and Cassandra with Groovy

Page 45: Harnessing Spark and Cassandra with Groovy

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

• High Replication

• High Durability

• Linear Scalability

Page 46: Harnessing Spark and Cassandra with Groovy

Each new Node results in increased Storage with no loss in performance

Page 47: Harnessing Spark and Cassandra with Groovy

Page 48: Harnessing Spark and Cassandra with Groovy

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

• High Replication

• High Durability

• Linear Scalability

• Data Model (CQL)

Page 49: Harnessing Spark and Cassandra with Groovy

Column Oriented Database

Page 50: Harnessing Spark and Cassandra with Groovy

But it’s SQL-like!

Page 51: Harnessing Spark and Cassandra with Groovy

Page 52: Harnessing Spark and Cassandra with Groovy

Page 53: Harnessing Spark and Cassandra with Groovy

Page 54: Harnessing Spark and Cassandra with Groovy

Querying

Page 55: Harnessing Spark and Cassandra with Groovy

C* Querying

• select * from

• all queries must include the partition key(s) in the WHERE clause

• ORDER BY is limited to clustering keys

• keys cannot be altered once the table exists; queries must always go through the same keys
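
One way to picture why these rules exist is to think of a table as a map from partition key to a sorted map of rows. The sketch below is only a mental model in plain Java, not how Cassandra stores data on disk. A read has to name the partition, and rows inside a partition are kept sorted by their clustering column, so ordering by that column is the only ordering that comes for free. The class name, method names and sample keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Mental model of a table: partition key -> rows sorted by clustering key. */
final class PartitionedTableSketch {
    private final Map<String, NavigableMap<Long, String>> partitions = new HashMap<>();

    void insert(String partitionKey, long clusteringKey, String value) {
        partitions.computeIfAbsent(partitionKey, k -> new TreeMap<>())
                  .put(clusteringKey, value);
    }

    /** A read names exactly one partition; rows come back in clustering-key order. */
    List<String> select(String partitionKey, boolean descending) {
        NavigableMap<Long, String> rows = partitions.get(partitionKey);
        if (rows == null) {
            return Collections.emptyList();
        }
        return new ArrayList<>(descending ? rows.descendingMap().values() : rows.values());
    }
}
```

There is deliberately no method that scans every partition or sorts by an arbitrary field. That kind of work is what the deck hands off to Spark.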

Page 56: Harnessing Spark and Cassandra with Groovy

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

• High Replication

• High Durability

• Linear Scalability

• Data Model (CQL)

• Designing your Data Model

Page 57: Harnessing Spark and Cassandra with Groovy

Page 58: Harnessing Spark and Cassandra with Groovy

Page 59: Harnessing Spark and Cassandra with Groovy

Agenda

• Spark

• Cassandra

• Spark + Cassandra

Page 60: Harnessing Spark and Cassandra with Groovy
Page 61: Harnessing Spark and Cassandra with Groovy

Spark + Cassandra

• Reduce each other’s weaknesses

• Filter on the server side (with C*)

• Join tables, filter results (with Spark)

Page 62: Harnessing Spark and Cassandra with Groovy

Companies have been formed

Page 63: Harnessing Spark and Cassandra with Groovy

Page 64: Harnessing Spark and Cassandra with Groovy

Cluster Design

Page 65: Harnessing Spark and Cassandra with Groovy

Page 66: Harnessing Spark and Cassandra with Groovy

Data Locality!

Page 67: Harnessing Spark and Cassandra with Groovy

Page 68: Harnessing Spark and Cassandra with Groovy

Page 69: Harnessing Spark and Cassandra with Groovy

Pipeline architecture

Page 70: Harnessing Spark and Cassandra with Groovy

Page 71: Harnessing Spark and Cassandra with Groovy

Agenda

• Spark

• Cassandra

• Spark + Cassandra

• Working with Spark + Cassandra

Page 72: Harnessing Spark and Cassandra with Groovy

Coding Spark + C*

Page 73: Harnessing Spark and Cassandra with Groovy

Terminology

• SparkConf

• JavaSparkContext

• JavaFunctions

• Mappers

Page 74: Harnessing Spark and Cassandra with Groovy

Page 75: Harnessing Spark and Cassandra with Groovy

Spark Conf

• spark.master -> URL of the master node

• spark.app.name -> want to see your client show up in the Spark UI?

• spark.executor.memory -> limits memory per executor on workers

• spark.executor.cores -> limits cores on each worker (need to share with C*!)

• spark.submit.deployMode -> ‘client’ or ‘cluster’

• spark.jars.packages -> Maven / Gradle style coordinates (groupId:artifactId:version)

• spark.jars.ivy -> path to the Ivy user directory (local cache) used when resolving spark.jars.packages

• more at: http://spark.apache.org/docs/latest/configuration.html#available-properties
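
As a small sketch of how these properties travel, the helper below turns a map of settings into the `--conf key=value` arguments that spark-submit accepts. The same key/value pairs could instead be passed to `SparkConf.set(key, value)` inside the application. The class name, method names and every value shown (app name, memory, cores) are illustrative assumptions, not settings from the talk.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds spark-submit style arguments from a map of Spark properties. */
final class SparkConfArgsSketch {

    /** Example settings; all values are placeholders. */
    static Map<String, String> exampleSettings() {
        Map<String, String> conf = new LinkedHashMap<>(); // keeps insertion order
        conf.put("spark.master", "local[2]");
        conf.put("spark.app.name", "groovy-spark-demo");
        conf.put("spark.executor.memory", "2g");
        conf.put("spark.executor.cores", "2");
        return conf;
    }

    /** Render each property as a "--conf", "key=value" argument pair. */
    static List<String> toSubmitArgs(Map<String, String> conf) {
        List<String> args = new ArrayList<>();
        conf.forEach((key, value) -> {
            args.add("--conf");
            args.add(key + "=" + value);
        });
        return args;
    }
}
```

Keeping the settings in one map makes it easy to use the same values for a local run and a cluster run, changing only spark.master.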

Page 76: Harnessing Spark and Cassandra with Groovy

Master URL Overloading

• “local” -> run Spark locally, in-process, with one worker thread

• “local[<K>]” -> Spark, locally, with K worker threads

• “local[*]” -> Spark, locally, with ALL YOUR THREADS (one per logical core)!

• “spark://<host string>:<port>” -> URL for a Spark cluster master node, using Spark’s own standalone cluster manager

• also options for Mesos and YARN

Page 77: Harnessing Spark and Cassandra with Groovy

Page 78: Harnessing Spark and Cassandra with Groovy

However, a Warning

Page 79: Harnessing Spark and Cassandra with Groovy
Page 80: Harnessing Spark and Cassandra with Groovy
Page 81: Harnessing Spark and Cassandra with Groovy

But where does my code live?

Page 82: Harnessing Spark and Cassandra with Groovy

Page 83: Harnessing Spark and Cassandra with Groovy

CLASS_PATH: org.apache.spark, com.fasterxml.jackson, com.yourco.yourapp.pojos.*

CLASS_PATH: org.apache.spark, com.fasterxml.jackson

CLASS_PATH: org.apache.spark, com.fasterxml.jackson

Page 84: Harnessing Spark and Cassandra with Groovy

Agenda

• Spark

• Cassandra

• Spark + Cassandra

• Working with Spark + Cassandra

• Demo

Page 85: Harnessing Spark and Cassandra with Groovy

Thank You!

@svpember

Page 86: Harnessing Spark and Cassandra with Groovy

Links

• Cassandra on AWS official Whitepaper: https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf

• Demo code: https://github.com/spember/ratpack-spark-cassandra-demo

Page 87: Harnessing Spark and Cassandra with Groovy

Images

• Database Sharding: https://dzone.com/articles/ebay-secret-database-scaling

• Indiana Jones Warehouse: http://logisticalfictions.tumblr.com/page/9

• Strong (Spongebob): www.reactiongifs.com/strongbob/?utm_source=rss&utm_medium=rss&utm_campaign=strongbob

• Cheetah: www.livescience.com/21944-usain-bolt-vs-cheetah-animal-olympics.html

• Big Data Cartoon: http://www.kdnuggets.com/2016/08/cartoon-make-data-great-again.html

• Spark Streaming: http://velvia.github.io/presentations/2015-filodb-spark-streaming/#/

• Picard + Riker: http://www.douxreviews.com/2015/09/star-trek-next-generation-matter-of.html

• Software Engineers: http://pyxurz.blogspot.com/2011/10/office-space-page-2-of-6.html