Analytics with Cassandra & Spark


Transcript of Analytics with Cassandra & Spark


Matthias Niehoff

Cassandra


• Distributed database

• Highly Available

• Linearly Scalable

• Multi Datacenter Support

• No Single Point Of Failure

• CQL Query Language
  • Similar to SQL
  • No joins and aggregates

• Eventual Consistency ("Tunable Consistency")

Cassandra_


Distributed Data Storage_


[Diagram: ring of four nodes with token ranges - Node 1 (1-25), Node 2 (26-50), Node 3 (51-75), Node 4 (76-0)]

CQL - Query Language With Limitations_


SELECT * FROM performer WHERE name = 'ACDC' --> ok

SELECT * FROM performer WHERE name = 'ACDC' AND country = 'Australia' --> not ok

SELECT country, COUNT(*) AS quantity FROM artists GROUP BY country ORDER BY quantity DESC --> not supported

Table performer: name (PK), genre, country

Spark


• Open Source & Apache project since 2010

• Data processing framework
  • Batch processing
  • Stream processing

What Is Apache Spark_


• Fast
  • up to 100 times faster than Hadoop
  • a lot of in-memory processing
  • linearly scalable by adding more nodes

• Easy
  • Scala, Java and Python APIs
  • clean code (e.g. with lambdas in Java 8)
  • expanded API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count

• Fault-Tolerant
  • easily reproducible thanks to RDD lineage

Why Use Spark_


• RDDs - Resilient Distributed Datasets
  • read-only description of a collection of objects
  • distributed among the cluster (in memory or on disk)
  • defined through transformations
  • allows automatic rebuild on failure

• Operations
  • Transformations (map, filter, reduce, ...) --> new RDD
  • Actions (count, collect, save)

• Only actions start processing!
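As a minimal illustration of this laziness (not from the slides; file name and filter are placeholders), nothing is computed until the action at the end runs:

// transformations only describe new RDDs - no work happens yet
val lines = sc.textFile("data.txt")                  // "data.txt" is a placeholder
val errors = lines.filter(_.contains("ERROR"))       // still lazy
val pairs = errors.map(line => (line, 1))            // still lazy

// the action walks the lineage and triggers the actual computation
val numErrors = errors.count()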

Easily Reproducible?_


RDD Example_


scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

scala> linesWithSpark.count()
res0: Long = 126

Reproduce RDDs Using A DAG_


[Diagram: a DAG starting at a data source; transformations map(..), filter(..), sample(..), union(..) and cache() build rdd1-rdd6, and count() actions produce val1-val3]
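One way to express such a lineage in code; this is a sketch that loosely mirrors the diagram, with a made-up source and functions:

val source = sc.textFile("source.txt")                        // placeholder data source

val rdd1 = source.map(_.toLowerCase)                          // map(..)
val rdd2 = source.filter(_.nonEmpty)                          // filter(..)
val rdd3 = rdd1.sample(withReplacement = false, fraction = 0.1)
val rdd4 = rdd2.cache()                                       // cache() marks rdd2 for reuse
val rdd5 = rdd3.union(rdd4)                                   // union(..)
val rdd6 = rdd4.map(_.length)

// each action walks the DAG back to the source (or to cached data) and can rebuild lost partitions
val val1 = rdd5.count()
val val2 = rdd4.count()
val val3 = rdd6.count()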

Run Spark In A Cluster_


([atomic, collection, object], [atomic, collection, object])

val fluege = List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid"))

val pairRDD = sc.parallelize(fluege)

pairRDD.filter(_._1 == "Thomas").collect.foreach(t => println(t._1 + " flew to " + t._2))

Pair RDDs_


A pair RDD holds (key, value) pairs; the keys need not be unique.

• Parallelization!
  • keys are used for partitioning
  • pairs with different keys are distributed across the cluster

• Efficient processing of
  • aggregate by key
  • group by key
  • sort by key
  • joins and unions based on keys

Why Use Pair RDDs_
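A small sketch of these key-based operations, reusing the flight pairs from the previous slide (the home-town data is made up):

val flights = sc.parallelize(List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid")))

// aggregate by key: number of flights per person
flights.mapValues(_ => 1).reduceByKey(_ + _)

// group by key: all destinations per person
flights.groupByKey()

// sort by key
flights.sortByKey()

// join with another pair RDD on the same key
val homeTowns = sc.parallelize(List(("Thomas", "Karlsruhe"), ("Mark", "London")))
flights.join(homeTowns)          // RDD[(String, (String, String))]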


Spark & Cassandra


Use Spark And Cassandra In A Cluster_


[Diagram: a Spark client submits a driver to the Spark Master; Spark Worker Nodes (WN) run co-located with the Cassandra (C*) nodes of the cluster]

Two Datacenters - Two Purposes_


[Diagram: DC1 - Online: Cassandra (C*) nodes only; DC2 - Analytics: Cassandra nodes co-located with Spark Worker Nodes (WN) and a Spark Master]

• Spark Cassandra Connector by DataStax
  • https://github.com/datastax/spark-cassandra-connector

• Cassandra tables as Spark RDDs (read & write)

• Mapping of C* tables and rows onto Java/Scala objects

• Server-side filtering ("where")

• Compatible with
  • Spark ≥ 0.9
  • Cassandra ≥ 2.0

• Clone & compile with SBT or download from Maven Central
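With SBT, the dependency would look roughly like this; the coordinates (connector 1.3.0, published for the cluster's Scala version) are an assumption, check Maven Central for the exact artifact:

// build.sbt - assumed coordinates, verify against Maven Central
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.3.0"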

Connecting Spark With Cassandra_


• Start the shell

bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0.jar \
  --conf spark.cassandra.connection.host=localhost \
  --driver-memory 2g --executor-memory 3g

• Import the Cassandra classes

scala> import com.datastax.spark.connector._

Use The Connector In The Shell_


• Read the complete table

val movies = sc.cassandraTable("movie", "movies")
// returns CassandraRDD[CassandraRow]

• Read selected columns

val movies = sc.cassandraTable("movie", "movies").select("title", "year")

• Filter rows

val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")

• Access columns in the result set

movies.collect.foreach(r => println(r.get[String]("title")))

Read A Cassandra Table_


Read As Tuple

val movies = sc.cassandraTable[(String, Int)]("movie", "movies").select("title", "year")

val movies = sc.cassandraTable("movie", "movies").select("title", "year").as((_: String, _: Int))

// both result in a CassandraRDD[(String, Int)]

Read A Cassandra Table_


Read As Case Class

case class Movie(title: String, year: Int)

sc.cassandraTable[Movie]("movie", "movies").select("title", "year")

sc.cassandraTable("movie", "movies").select("title", "year").as(Movie)

Read A Cassandra Table_


•Every RDD can be saved

• Using Tuples

val tuples = sc.parallelize(Seq(("Hobbit", 2012), ("96 Hours", 2008)))
tuples.saveToCassandra("movie", "movies", SomeColumns("title", "year"))

• Using Case Classes

case class Movie(title: String, year: Int)
val objects = sc.parallelize(Seq(Movie("Hobbit", 2012), Movie("96 Hours", 2008)))
objects.saveToCassandra("movie", "movies")

Write Table_


// Load and format as a pair RDD
val pairRDD = sc.cassandraTable("movie", "director").map(r => (r.getString("country"), r))

// Directors per country, sorted
pairRDD.mapValues(v => 1).reduceByKey(_ + _).sortBy(-_._2).collect.foreach(println)

// or, unsorted
pairRDD.countByKey().foreach(println)

// All countries
pairRDD.keys

Pair RDDs With Cassandra_


Table director: name text (K), country text

• Joins can be expensive as they may require shuffling

val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))

val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))

movies.join(directors) // RDD[(String, (CassandraRow, CassandraRow))]

Pair RDDs With Cassandra - Join


Table director: name text (K), country text
Table movie: title text (K), director text

• Automatically on read

• Not automatically on write
  • Spark operations without shuffling --> writes are local
  • Spark operations with shuffling
    • fan-out writes to Cassandra
    • repartitionByCassandraReplica("keyspace", "table") before the write

• Joins with data locality

Using Data Locality With Cassandra_


sc.cassandraTable[CassandraRow](KEYSPACE, A)
  .repartitionByCassandraReplica(KEYSPACE, B)
  .joinWithCassandraTable[CassandraRow](KEYSPACE, B)
  .on(SomeColumns("id"))

• cassandraCount()
  • utilizes a Cassandra query (see the one-liner after the spanBy example below)
  • vs. loading the table into memory and counting it in Spark

• spanBy(), spanByKey()
  • group data by Cassandra partition key
  • do not need shuffling
  • should be preferred over groupBy/groupByKey

CREATE TABLE events (year int, month int, ts timestamp, data varchar, PRIMARY KEY (year,month,ts));

sc.cassandraTable("test","events").spanBy(row=>(row.getInt("year"),row.getInt("month")))sc.cassandraTable("test","events").keyBy(row=>(row.getInt("year"),row.getInt("month"))).spanByKey

Further Transformations & Actions_


Spark SQL


• SQL queries with Spark (SQL & HiveQL)
  • on structured data
  • on DataFrames
  • every result of Spark SQL is a DataFrame
  • all operations of the generic RDDs are available

• Supports (even on non-primary-key columns; see the join sketch below)
  • Joins
  • Union
  • Group By
  • Having
  • Order By
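A sketch of such a query over Cassandra data, using the CassandraSQLContext shown later in the talk; the tracks table and its performer column are made-up names:

import org.apache.spark.sql.cassandra.CassandraSQLContext

val csc = new CassandraSQLContext(sc)
csc.setKeyspace("musicdb")

// join and group on non-primary-key columns; "tracks" and "performer" are hypothetical
val topCountries = csc.sql(
  "SELECT a.country, COUNT(*) AS cnt " +
  "FROM artists a JOIN tracks t ON t.performer = a.name " +
  "GROUP BY a.country ORDER BY cnt DESC")
topCountries.collect.foreach(println)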

Spark SQL_


val sqlContext = new SQLContext(sc)
val persons = sqlContext.jsonFile(path)

// Show the schema
persons.printSchema()

persons.registerTempTable("persons")

val adults = sqlContext.sql("SELECT name FROM persons WHERE age > 18")
adults.collect.foreach(println)

Spark SQL - JSON Example_


{"name":"Michael"}{"name":"Jan","age":30}{"name":"Tim","age":17}

val csc = new CassandraSQLContext(sc)

csc.setKeyspace("musicdb")

val result = csc.sql("SELECT country, COUNT(*) AS anzahl " +
  "FROM artists GROUP BY country " +
  "ORDER BY anzahl DESC")

result.collect.foreach(println)

Spark SQL - Cassandra Example_


Spark Streaming


• Real-time processing using micro batches

• Supported sources: TCP, S3, Kafka, Twitter, ...

• Data as Discretized Streams (DStreams)

• Same programming model as for batches

• All operations of the generic RDDs, plus SQL & MLlib

• Stateful operations & sliding windows

Stream Processing With Spark Streaming_


import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(1))

val stream = ssc.socketTextStream("127.0.0.1", 9999)

stream.map(x => 1).reduce(_ + _).print()

ssc.start()

// await manual termination or error
ssc.awaitTermination()

// manual termination
ssc.stop()

Spark Streaming - Example_
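The sliding windows mentioned above could look roughly like this; the window sizes and the word-count logic are illustrative, not from the talk:

import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(1))
ssc.checkpoint("/tmp/spark-checkpoint")   // required for windowed/stateful operations

val words = ssc.socketTextStream("127.0.0.1", 9999).flatMap(_.split(" "))

// word counts over the last 30 seconds, recomputed every 10 seconds
words.map(w => (w, 1))
  .reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))
  .print()

ssc.start()
ssc.awaitTermination()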


Spark MLLib


• Fully integrated in Spark
  • scalable
  • Scala, Java & Python APIs
  • use with Spark Streaming & Spark SQL

• Packages various algorithms for machine learning

• Includes
  • Clustering
  • Classification
  • Prediction
  • Collaborative Filtering

• Still under development
  • performance, algorithms

Spark MLLib_


MLLib Example - Collaborative Filtering_


import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Load and parse the data (userid, itemid, rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)

MLLib Example - Collaborative Filtering using ALS_


// Evaluate the model on the rating data
val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }

val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}

val ratesAndPredictions = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)

val MSE = ratesAndPredictions.map {
  case ((user, product), (r1, r2)) => val err = r1 - r2; err * err
}.mean()

println("Mean Squared Error = " + MSE)

MLLib Example - Collaborative Filtering using ALS_


Use Cases


• In particular for huge amounts of external data

• Support for CSV, TSV, XML, JSON and others

Use Cases for Spark and Cassandra_


Data Loading

case class User(id: java.util.UUID, name: String)

val users = sc.textFile("users.csv")
  .repartition(2 * sc.defaultParallelism)
  .map(line => line.split(",") match {
    case Array(id, name) => User(java.util.UUID.fromString(id), name)
  })

users.saveToCassandra("keyspace", "users")

Validate consistency in a Cassandra database

• Syntactic
  • uniqueness (only relevant for columns not in the PK; see the sketch below)
  • referential integrity
  • integrity of the duplicates

• Semantic
  • business or application constraints
  • e.g.: at least one genre per movie, a maximum of 10 tags per blog post

Use Cases for Spark and Cassandra_


Validation & Normalization

• Modelling, mining, transforming, ...

• Use cases
  • Recommendation
  • Fraud detection
  • Link analysis (social networks, web)
  • Advertising
  • Data stream analytics (Spark Streaming)
  • Machine learning (Spark ML)

Use Cases for Spark and Cassandra_


Analyses (Joins, Transformations,..)

• Changes to existing tables
  • a new table is required when changing the primary key
  • otherwise changes can be performed in place

• Creating new tables
  • data derived from existing tables
  • support for new queries

• Use the Cassandra connector in Spark (see the sketch below)

Use Cases for Spark and Cassandra_


Schema Migration
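A migration of this kind might be sketched as a read-transform-write with the connector; the target table name and key layout are assumptions, and the new table has to be created in Cassandra beforehand:

// copy an existing table into a new table that uses a different primary key
case class Movie(title: String, year: Int, director: String)   // assumed columns

sc.cassandraTable[Movie]("movie", "movies")          // existing table
  .saveToCassandra("movie", "movies_by_year")        // new table with the new primary key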

Questions?

Matthias Niehoff, IT-Consultant


codecentric AG Zeppelinstraße 2 76185 Karlsruhe, Germany

mobile: +49 (0) 172.1702676 [email protected]

www.codecentric.de blog.codecentric.de

matthiasniehoff