Real-Time Analytics on Transactional Data
= Cassandra + Spark (20/02/15)
Victor Coustenoble, Engineer · [email protected] · @vizanalytics
How do you use Cassandra?
• By controlling your energy consumption
• By streaming movies
• By browsing websites
• By shopping online
• By paying with your smartphone
• By playing popular video games
• Collections / Playlists
• Recommendation / Personalization
• Fraud Detection
• Messaging
• Connected Devices (IoT)
Overview
• Founded in April 2010
• Offices: Santa Clara, Austin, New York, London, Paris, Sydney
• 400+ employees
• 500+ customers
Straightening the road

  DataStax Enterprise             Relational databases
  ----------------------------    ----------------------
  CQL                             SQL
  OpsCenter / DevCenter           Management tools
  DSE for search & analytics      Integration
  Security                        Security
  Support, consulting & training  30 years of ecosystem
Apache Cassandra™
• Apache Cassandra™ is an open-source, distributed NoSQL database built for modern, online, mission-critical applications that must scale massively.
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless (peer-to-peer), no Single Point of Failure (no SPOF)
• Distributed, with data-center awareness
• 100% available
• Massively scalable
• Linear scale-out
• High performance (reads AND writes)
• Multi data center
• Simple to operate
• CQL query language (SQL-like)
• OpsCenter / DevCenter tools
Dynamo + BigTable lineage:
• BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
• Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[Diagram: a 5-node Cassandra ring, Node 1 through Node 5]
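The ring above distributes data by hashing each partition key to a token and walking clockwise to collect the RF replica nodes. A minimal, illustrative sketch in plain Scala (the toy hash, token values, and node names are ours; real Cassandra uses the Murmur3 partitioner and virtual nodes):

```scala
// Illustrative sketch of Cassandra-style ring placement (NOT the real partitioner).
// Each node owns a token; a partition key hashes to a token and is stored on the
// first node whose token is >= the key's token (wrapping), plus the next RF-1 nodes.
object RingSketch {
  val nodes: Vector[(String, Int)] =            // (node name, token), sorted by token
    Vector(("Node1", 0), ("Node2", 20), ("Node3", 40), ("Node4", 60), ("Node5", 80))

  def token(partitionKey: String): Int =
    math.abs(partitionKey.hashCode) % 100       // toy hash into token space [0, 100)

  def replicas(partitionKey: String, rf: Int): Seq[String] = {
    val t = token(partitionKey)
    val start = nodes.indexWhere(_._2 >= t) match {
      case -1 => 0                              // wrap around the ring
      case i  => i
    }
    (0 until rf).map(i => nodes((start + i) % nodes.size)._1)
  }
}
```

With RF = 3, every key lands on three consecutive nodes of the ring, which is why losing one node leaves two live copies.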
High Availability and Consistency
• The failure of a single node must not bring the system down
• Consistency is chosen on the client side, per request
• Replication Factor (RF) + Consistency Level (CL) = success criteria
• Example:
  • RF = 3
  • CL = QUORUM (= 51% of the replicas)
©2014 DataStax Confidential. Do not distribute without consent.
[Diagram: a write at CL=QUORUM with RF=3. The coordinator sends the write to the three replicas in parallel (1st copy on Node 1, 2nd copy on Node 2, 3rd copy on Node 3); each replica acks in ~5–12 μs.]
> 51% of the replicas responded, so the request succeeds.
CL(Read) + CL(Write) > RF => Immediate/Strong Consistency
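The RF/CL arithmetic can be made concrete in a few lines. This is a sketch of the rule only (the helper names are ours, not a DataStax API); QUORUM is floor(RF / 2) + 1 replicas:

```scala
// Sketch of the consistency arithmetic from the slide:
// QUORUM = floor(RF / 2) + 1, and CL(read) + CL(write) > RF guarantees
// that the read and write replica sets overlap, i.e. strong consistency.
object ConsistencyMath {
  def quorum(rf: Int): Int = rf / 2 + 1

  def isStronglyConsistent(rf: Int, readReplicas: Int, writeReplicas: Int): Boolean =
    readReplicas + writeReplicas > rf
}
```

With RF = 3, QUORUM = 2, so QUORUM reads plus QUORUM writes (2 + 2 > 3) are strongly consistent, while ONE + ONE (1 + 1 <= 3) are only eventually consistent.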
DataStax Enterprise
• Certified, enterprise-ready Cassandra
• Enterprise features: Security, Analytics, Search, Visual Monitoring, In-Memory, Management Services, Dev. IDE & Drivers
• Operational confidence: Professional Services, Support & Training
DataStax Enterprise - Analytics
• Designed to run analytics on Cassandra data
• There are 4 ways to do analytics on Cassandra data:
1. Search (Solr)
2. Batch analytics (Hadoop)
3. Batch analytics with external tools (Cloudera, Hortonworks)
4. Real-time analytics
Why Spark on Cassandra?
• Analytics on transactional data and operational applications
• Data model independent queries
• Cross-table operations (JOIN, UNION, etc.)
• Complex analytics (e.g. machine learning)
• Data transformation, aggregation, etc.
• Stream processing
• Better performance than Hadoop MapReduce
Real-time Big Data
• Data enrichment
• Batch processing
• Machine learning
• Pre-computed aggregates
• No ETL
Real-Time Big Data Use Cases
• Recommendation Engine
• Internet of Things
• Fraud Detection
• Risk Analysis
• Buyer Behaviour Analytics
• Telematics, Logistics
• Business Intelligence
• Infrastructure Monitoring
• …
Spark Components
• Shark or Spark SQL (structured data)
• Spark Streaming (real-time)
• MLlib (machine learning)
• GraphX (graph processing)
• Spark (general execution engine), Cassandra compatible
DSE Spark Integration Architecture

[Diagram: four nodes, each running Cassandra alongside a Spark Worker (JVM) with its executors; one node also hosts the Spark Master (JVM), to which the application driver connects]
Spark Cassandra Connector

[Diagram: the user application runs inside the Spark executor and reaches the Cassandra nodes through the Spark-Cassandra Connector, which is built on the DataStax C* Java Driver]

Cassandra Spark Driver
• Cassandra tables exposed as Spark RDDs
• Load data from Cassandra into Spark
• Write data from Spark to Cassandra
• Object mapper: mapping of C* tables and rows to Scala objects
• Type conversions: all Cassandra types supported and converted to Scala types
• Server-side data selection
• Virtual Nodes support
• Scala and Java APIs
DSE Spark Interactive Shell

$ dse spark
...
Spark context available as sc.
HiveSQLContext available as hc.
CassandraSQLContext available as csc.

scala> sc.cassandraTable("test", "kv")
res5: com.datastax.spark.connector.rdd.CassandraRDD[com.datastax.spark.connector.CassandraRow] = CassandraRDD[2] at RDD at CassandraRDD.scala:48

scala> sc.cassandraTable("test", "kv").collect
res6: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{k: 1, v: foo})

cqlsh> select * from test.kv;

 k | v
---+-----
 1 | foo

(1 rows)
Connecting to Cassandra

// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact point
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)
Reading Data

val table = sc
  .cassandraTable[CassandraRow]("db", "tweets") // row representation, keyspace, table
  .select("user_name", "message")               // server-side column selection
  .where("user_name = ?", "ewa")                // server-side row selection
Writing Data

CREATE TABLE test.words (word TEXT PRIMARY KEY, count INT);

val collection = sc.parallelize(Seq(("foo", 2), ("bar", 5)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))

cqlsh:test> select * from words;

 word | count
------+-------
  bar |     5
  foo |     2

(2 rows)
Mapping Rows to Objects

CREATE TABLE test.cars (
  id text PRIMARY KEY,
  model text,
  fuel_type text,
  year int
);

case class Vehicle(
  id: String,
  model: String,
  fuelType: String,
  year: Int
)

sc.cassandraTable[Vehicle]("test", "cars").toArray
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

* Mapping rows to Scala case classes
* CQL underscore_case columns mapped to Scala camelCase properties
* Custom mapping functions (see docs)
Type Mapping

CQL Type       | Scala Type
---------------|------------------------------------------------------------
ascii          | String
bigint         | Long
boolean        | Boolean
counter        | Long
decimal        | BigDecimal, java.math.BigDecimal
double         | Double
float          | Float
inet           | java.net.InetAddress
int            | Int
list           | Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map            | Map, TreeMap, java.util.HashMap
set            | Set, TreeSet, java.util.HashSet
text, varchar  | String
timestamp      | Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid       | java.util.UUID
uuid           | java.util.UUID
varint         | BigInt, java.math.BigInteger

* Nullable values are mapped to Option.
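The last line of the table (nullable values mapped to Option) can be illustrated in plain Scala. The Car case class and the rawRow map below are hypothetical stand-ins for a connector row, not the connector's actual object mapper:

```scala
// Sketch of how a nullable CQL column surfaces as a Scala Option.
// `rawRow` stands in for a Cassandra row; a NULL column becomes None.
case class Car(id: String, model: Option[String], year: Int)

val rawRow: Map[String, Any] = Map("id" -> "KF334L", "model" -> null, "year" -> 2009)

val car = Car(
  id    = rawRow("id").asInstanceOf[String],
  model = Option(rawRow("model").asInstanceOf[String]), // null -> None
  year  = rawRow("year").asInstanceOf[Int]
)
// A non-null model column would arrive as Some("Ford Mondeo") instead.
```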
Shark
• SQL query engine on top of Spark
• Not part of Apache Spark
• Hive compatible (JDBC, UDFs, types, metadata, etc.)
• Supports in-memory tables
• Available as a part of DataStax Enterprise
Spark SQL
• Spark SQL supports a subset of the SQL-92 language
• Spark SQL is optimized for Spark internals (e.g. RDDs), with better performance than Shark
• Support for in-memory computation
• Usable from the Spark command line
• Mapping of Cassandra keyspaces and tables
• Read and write on Cassandra tables
Usage of Spark SQL & HiveQL queries

import com.datastax.spark.connector._

// Connect to the Spark cluster
val conf = new SparkConf(true)
...
val sc = new SparkContext(conf)

// Create a Cassandra SQL context
val cc = new CassandraSQLContext(sc)

// Execute an SQL query
val rdd = cc.sql("INSERT INTO ks.t1 SELECT c1,c2 FROM ks.t2")

// Execute an HQL query
val rdd = cc.hql("SELECT * FROM keyspace.table JOIN ... WHERE ...")
Spark Streaming
• For real-time analytics
• Push or pull model
• Stream TO and FROM Cassandra
• Micro batching (each batch represented as RDD)
• Fault tolerant
• Data processed in small batches
• Exactly-once processing
• Unified stream and batch processing framework
• Supports Kafka, Flume, ZeroMQ, Kinesis, MQTT producers
Usage of Spark Streaming
• Due to the unifying Spark architecture, portions of batch and streaming development can be reused
• Since Spark Streaming is backed by Cassandra, there is no need to depend on solutions like Apache ZooKeeper™ in production

import com.datastax.spark.connector.streaming._

// Spark connection options
val conf = new SparkConf(true)
...

// streaming with a 1-second batch window
val ssc = new StreamingContext(conf, Seconds(1))

// stream input
val lines = ssc.socketTextStream(serverIP, serverPort)

// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// stream output
wordCounts.saveToCassandra("test", "words")

// start processing
ssc.start()
ssc.awaitTermination()
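The same flatMap / map / reduceByKey word-count logic can be exercised on plain Scala collections before wiring it to a DStream; each element of `batches` below plays the role of one 1-second micro-batch (the input lines are made up for illustration):

```scala
// Pure-Scala mock of the streaming word count: each element of `batches`
// stands in for one micro-batch (an RDD in Spark Streaming terms).
val batches = Seq(
  Seq("spark and cassandra", "cassandra rules"), // batch 1
  Seq("spark streaming")                         // batch 2
)

def countWords(batch: Seq[String]): Map[String, Int] =
  batch
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // reduceByKey(_ + _)

val perBatchCounts = batches.map(countWords)
// These per-batch maps are what the job above would save to test.words.
```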
Python API

$ dse pyspark
Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Python version 2.7.8 (default, Oct 20 2014 15:05:19)
SparkContext available as sc.
>>> sc.cassandraTable("test", "kv").collect()
[Row(k=1, v=u'foo')]
DataStax Enterprise + Spark Special Features
• Easy setup and config
  • no need to set up a separate Spark cluster
  • no need to tweak classpaths or config files
• High availability of the Spark Master
• Enterprise security
  • Password / Kerberos / LDAP authentication
  • SSL for all Spark-to-Cassandra connections
• CFS integration (no-SPOF distributed file system)
• Cassandra access through the Spark Python API
• Certified and supported on Cassandra
• Shark availability
DataStax Enterprise - High Availability
• All nodes are Spark Workers
• By default resilient to Worker failures
• The first Spark node is promoted to Spark Master (state saved in CFS, no SPOF)
• A standby Master is promoted on failure (the new Spark Master reconnects to the Workers and the driver app, and the job continues)

With DataStax Enterprise:

[Diagram: every C* node runs a Spark Worker; one node also runs the Spark Master, whose state is stored in C*, with a spare Master for H/A]
Spark Use Cases
• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion
DataStax Enterprise

[Diagram: the DSE platform. Operational applications run alongside real-time search, real-time analytics, and batch analytics; integration with an external Hadoop distribution (Cloudera, Hortonworks) and with RDBMS via transformations; OpsCenter services provide Hadoop monitoring and operations; everything runs on a Cassandra cluster (nodes ring, column-family storage) offering high performance, always-on availability, massive scalability, advanced security, and in-memory options]
How to Spark on Cassandra?
DataStax Cassandra Spark driver
https://github.com/datastax/cassandra-driver-spark
Compatible with:
• Spark 1.2
• Cassandra 2.0.x and 2.1.x
• DataStax Enterprise 4.5 and 4.6
DataStax Enterprise 4.6 = Cassandra 2.0 + Driver + Spark 1.1
Spark 1.2 in the next DSE 4.7 release (March)
Thank you! Questions?
We power the big data apps that transform business.
@vizanalytics