An efficient data mining solution by integrating Spark and Cassandra

37
AN EFFICIENT DATA MINING SOLUTION

description

Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain a better result than using Spark over HDFS because Cassandra´s philosophy is much closer to RDD's philosophy than what HDFS is. The goal with Cassandra is to have a system that mines all the information stored in C* in a much more efficient way than having the information stored in HDFS. Cassandra data storage and Spark data mining power: an unrivalled mix.

Transcript of An efficient data mining solution by integrating Spark and Cassandra

Page 1: An efficient data mining solution by integrating Spark and Cassandra

AN EFFICIENT DATA MINING SOLUTION

Page 2: An efficient data mining solution by integrating Spark and Cassandra

Hadoop?

Page 3: An efficient data mining solution by integrating Spark and Cassandra

Cassandra?

Page 4: An efficient data mining solution by integrating Spark and Cassandra

Spark?

Page 5: An efficient data mining solution by integrating Spark and Cassandra

Stratio Deep

An efficient data mining solution

“Two and two are four?

Sometimes… Sometimes they are five.”

G. Orwell

#StratioBD

Page 6: An efficient data mining solution by integrating Spark and Cassandra
Page 7: An efficient data mining solution by integrating Spark and Cassandra

Goals

• Why do you need Cassandra?• What is the problem?• Why do you need Spark?• How do they work together?

#StratioBD

Page 8: An efficient data mining solution by integrating Spark and Cassandra

Cassandra

#StratioBD

• Based on DynamoDB…• Replication, Key/Value, P2P• And based on Big Table…• Column oriented

Page 9: An efficient data mining solution by integrating Spark and Cassandra

ROBUST FAST EFFICENT

Page 10: An efficient data mining solution by integrating Spark and Cassandra

NO BOTTLENECK REPLICATE

DDECENTRALIZED

Page 11: An efficient data mining solution by integrating Spark and Cassandra

Another Databas

e?

Page 12: An efficient data mining solution by integrating Spark and Cassandra

Why?

Page 13: An efficient data mining solution by integrating Spark and Cassandra

One User – Lot of data

Case A

#StratioBD

Page 14: An efficient data mining solution by integrating Spark and Cassandra

Many User – Few data

Case B

#StratioBD

Page 15: An efficient data mining solution by integrating Spark and Cassandra

Many user – Lot of data

Case C

#StratioBD

Page 16: An efficient data mining solution by integrating Spark and Cassandra

Crawler app

#StratioBD

Cassandra, I choose you

100M

Indexedpages

3kreads

Query time

< 1s

Page 17: An efficient data mining solution by integrating Spark and Cassandra

But…

Page 18: An efficient data mining solution by integrating Spark and Cassandra

Marketingwalks in

Page 19: An efficient data mining solution by integrating Spark and Cassandra

New query

“I need to find all the reference to the domain

ACME. I need the answer by Friday.”

#StratioBD

Page 20: An efficient data mining solution by integrating Spark and Cassandra

Problem

Cassandra is not well suited to resolved this

type of queries

You need to design the schema with the query

in mind

#StratioBD

Page 21: An efficient data mining solution by integrating Spark and Cassandra

ChallengeAccepted

Page 22: An efficient data mining solution by integrating Spark and Cassandra

What options do we have?

• Run Hive Query on top of C*• Write an ETL script and load data into

another DB• Clone the cluster

#StratioBD

Page 23: An efficient data mining solution by integrating Spark and Cassandra

What options do we have?

Run Hive Query on top of C*

Write ETL scripts and load into another DB

Clone the cluster

#StratioBD

Page 24: An efficient data mining solution by integrating Spark and Cassandra

And now… what can we do?

“We can't solve problems by using the same kind of thinking

we used when we created them”

#StratioBD

Albert Einstein

Page 25: An efficient data mining solution by integrating Spark and Cassandra

• Alternative to MapReduce• A low latency cluster computing system• For very large datasets• Create by UC Berkeley AMP Lab in 2010.• May be 100 times faster than MapReduce for:

Interactive algorithms. Interactive data mining

Spark

#StratioBD

Page 26: An efficient data mining solution by integrating Spark and Cassandra

Logistic regression inSpark vs Hadoop

SOURCE | http://spark.incubator.apache.org/

#StratioBD

Page 27: An efficient data mining solution by integrating Spark and Cassandra

WHO USES SPARK?

Page 28: An efficient data mining solution by integrating Spark and Cassandra

Spark and Cassandra

Integration points

#StratioBD

Page 29: An efficient data mining solution by integrating Spark and Cassandra

Cassandra’s HDFS abstraction layer

Advantantages:• Easily integrates with legacy systems.

Drawbacks:• Very high-level: no access to low level Cassandra’s features.• Questionable performance.

INTEGRATION POINTS: HDFS OVER CASSANDRA

#StratioBD

Page 30: An efficient data mining solution by integrating Spark and Cassandra

Cassandra’s Hadoop Interface• Thrift protocol• CQL3 (our implementation)

Uses the novel Cassandra’s

CqlPagingInputFormat

INTEGRATION POINTS: HDFS OVER CASSANDRA

#StratioBD

Page 31: An efficient data mining solution by integrating Spark and Cassandra

• Supports CQL3 features• Respects data locality • Good compromise between performance / implementation complexity

CQL3 Integration

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

#StratioBD

Page 32: An efficient data mining solution by integrating Spark and Cassandra

CQL3 Integration (II)

Provides a Java friendly API:

• Developers map Column Families to custom serializable

POJOs

• StratioDeep wraps the complexity of performing Spark

calculations directly over the user provided POJOs.

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

#StratioBD

Page 33: An efficient data mining solution by integrating Spark and Cassandra

Demo

Page 34: An efficient data mining solution by integrating Spark and Cassandra

Drawbacks:

• Still not preforming as well as we’d like

Uses Cassandra’s Hadoop Interface• No analyst-friendly interface:

No SQL-like query features

CQL3 Integration (III)

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3#StratioBD

Page 35: An efficient data mining solution by integrating Spark and Cassandra

Bring the integration to another level:

• Dump Cassandra’s Hadoop Interface• Direct access to Cassandra’s SSTable(s) files.• Extend Cassandra’s CQL3 to make use of Spark’s

distributed data processing power

Future extensions

What are we currently working on?

#StratioBD

Page 36: An efficient data mining solution by integrating Spark and Cassandra

#StratioBD

Conclusion

Page 37: An efficient data mining solution by integrating Spark and Cassandra

THANKS