Presto and ScyllaDB
Tzach Livyatan, ScyllaDB
Eyal Gutkind, ScyllaDB
Agenda
• What is Presto?
• Why Presto?
• Scylla + Presto
  ▸ Connector
  ▸ Examples
• What's Next?
What is Presto?
• Distributed ANSI SQL Query Engine for Big Data
• Developed by Facebook in 2012
• Data sources: HDFS, S3, Cassandra, MySQL, Kafka,
PostgreSQL, Redis and Scylla
• Open Source
• Java-based
Why Presto?
Presto: Interactive Data Exploration
Spark: Real-Time Analytics, Machine Learning, Iterative
Source: http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
Why Presto?
• ANSI SQL
• Extensible - multiple data sources
• Fast (compared to Hive)
• Custom engine designed to support SQL semantics

Source: http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
Who Uses Presto?
Presto Architecture
Source: https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)
Presto Cassandra Connector
• DataStream API
  ▪ CQL
• DataLocation API
  ▪ Thrift: describe_ring and describe_splits_ex verbs, if number of partitions > 200
• Metadata API
  ▪ CQL: get table layout
Presto Cassandra Connector - Configuration
• cassandra.contact-points
• cassandra.consistency-level
• cassandra.username / cassandra.password
• limit-for-partition-key-select
• cassandra.fetch-size
• cassandra.load-policy.use-dc-aware (default false), dc-aware.local-dc
Full list: https://prestodb.io/docs/current/connector/cassandra.html
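In Presto these properties live in a catalog file on each coordinator and worker node; a minimal sketch, with contact points, credentials, and values that are purely illustrative:

```properties
# etc/catalog/cassandra.properties (illustrative values)
connector.name=cassandra
cassandra.contact-points=10.0.0.1,10.0.0.2
cassandra.consistency-level=LOCAL_QUORUM
cassandra.username=presto
cassandra.password=secret
cassandra.fetch-size=5000
```

The file name (here `cassandra.properties`) becomes the catalog name used in queries, e.g. `cassandra.mykeyspace.users`.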
Scylla Example - CQL
CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
use mykeyspace;
CREATE TABLE users (user_id int PRIMARY KEY, fname text, lname text);

insert into users (user_id, fname, lname) values (1, 'tzach', 'livyatan');
insert into users (user_id, fname, lname) values (2, 'dor', 'laor');
insert into users (user_id, fname, lname) values (3, 'shlomi', 'laor');
insert into users (user_id, fname, lname) values (4, 'shlomi', 'livne');
insert into users (user_id, fname, lname) values (6, 'avi', 'kivity');
Scylla Example - Presto SQL
SELECT * FROM cassandra.mykeyspace.users where user_id >= 2 and user_id <= 3;
 user_id | fname  | lname
---------+--------+-------
       2 | dor    | laor
       3 | shlomi | laor
(2 rows)
Scylla / Presto Example

CREATE TABLE air_quality_data (
    sensor_id text,
    time timestamp,
    co_ppm int,
    PRIMARY KEY (sensor_id, time)
);
INSERT INTO air_quality_data(sensor_id, time, co_ppm) VALUES ('my_home', '2016-08-30 07:01:00', 17);
INSERT INTO air_quality_data(sensor_id, time, co_ppm) VALUES ('my_home', '2016-08-30 07:01:01', 18);
Scylla / Presto Example

select histogram(co_ppm) as hist from cassandra.mykeyspace.air_quality_data where sensor_id='my_home';

              hist
--------------------------------
 {17=1, 18=1, 19=1, 20=2, 31=1}
select sensor_id, avg(co_ppm) as AVG from cassandra.mykeyspace.air_quality_data group by sensor_id;

 sensor_id |        avg
-----------+--------------------
 your_home | 629.2857142857143
 my_home   | 20.833333333333332
Scylla / Presto Example - Join

cqlsh:mykeyspace> CREATE TABLE address (user_id int PRIMARY KEY, number int, street text, city text);

cqlsh:mykeyspace> insert into address (user_id, number, street, city) values (1, 100, 'dizingof', 'tel aviv');
presto:default> select * from cassandra.mykeyspace.users us JOIN
cassandra.mykeyspace.address ad ON us.user_id = ad.user_id;
 user_id | fname | lname    | user_id | city     | number | street
---------+-------+----------+---------+----------+--------+----------
       1 | tzach | livyatan |       1 | tel aviv |    100 | dizingof
Try for Yourself!

docker run --name some-scylla-presto -d tzachl/scylla-and-presto-image
docker exec -it some-scylla-presto cqlsh
cqlsh>
docker exec -it some-scylla-presto ./presto --server localhost:8080 --catalog cassandra --schema default
presto:default>
Presto SQL - Aggregate Functions examples
• checksum(x) - an order-insensitive checksum of the given values
• count(x) - number of input rows
• count_if(x) - number of TRUE input values
• max_by(x, y) - value of x associated with the maximum value of y over all input values
• histogram(x) - a map containing the count of the number of times each input value occurs
• approx_distinct(x) - approximate number of distinct input values; provides an approximation of count(DISTINCT x). Zero is returned if all input values are null.
• corr(y, x) - correlation coefficient of input values
• stddev_samp(x) - sample standard deviation of all input values
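The semantics of a few of these aggregates can be sketched in plain Python over a hypothetical sample of co_ppm readings (the values are illustrative, not taken from the examples above):

```python
from collections import Counter

values = [17, 18, 19, 20, 20, 31]  # hypothetical co_ppm readings

count = len(values)                                  # count(x): number of input rows
count_if_over_19 = sum(1 for v in values if v > 19)  # count_if(v > 19)
hist = dict(Counter(values))                         # histogram(x): value -> occurrence count
distinct = len(set(values))                          # approx_distinct(x) - exact here;
                                                     # Presto uses a HyperLogLog sketch,
                                                     # so its answer is only approximate
```

Note the last line is the one place the sketch diverges from Presto: approx_distinct trades exactness for constant memory, which matters at Big Data scale.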
Airpal (Airbnb)

A web-based query execution tool built on top of Facebook's PrestoDB.

Source: https://blogs.aws.amazon.com/bigdata/post/Tx1BF2DN6KRFI27/Analyze-Data-with-Presto-and-Airpal-on-Amazon-EMR
What's Next?
• Try it yourself https://hub.docker.com/r/tzachl/scylla-and-presto-image
• Performance testing
• Future Optimizations
  ▪ Push query lower
  ▪ Pull data faster
Agenda
• What is Spark?
• Why Spark?
• Scylla + Spark
• What's Next?
What is Spark?
What if I don’t Know?
Find me at the happy hour tonight, we have beer and wine!
Why Spark and Scylla?
• Faster Analytics with In-memory execution
• Close to Real-Time analytics on transactional data
• Iterative
• Resource efficiency for multiple workloads
Spark Architecture
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Spark & Spark RDDs @ ScyllaDB (Resilient Distributed Datasets)
• Understand your data model

CREATE TABLE sensordata (s1data text, s2data text, timestmp timestamp, sregion text, stype text, PRIMARY KEY (s1data, s2data, timestmp));

CREATE TABLE sensordata (s1data text, s2data text, timestmp timestamp, sregion text, stype text, PRIMARY KEY (timestmp, s1data, s2data));
Data Model Impact on Data Load
cassandra-stress user profile=myexample.yaml no-warmup ops\(insert=1\) n=1000000 -rate threads=1000 -node $node
PRIMARY KEY (s1data, s2data, timestmp)
PRIMARY KEY (timestmp, s1data, s2data)
Try to Maintain Data Locality - Co-location
Data Locality - Dedicated Clusters

Review your settings for:
• spark.locality.wait.* (process, node, rack)
  → Default is 3s
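These knobs go in conf/spark-defaults.conf; a sketch with the 3s defaults spelled out (values illustrative - tune down if tasks sit waiting for busy, data-local executors):

```properties
# conf/spark-defaults.conf (defaults shown)
spark.locality.wait          3s
spark.locality.wait.process  3s
spark.locality.wait.node     3s
spark.locality.wait.rack     3s
```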
Network Speed and Latency
Spark & ScyllaDB, CPU Settings
• @Scylla: --cpu-set, --smp
• @Spark: SPARK_WORKER_CORES
• Divide system cores based on expected workload
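For the 32-core split used later in this deck (24 cores for Scylla, 8 for Spark), the settings above could look roughly like this; the core ranges are illustrative, and note Seastar spells the pinning flag --cpuset:

```
# Scylla side: 24 shards, pinned to cores 8-31
scylla --smp 24 --cpuset 8-31

# Spark side, in conf/spark-env.sh: leave the remaining 8 cores to the worker
export SPARK_WORKER_CORES=8
```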
Spark & Scylla, Memory Settings

@Spark
• SPARK_WORKER_MEMORY - per worker node
• spark.executor.memory - set your specific application's memory consumption

@Scylla
• Will just take whatever you give us :)
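Given the 244GB servers described later (64GB per Spark worker, 128GB for Scylla), the two settings might be sketched as follows; the executor figure is an illustrative assumption:

```
# conf/spark-env.sh - cap each worker node
export SPARK_WORKER_MEMORY=64g

# per application, e.g. on submit
spark-submit --conf spark.executor.memory=48g <your-app.jar>
```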
Spark Building and C* Connector
• Environment used in above examples:
  ▪ AWS i2.8xlarge x3 for Scylla & Spark (co-located)
    - Scylla 1.3, Spark 1.6.2
    - Maven 3.3.9
    - Building a Spark standalone cluster on EC2
  ▪ Each server has 32 cores
    - 24 cores → Scylla
    - 8 cores → Spark
  ▪ Each server has 244GB RAM
    - 64GB for each Spark worker
    - 128GB for Scylla
• Use Apache Cassandra connector Ver. 1.6
• Using Spark Standalone Cluster deployment
• Think Resources - CPU and Memory
• Better modeling, easier deployment, faster analytics
• Data locality can be a blessing if managed correctly
  ▪ Scylla's optimal sharding enables data ingestion without compromising analytics performance
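One common way to load the version 1.6 connector mentioned above is spark-shell's --packages flag; the exact artifact coordinates (Scala 2.10 build) are an assumption here:

```
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0
```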
What's Next?
• Try it yourself
  ▪ Here is how to do it: http://www.scylladb.com/kb/scylla-and-spark-integration/
• Performance testing and use cases
Tell us about your experience with Scylla and Spark
Thank You!

Try Presto with Scylla now:
https://hub.docker.com/r/tzachl/scylla-and-presto-image/
Contact: [email protected]