Presto and ScyllaDB
Tzach Livyatan, ScyllaDB
Eyal Gutkind, ScyllaDB
Agenda
• What is Presto?
• Why Presto?
• Scylla + Presto
  ▸ Connector
  ▸ Examples
• What's Next?
What is Presto?
• Distributed ANSI SQL Query Engine for Big Data
• Developed by Facebook in 2012
• Data sources: HDFS, S3, Cassandra, MySQL, Kafka,
PostgreSQL, Redis and Scylla
• Open Source
• Java-based
Why Presto?
Presto: Interactive Data Exploration
Spark: Real-Time Analytics, Machine Learning, Iterative
Source: http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
Why Presto?
• ANSI SQL
• Extensible - multiple data sources
• Fast (compared to Hive)
• Custom engine designed to support SQL semantics

Source: http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
Who Uses Presto?
Presto Architecture
Source: https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)
Presto Cassandra Connector
• DataStream API
  ▪ CQL
• DataLocation API
  ▪ Thrift: describe_ring and describe_splits_ex verbs, if number of partitions > 200
• Metadata API
  ▪ CQL: get table layout
Presto Cassandra Connector - Configuration
• cassandra.contact-points
• cassandra.consistency-level
• cassandra.username / cassandra.password
• limit-for-partition-key-select
• cassandra.fetch-size
• cassandra.load-policy.use-dc-aware (default false), dc-aware.local-dc
Full list: https://prestodb.io/docs/current/connector/cassandra.html
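In Presto these properties live in a catalog file on each coordinator and worker node; a minimal sketch, with contact points, credentials, and values that are purely illustrative:

```properties
# etc/catalog/cassandra.properties (illustrative values)
connector.name=cassandra
cassandra.contact-points=10.0.0.1,10.0.0.2
cassandra.consistency-level=LOCAL_QUORUM
cassandra.username=presto
cassandra.password=secret
cassandra.fetch-size=5000
```

The file name (here `cassandra.properties`) becomes the catalog name used in queries, e.g. `cassandra.mykeyspace.users`.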
Scylla Example - CQL
CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
use mykeyspace;
CREATE TABLE users (user_id int PRIMARY KEY, fname text, lname text);

insert into users (user_id, fname, lname) values (1, 'tzach', 'livyatan');
insert into users (user_id, fname, lname) values (2, 'dor', 'laor');
insert into users (user_id, fname, lname) values (3, 'shlomi', 'laor');
insert into users (user_id, fname, lname) values (4, 'shlomi', 'livne');
insert into users (user_id, fname, lname) values (6, 'avi', 'kivity');
Scylla Example - Presto SQL
SELECT * FROM cassandra.mykeyspace.users where user_id >= 2 and user_id <= 3;
 user_id | fname  | lname
---------+--------+-------
       2 | dor    | laor
       3 | shlomi | laor
(2 rows)
Scylla / Presto Example

CREATE TABLE air_quality_data (
    sensor_id text,
    time timestamp,
    co_ppm int,
    PRIMARY KEY (sensor_id, time)
);
INSERT INTO air_quality_data(sensor_id, time, co_ppm) VALUES ('my_home', '2016-08-30 07:01:00', 17);
INSERT INTO air_quality_data(sensor_id, time, co_ppm) VALUES ('my_home', '2016-08-30 07:01:01', 18);
Scylla / Presto Example

select histogram(co_ppm) as hist from cassandra.mykeyspace.air_quality_data where sensor_id='my_home';

              hist
--------------------------------
 {17=1, 18=1, 19=1, 20=2, 31=1}
select sensor_id, avg(co_ppm) as AVG from cassandra.mykeyspace.air_quality_data group by sensor_id;

 sensor_id |        avg
-----------+--------------------
 your_home | 629.2857142857143
 my_home   | 20.833333333333332
Scylla / Presto Example - Join

cqlsh:mykeyspace> CREATE TABLE address (user_id int PRIMARY KEY, number int, street text, city text);

cqlsh:mykeyspace> insert into address (user_id, number, street, city) values (1, 100, 'dizingof', 'tel aviv');
presto:default> select * from cassandra.mykeyspace.users us JOIN
cassandra.mykeyspace.address ad ON us.user_id = ad.user_id;
 user_id | fname | lname    | user_id | city     | number | street
---------+-------+----------+---------+----------+--------+----------
       1 | tzach | livyatan |       1 | tel aviv |    100 | dizingof
Try for Yourself!

docker run --name some-scylla-presto -d tzachl/scylla-and-presto-image
docker exec -it some-scylla-presto cqlsh
cqlsh>
docker exec -it some-scylla-presto ./presto --server localhost:8080 --catalog cassandra --schema default
presto:default>
Presto SQL - Aggregate Functions examples
• checksum(x) - an order-insensitive checksum of the given values
• count(x) - number of input rows
• count_if(x) - number of TRUE input values
• max_by(x, y) - value of x associated with the maximum value of y over all input values
• histogram(x) - a map containing the count of the number of times each input value occurs
• approx_distinct(x) - approximate number of distinct input values; provides an approximation of count(DISTINCT x). Zero is returned if all input values are null.
• corr(y, x) - correlation coefficient of input values
• stddev_samp(x) - sample standard deviation of all input values
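The semantics of a few of these aggregates can be sketched in plain Python over a hypothetical sample of co_ppm readings (the values are illustrative, not taken from the examples above):

```python
from collections import Counter

values = [17, 18, 19, 20, 20, 31]  # hypothetical co_ppm readings

count = len(values)                                  # count(x): number of input rows
count_if_over_19 = sum(1 for v in values if v > 19)  # count_if(v > 19)
hist = dict(Counter(values))                         # histogram(x): value -> occurrence count
distinct = len(set(values))                          # approx_distinct(x) - exact here;
                                                     # Presto uses a HyperLogLog sketch,
                                                     # so its answer is only approximate
```

Note the last line is the one place the sketch diverges from Presto: approx_distinct trades exactness for constant memory, which matters at Big Data scale.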
Airpal (Airbnb)

A web-based query execution tool built on top of Facebook's PrestoDB.

Source: https://blogs.aws.amazon.com/bigdata/post/Tx1BF2DN6KRFI27/Analyze-Data-with-Presto-and-Airpal-on-Amazon-EMR
What's Next?
• Try it yourself https://hub.docker.com/r/tzachl/scylla-and-presto-image
• Performance testing
• Future Optimizations
  ▪ Push query lower
  ▪ Pull data faster
Agenda
• What is Spark?
• Why Spark?
• Scylla + Spark
• What's Next?
What is Spark?
What if I don’t Know?
Find me at the happy hour tonight, we have beer and wine!
Why Spark and Scylla?
• Faster Analytics with In-memory execution
• Close to Real-Time analytics on transactional data
• Iterative
• Resource efficiency for multiple workloads
Spark Architecture
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Spark & Spark RDDs @ ScyllaDB (Resilient Distributed Datasets)
• Understand your data model

CREATE TABLE sensordata (s1data text, s2data text, timestmp timestamp, sregion text, stype text, PRIMARY KEY (s1data, s2data, timestmp));

CREATE TABLE sensordata (s1data text, s2data text, timestmp timestamp, sregion text, stype text, PRIMARY KEY (timestmp, s1data, s2data));
Data Model Impact on Data Load
cassandra-stress user profile=myexample.yaml no-warmup ops\(insert=1\) n=1000000 -rate threads=1000 -node $node
PRIMARY KEY (s1data, s2data, timestmp)
PRIMARY KEY (timestmp, s1data, s2data)
Try to Maintain Data Locality - Co-location
Data Locality - Dedicated Clusters

Review your settings for:
• spark.locality.wait.* (process, node, rack)
  → Default is 3s
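These knobs go in conf/spark-defaults.conf; a sketch with the 3s defaults spelled out (values illustrative - tune down if tasks sit waiting for busy, data-local executors):

```properties
# conf/spark-defaults.conf (defaults shown)
spark.locality.wait          3s
spark.locality.wait.process  3s
spark.locality.wait.node     3s
spark.locality.wait.rack     3s
```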
Network Speed and Latency
Spark & ScyllaDB, CPU Settings
• @Scylla: --cpu-set, --smp
• @Spark: SPARK_WORKER_CORES
• Divide system cores based on expected workload
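For the 32-core split used later in this deck (24 cores for Scylla, 8 for Spark), the settings above could look roughly like this; the core ranges are illustrative, and note Seastar spells the pinning flag --cpuset:

```
# Scylla side: 24 shards, pinned to cores 8-31
scylla --smp 24 --cpuset 8-31

# Spark side, in conf/spark-env.sh: leave the remaining 8 cores to the worker
export SPARK_WORKER_CORES=8
```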
Spark & Scylla, Memory Settings

@Spark
• SPARK_WORKER_MEMORY - per worker node
• spark.executor.memory - set your specific application's memory consumption

@Scylla
• Will just take whatever you give us :)
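Given the 244GB servers described later (64GB per Spark worker, 128GB for Scylla), the two settings might be sketched as follows; the executor figure is an illustrative assumption:

```
# conf/spark-env.sh - cap each worker node
export SPARK_WORKER_MEMORY=64g

# per application, e.g. on submit
spark-submit --conf spark.executor.memory=48g <your-app.jar>
```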
Spark Building and C* Connector
• Environment used in above examples:
  ▪ AWS i2.8xlarge x3 for Scylla & Spark (co-located)
    - Scylla 1.3, Spark 1.6.2
    - Maven 3.3.9
    - Building a Spark standalone cluster on EC2
  ▪ Each server has 32 cores
    - 24 cores → Scylla
    - 8 cores → Spark
  ▪ Each server has 244GB RAM
    - 64GB for each Spark worker
    - 128GB for Scylla
• Use Apache Cassandra connector Ver. 1.6
• Using Spark Standalone Cluster deployment
• Think Resources - CPU and Memory
• Better modeling, easier deployment, faster analytics
• Data locality can be a blessing if managed correctly
  ▪ Scylla's optimal sharding enables data ingestion without compromising analytics performance
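One common way to load the version 1.6 connector mentioned above is spark-shell's --packages flag; the exact artifact coordinates (Scala 2.10 build) are an assumption here:

```
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0
```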
What's Next?
• Try it yourself
  ▪ Here is how to do it: http://www.scylladb.com/kb/scylla-and-spark-integration/
• Performance testing and use cases
Tell us about your experience with Scylla and Spark
Thank You!

Try Presto with Scylla now:
https://hub.docker.com/r/tzachl/scylla-and-presto-image/
Contact: [email protected]