Transcript of: One Tool to Rule Them All: Seamless SQL on MongoDB, MySQL and Redis with Apache Spark
Tim Vaillancourt, Sr. Technical Operations Architect
One Tool to Rule Them All: Seamless SQL on MongoDB, MySQL and Redis with Apache Spark
About Me
• Joined Percona in January 2016
• Sr. Technical Operations Architect for MongoDB
• Previously:
  • EA DICE (MySQL DBA)
  • EA SPORTS (Sys/NoSQL DBA Ops)
  • Amazon/AbeBooks Inc (Sys/MySQL+NoSQL DBA Ops)
• Main technologies: MySQL, MongoDB, Cassandra, Solr, Redis, queues, etc.
• 10+ years tuning Linux for database workloads (off and on)
• NOT an Apache Spark expert
Apache Spark
• “…is a fast and general engine for large-scale data processing”
• Written in Scala, utilises the Akka framework and runs on the Java JVM
• Supports jobs written in Java, Scala, Python and R
• Pluggable datasources for various file types, databases, SaaS offerings, etc.
• Fast and efficient: jobs work on datasources quickly and in parallel
• Optional clustering
  • Master/Slave Spark cluster, ZooKeeper for Master elections
  • Slave workers connect to the Master
  • Master distributes tasks evenly to the available workers
• Streaming and Machine Learning (MLlib) capabilities
• Programmatic and SQL(!) querying capabilities
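To make the programmatic querying mentioned above concrete, here is a minimal sketch of what a PySpark job looks like. It is written as a function that expects an already-initialised SparkContext, so it needs a live Spark installation to actually run; the function and variable names are illustrative, not from the talk:

```python
def word_count(sc, path):
    """Count word occurrences in a text file using Spark's RDD API.

    `sc` is an initialised pyspark.SparkContext; `path` is any path or
    URI a Spark datasource can read (local file, HDFS, S3, ...).
    """
    return (sc.textFile(path)                      # lazy: nothing is read yet
              .flatMap(lambda line: line.split())  # split each line into words
              .map(lambda word: (word, 1))         # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b)     # sum the counts per word, in parallel
              .collect())                          # action: this triggers execution
```

Submitted with `spark-submit`, such a job either runs in the client JVM directly or is distributed across workers by the cluster Master, as described in the next section.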
Apache Spark: Software Architecture
• Jobs go to the Cluster Master, or run in the client JVM directly
• The Cluster Master directs jobs to nodes with available resources via messages
• Cluster Master HA
  • Slaves reconnect to the Master
  • Apache ZooKeeper for true HA
Apache Spark: Hadoop Comparison
• Hadoop MapReduce
  • Batch-based
  • Uses data less efficiently
  • Relatively hard to develop/maintain
• Spark
  • Stream processing
  • Fast, with high parallelism
  • Keeps data in memory as much as possible during jobs
  • Divides work into many lightweight sub-tasks in threads
• Datasources
  • Uses datasource awareness to scale (e.g. indices, shard awareness, etc.)
  • Spark allows processing and storage to scale separately
Apache Spark: RDDs and DataFrames
• RDD: Resilient Distributed Dataset
  • The original API for accessing data in Spark
  • Lazy: does not access data until a real action is performed
  • Spark’s optimiser cannot see inside RDD operations
  • RDDs are slow in Python
• DataFrames API
  • Higher-level API, focused on *what* is being done
  • Has schemas / table-like
  • Interchangeable programmatic and SQL APIs
  • Much easier to read and comprehend
  • Optimises the execution plan
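The contrast between the two APIs can be sketched with the same aggregation written both ways. These are uninvoked functions expecting a live RDD and DataFrame respectively; the column names (`station`, `temp`) are invented for illustration:

```python
def avg_by_station_rdd(rdd):
    """RDD API: the 'how'. Spark's optimiser cannot see inside these lambdas."""
    sums = rdd.map(lambda r: (r["station"], (r["temp"], 1))) \
              .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    # Divide each station's temperature sum by its sample count.
    return sums.map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))

def avg_by_station_df(df):
    """DataFrame API: the 'what'. Spark plans (and optimises) the execution."""
    return df.groupBy("station").avg("temp")
```

The DataFrame version is both shorter and open to optimisation, which is why the talk recommends it over raw RDDs, especially from Python.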
Apache Spark: Datasources
• Provide a pluggable mechanism for accessing structured data through Spark SQL
• At least these databases/stores are supported in some way:
  • MySQL
  • MongoDB
  • Redis
  • Cassandra
  • PostgreSQL
  • HBase
  • HDFS
  • Flat files
  • S3
• In practice: search GitHub, find the .jar file, deploy it!
Apache Spark: SQLContext
• ANSI SQL
  • A 30+ year-old language
  • Easy to understand
  • Almost everyone already knows it
• Spark SQLContext
  • A Spark module for structured data processing, wrapping the RDD API
  • Uses the same execution engine as the programmatic APIs
  • Supports:
    • JOINs/UNIONs
    • EXPLAIN
    • Subqueries
    • ORDER BY / GROUP BY / SORT BY
    • Most datatypes you’d expect
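You don't need a Spark cluster to see what this SQL subset looks like; the same statement shape runs on any reasonably ANSI-compliant engine. A sketch using Python's built-in sqlite3 as a stand-in for SQLContext temp tables (the toy schema and values are invented for illustration):

```python
import sqlite3

# Two tiny tables standing in for registered SQLContext temp tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stations (id INTEGER, name TEXT);
    CREATE TABLE samples  (station_id INTEGER, temp REAL);
    INSERT INTO stations VALUES (1, 'Pier A'), (2, 'Pier B');
    INSERT INTO samples  VALUES (1, 10.0), (1, 14.0), (2, 9.0);
""")

# JOIN + aggregates + GROUP BY + ORDER BY, as listed on the slide.
rows = conn.execute("""
    SELECT s.name, MIN(m.temp), AVG(m.temp), MAX(m.temp), COUNT(*)
    FROM samples m JOIN stations s ON s.id = m.station_id
    GROUP BY s.name
    ORDER BY AVG(m.temp) DESC
""").fetchall()
# rows[0] -> ('Pier A', 10.0, 12.0, 14.0, 2)
```

In Spark the same query text would be passed to `sqlContext.sql(...)`, and it runs on the same execution engine as the programmatic DataFrame calls.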
Apache Spark: Use Cases
• Business Intelligence/Analytics
  • Understand your data
  • Tip: use dedicated replicas for expensive queries!
• Data Summaries and Batch Jobs
  • Perform expensive summaries in the background, save the result
  • Tip: use burstable/cloud hardware for infrequent batch jobs
• Real-Time Stream Processing
  • Process data as it enters your system
So Why Not Apache Drill?
• A schema-free SQL engine for Hadoop, NoSQL and cloud storage
• Drill does not support / work with relational databases (e.g. MySQL) or Redis
• No programmatic-level querying
• No streaming/continuous-query functionality
• I don’t know much about it
The Demo
• Scenario: you run a weather-station data app that stores data in both an RDBMS and a document store
• Goal: summarise the weather-station data stored in the RDBMS and the document store
  • Min water temperature
  • Avg water temperature
  • Max water temperature
  • Total sample count
  • Get the Top-10 stations (based on avg water temperature)
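The summary the demo is after is easy to state in plain Python, which is worth seeing before the Spark version; the real job computes this with Spark SQL over the stores described below. The sample values here are invented:

```python
# Invented per-station water-temperature samples standing in for the demo data.
samples = {"station-1": [9.5, 11.0, 12.5], "station-2": [14.0, 15.0]}

# Min / avg / max water temperature and total sample count, per station.
summary = {
    sid: {
        "min": min(temps),
        "avg": sum(temps) / len(temps),
        "max": max(temps),
        "count": len(temps),
    }
    for sid, temps in samples.items()
}

# Top-10 stations by average water temperature.
top10 = sorted(summary.items(), key=lambda kv: kv[1]["avg"], reverse=True)[:10]
```

In SQL terms this is a `GROUP BY` with `MIN`/`AVG`/`MAX`/`COUNT` aggregates followed by `ORDER BY ... DESC LIMIT 10`, which is exactly the kind of statement SQLContext accepts.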
The Demo
• RDBMS: Percona Server for MySQL 5.7
  • Stores the weather-station metadata (roughly 350 stations: ID, name, location, etc.)
• Document store: Percona Server for MongoDB 3.2
  • Stores the weather time-series sample data (roughly 80,000 samples: various weather readings from the stations)
• In-memory K/V store: Redis 2.8
  • Stores the summarised Top-10 data for fast querying of the min, avg and max temperatures and total sample counts
The Demo
• Apache Spark 1.6.2 cluster
  • 1 x Master
  • 2 x Worker/Slaves
• 1 x PySpark job, submitted on the Master
• 1 x MacBook Pro running 3 x VirtualBox VMs
The Demo: The PySpark Job
[Slide shows the job’s source code, annotated:]
• <- MongoDB collection registered as a SQLContext temp table
• <- New Redis hash schema registered as a SQLContext temp table
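The slide itself is a screenshot, so the code below is only a hedged reconstruction of the job's shape, not the talk's actual source. It is written as an uninvoked function: the connector format string, its options, and the Redis writer all depend on which datasource .jars were deployed, so they are passed in as parameters rather than guessed:

```python
# The Top-10 summary the demo computes, as a SQLContext query.
SUMMARY_SQL = """
    SELECT station_id,
           MIN(temp)  AS min_temp,
           AVG(temp)  AS avg_temp,
           MAX(temp)  AS max_temp,
           COUNT(*)   AS samples
    FROM samples
    GROUP BY station_id
    ORDER BY avg_temp DESC
    LIMIT 10
"""

def run_job(sqlContext, mongo_format, mongo_options, redis_writer):
    """Register the MongoDB collection as a temp table, summarise it in SQL,
    and hand the Top-10 rows to a Redis writer.

    `mongo_format` / `mongo_options` depend on the connector .jar you found
    and deployed; `redis_writer` stands in for whichever Redis datasource or
    client library writes the summary hash.
    """
    df = sqlContext.read.format(mongo_format).options(**mongo_options).load()
    df.registerTempTable("samples")       # <- MongoDB collection as temp table
    top10 = sqlContext.sql(SUMMARY_SQL)   # same engine as the programmatic API
    for row in top10.collect():           # action: triggers the distributed job
        redis_writer(row)                 # <- e.g. one Redis hash per station
```

The column names (`station_id`, `temp`) are assumptions about the sample documents; the point is the flow: load datasource, register temp table, query with plain SQL, write the small result out for fast lookups.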