Meetup#4, Apache Spark as SQL Engine

Apache Spark as SQL Engine: Data Engineering Approach

Dmitry Timofeev, Data Analyst, Wrike Inc.

Wrike is a collaborative task and project management platform

wrike.com

What is Apache Spark?

• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Write applications quickly in Java, Scala, Python, R.
• Combine SQL, streaming, and complex analytics.
• Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

Apache Spark™ is a fast, general, in-memory engine for large-scale data processing.

Where did it come from? Original white papers

• "Spark: Cluster Computing with Working Sets" by Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. University of California, Berkeley

• "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" by Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. University of California, Berkeley

A few words about data analysts
Or why don’t they want to write code and only want to query, query, query?

• We know SQL
• We love ETL

Spark SQL

Spark SQL is Spark's module for working with structured data.

• DataFrames: seamlessly mix SQL queries with Spark programs;

• Connect to any data source the same way: Hive, Avro, Parquet, JSON and JDBC;

• Server mode: connect to Spark SQL with your favorite DB client over JDBC.

Spark SQL: Distributed SQL Engine. Integration with BI tools

Spark SQL: Distributed SQL Engine and my favorite DB tool

Spark SQL: Data sources
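A minimal sketch of what reading different sources "the same way" can look like in PySpark: the same generic `spark.read.format(...).load(...)` call is used for JSON and for Parquet. It assumes Spark 2.0 or later (the `SparkSession` API) with PySpark installed; the sample records, file names, and helper functions are hypothetical and not from the talk.

```python
import json
import os
import tempfile

# Hypothetical sample records; not from the talk.
EVENTS = [
    {"user_id": 1, "action": "login"},
    {"user_id": 2, "action": "create_task"},
]


def write_json_lines(path, records):
    """Write one JSON object per line, the layout Spark's JSON reader expects by default."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load(spark, fmt, path):
    """Read any supported file source through the same generic reader call."""
    return spark.read.format(fmt).load(path)


def demo():
    # Imported here so the helpers above work without a Spark installation.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("sources-sketch")
        .getOrCreate()
    )
    workdir = tempfile.mkdtemp()
    try:
        json_path = os.path.join(workdir, "events.json")
        parquet_path = os.path.join(workdir, "events.parquet")
        write_json_lines(json_path, EVENTS)

        from_json = load(spark, "json", json_path)
        from_json.write.parquet(parquet_path)  # same data, different storage format
        from_parquet = load(spark, "parquet", parquet_path)

        # Both results are DataFrames with the same columns.
        from_json.show()
        from_parquet.show()
    finally:
        spark.stop()
```

Calling `demo()` runs the round trip on a local Spark instance. A database table would go through the same reader with `format("jdbc")` plus connection options, provided the matching JDBC driver is available.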

Spark SQL: Mix SQL queries with Spark programs
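One way this mixing might look in PySpark is sketched below: a DataFrame is exposed to SQL as a temporary view, the aggregation is written in SQL, and the result (again a DataFrame) is post-processed with the DataFrame API. It assumes Spark 2.0 or later with PySpark installed; the `tasks` data, the query, and the function names are hypothetical illustrations, not the speaker's code.

```python
# Hypothetical sample data; not from the talk.
TASKS = [
    (1, "alice", "open"),
    (2, "bob", "done"),
    (3, "alice", "open"),
    (4, "bob", "open"),
]

OPEN_TASKS_SQL = """
    SELECT owner, COUNT(*) AS open_tasks
    FROM tasks
    WHERE status = 'open'
    GROUP BY owner
"""


def busiest_owners(spark, min_open):
    """Aggregate with SQL, then filter and sort with the DataFrame API."""
    tasks = spark.createDataFrame(TASKS, ["id", "owner", "status"])
    tasks.createOrReplaceTempView("tasks")  # expose the DataFrame to SQL

    counts = spark.sql(OPEN_TASKS_SQL)  # the SQL result is itself a DataFrame
    return (
        counts.filter(counts.open_tasks >= min_open)
        .orderBy("owner")
        .collect()
    )


def demo():
    # Imported here so the module loads without a Spark installation.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("sql-mix-sketch")
        .getOrCreate()
    )
    try:
        for row in busiest_owners(spark, min_open=2):
            print(row["owner"], row["open_tasks"])
    finally:
        spark.stop()
```

The analyst-facing part stays plain SQL, while the surrounding program decides where the data comes from and what happens to the result.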

Where did it come from? Original white papers

• "Spark SQL: Relational Data Processing in Spark" by Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia. Databricks Inc.; MIT CSAIL; AMPLab, UC Berkeley

Conclusion

• You can easily create scalable infrastructure;
• Do you dream about cross-DB joins? Welcome!
• Do you want to join logs and regular DBs? Welcome!
• Your analysts are not programmers? Not a problem!
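A rough sketch of joining logs with a database table in one query, assuming Spark 2.0 or later with PySpark installed. The table and column names, the query, and the helper functions are hypothetical; `read_users_from_db` additionally assumes a reachable database and its JDBC driver on Spark's classpath, so the demo substitutes an in-memory DataFrame for the `users` table.

```python
import json
import os
import tempfile

# Hypothetical log records and users; not from the talk.
LOGS = [
    {"user_id": 1, "action": "login"},
    {"user_id": 1, "action": "create_task"},
    {"user_id": 2, "action": "login"},
]
USERS = [(1, "alice"), (2, "bob")]

JOIN_SQL = """
    SELECT u.name, COUNT(*) AS events
    FROM logs l
    JOIN users u ON l.user_id = u.id
    GROUP BY u.name
"""


def read_users_from_db(spark, jdbc_url, properties):
    """Load a `users` table over JDBC (URL and credentials supplied by the caller)."""
    return spark.read.jdbc(url=jdbc_url, table="users", properties=properties)


def events_per_user(spark, logs_path, users_df):
    """Register both sources as views and join them with one SQL query."""
    spark.read.json(logs_path).createOrReplaceTempView("logs")
    users_df.createOrReplaceTempView("users")
    return spark.sql(JOIN_SQL)


def demo():
    # Imported here so the module loads without a Spark installation.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("join-sketch")
        .getOrCreate()
    )
    logs_path = os.path.join(tempfile.mkdtemp(), "logs.json")
    with open(logs_path, "w") as f:
        for record in LOGS:
            f.write(json.dumps(record) + "\n")
    try:
        # Stand-in for read_users_from_db(...), which needs a real database.
        users_df = spark.createDataFrame(USERS, ["id", "name"])
        events_per_user(spark, logs_path, users_df).show()
    finally:
        spark.stop()
```

The query itself does not care where each view comes from, which is what lets analysts write the join in plain SQL.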

Your questions?

To make our team more awesome we need:

• UX Data Analyst
• Billing Operations Analyst
• Data Engineer

hr-spb@team.wrike.com
